Topic: Some problems for next PAR design
Yutaka Sawada
« on: 2015 Aug 28, 06:57:31 pm »

There are some unsolved problems in PAR2 (or a future PAR3). I will write some of them here. If someone has a good idea or a possible solution, we can discuss it.


1) A PAR2 file cannot recover (reconstruct or repair) the original PAR2 files themselves.

While the PAR2 specification defines PAR2 packets, it doesn't define the order or number of packets in a PAR2 file. So it's impossible to determine the original state of a damaged PAR2 file. For example, QuickPar and MultiPar make different PAR2 files from the same source files, even though they are compatible at the packet level. This is why MultiPar (and QuickPar) has no feature to repair PAR2 files, even though it can re-create new/extra PAR2 files for the same source files.


2) The order of source files (source blocks) should be independent of the filename.

PAR2 cannot follow a change of filename, because the order of source files depends on the filenames. When a filename is changed (even when the file's data is not changed), the order of source blocks changes, and the recovery blocks have to be calculated again.

To solve this problem, I plan to use the hash values of the files to determine the order of source files. But this causes another problem: all files' hash values must be calculated before the order is known, and that takes longer than sorting by filename.
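To illustrate the idea (this is not MultiPar's actual code; the choice of MD5 and the chunk size are assumptions for the example), a minimal sketch of content-based ordering:

```python
# Sketch: order source files by a hash of their contents instead of by filename.
# The hash (MD5) and chunk size are arbitrary example choices, not part of any
# PAR specification.
import hashlib

def file_hash(path, chunk_size=1 << 20):
    """MD5 of the whole file, read in 1 MiB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.digest()

def order_by_content(paths):
    """Renaming a file does not change its position; editing its data does."""
    return sorted(paths, key=file_hash)
```

The drawback mentioned above is visible here: order_by_content() has to read every file completely before the order is known.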


3) I cannot understand the algorithm of the new Reed-Solomon Codes for PAR3.

This is just a lack of talent on my part, hehe. While I like simple calculations, that mathematical theory is too high-level for me... Though persicum helped me very much to understand the mathematics some time ago, he has moved to another field now. If someone can understand the papers below, I will ask some questions about the implementation.

"Efficient erasure decoding of Reed-Solomon codes" by Frederic Didier
"Fast Ecoding/Decoding Algorithms for Reed-Solomon Erasure Codes" by Sian-Jheng Line, Wei-Ho Chung
"Novel Polynomial Basis and Its Application to Reed-Solomon Erasure Codes" by Sian-Jheng Line, Wei-Ho Chung, Yunghsiang S. Han
"Efficient Frequency-Domain Decoding Algorithms for Reed-Solomon Codes" by Sian-Jheng Lin, Tareq Y. Al-Naffouri, Yunghsiang S. Han


4) Some Error Correction Codes cannot add parity blocks later.

In PAR1/PAR2 (James Plank-style Reed-Solomon erasure codes), it's possible to add extra parity blocks after creating the original parity blocks. This is impossible with some kinds of classic Reed-Solomon error correction codes. While I don't understand the reason or a possible solution, it seems to be caused by the new theory that gives the speed. The difference may become a problem when such an algorithm is selected.
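Why adding parity later works in the Plank-style construction can be sketched as follows: each parity block is one row of a generator matrix and is computed from the source blocks alone, so another row can be appended at any time without touching the existing parity. The toy below works in GF(2^8) with polynomial 0x11D and a Vandermonde-style coefficient pattern; these are assumptions for the illustration (PAR2 itself uses GF(2^16)), and the toy does not guarantee that every combination of blocks is recoverable.

```python
# Toy sketch: each parity block is an independent row of a generator matrix over
# GF(2^8), so an extra row (extra parity block) can be computed later from the
# source blocks alone. Field, polynomial (0x11D) and coefficients are example
# choices, not the real PAR2 parameters.
GF_EXP, GF_LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D                      # reduce by the primitive polynomial
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else GF_EXP[GF_LOG[a] + GF_LOG[b]]

def parity_block(sources, row):
    """Row 'row' of the code: XOR of source[i] * alpha^(i*row) over all i."""
    out = bytearray(len(sources[0]))
    for i, block in enumerate(sources):
        coeff = GF_EXP[(i * row) % 255]
        for k, byte in enumerate(block):
            out[k] ^= gf_mul(byte, coeff)
    return bytes(out)

sources = [b"ABCDEFGH", b"12345678", b"abcdefgh"]
parity = [parity_block(sources, r) for r in range(2)]   # original parity blocks
extra = parity_block(sources, 2)                        # added later, nothing recomputed
```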


5) How to detect partially damaged blocks by using hash values.

A source file is split into multiple slices, and each slice becomes a source block in PAR2. Because PAR2 stores a hash value for each slice, it's possible to find complete slices in a source file. But it cannot locate where a slice is damaged, even when the damage is only one byte. This becomes a problem when a user wants to re-join split files that contain a little damage. If there are enough parity blocks, it's possible to re-join them by recovering the damaged blocks. If there are not enough parity blocks, it's impossible to re-join them.

If PAR3 contained multiple hash values for each slice, it might be able to locate the damaged part within a slice. But it would require more space for the additional hash values.
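A rough sketch of that idea (the slice size, the number of sub-slice hashes, and the use of MD5 are assumptions for the example, not a proposed PAR3 format): store one hash per slice as now, plus a few hashes of sub-slices, so a mismatch can be narrowed down to a sub-slice instead of only to the whole slice.

```python
# Sketch: a per-slice hash plus a few sub-slice hashes to narrow down damage.
# Slice size, sub-slice count and the MD5 hash are arbitrary example choices.
import hashlib

SLICE_SIZE = 64 * 1024      # example slice size
SUB_SLICES = 4              # extra hashes stored per slice (the space cost)

def slice_hashes(data):
    """Return (md5 of whole slice, [md5 of each sub-slice])."""
    sub_size = (len(data) + SUB_SLICES - 1) // SUB_SLICES
    subs = [hashlib.md5(data[i:i + sub_size]).digest()
            for i in range(0, len(data), sub_size)]
    return hashlib.md5(data).digest(), subs

def locate_damage(slice_data, stored):
    """Return indices of sub-slices that do not match the stored hashes."""
    whole, subs = slice_hashes(slice_data)
    if whole == stored[0]:
        return []                                    # slice is intact
    return [i for i, (h, s) in enumerate(zip(subs, stored[1])) if h != s]
```

The extra cost stated above is also visible here: SUB_SLICES additional 16-byte hashes per slice.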


6) Changing or adding a source file requires recreating the whole recovery files.

This is a weak point of FEC (Forward Error Correction). Even when only one small file is changed, there is no shortcut like a differential backup, so a plain backup is better for frequently updated files. With some algorithms like LDPC or fountain codes, it may be possible to update a small part of the recovery files. To make recreation fast, the creation time should be less than the recovery time.

From this point of view, Plank-style Reed-Solomon erasure codes have a problem, because creation and recovery take basically the same time. (With many lost blocks, recovery requires extra time to invert the matrix.) LDPC is an example with different timing: its creation time is much shorter than its recovery time. While LDPC's creation is amazingly fast, its recovery was sometimes slower than Reed-Solomon codes. It might depend on the decoding algorithm.


7) Compression of filenames

When filenames are long or share the same sub-directory, compressing them may save disk space. For example, RSC32 compresses filenames with delta encoding. Filename compression is bad for renaming files later, because a single rename may affect all compressed filenames. Also, when the order of source files is independent of their filenames, it's difficult to compress them (the filenames would have to be compressed without knowing their order).
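As a rough illustration of delta encoding of filenames (I don't know RSC32's exact format, so this is a generic front-coding example): each name stores only the length of the prefix it shares with the previous name plus the differing suffix. It also shows why the names need a known order, and why one rename disturbs its neighbours.

```python
# Sketch of front coding (prefix delta encoding) for a sorted list of filenames.
# This is a generic example, not the actual RSC32 on-disk format.

def front_encode(names):
    """Encode each name as (shared prefix length with the previous name, suffix)."""
    encoded, prev = [], ""
    for name in names:
        common = 0
        while common < min(len(prev), len(name)) and prev[common] == name[common]:
            common += 1
        encoded.append((common, name[common:]))
        prev = name
    return encoded

def front_decode(encoded):
    """Rebuild the original list of names."""
    names, prev = [], ""
    for common, suffix in encoded:
        prev = prev[:common] + suffix
        names.append(prev)
    return names

# Example: sorted names share long directory prefixes, so the suffixes are short.
names = sorted(["photos/2015/a.jpg", "photos/2015/b.jpg", "photos/2016/a.jpg"])
assert front_decode(front_encode(names)) == names
```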


8) Fragmented file I/O

When the file data is larger than the memory size, an application has to keep and process partial file data a piece at a time. If an ECC algorithm requires all source data at once, more blocks cause more fragmented file access. This may cause a serious slowdown on an HDD. (There is no problem on an SSD.)

For example, suppose a 2GB buffer is used to load 6GB of file data, so each block is read in 3 passes, one third at a time:

- 2000 source blocks: the block size is 3MB, so each pass reads 1MB of a block and skips the next 2MB, for 2000 * 3 = 6000 file accesses.
- 20000 source blocks: the block size is 300KB, so each pass reads 100KB and skips the next 200KB, for 20000 * 3 = 60000 file accesses.
- 200000 source blocks: the block size is 30KB, so each pass reads 10KB and skips the next 20KB, for 200000 * 3 = 600000 file accesses.
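That access pattern looks roughly like the sketch below (the numbers mirror the example above; process_stripe() is a hypothetical placeholder for the real encoding work):

```python
# Sketch of the strided access pattern described above: with a buffer that holds
# only 1/3 of the file, each pass reads 1/3 of every block and skips the rest.
# process_stripe() is a hypothetical placeholder for the real encoding work.

def read_in_stripes(path, block_size, block_count, passes=3):
    stripe = block_size // passes
    accesses = 0
    with open(path, "rb") as f:
        for p in range(passes):
            for b in range(block_count):
                f.seek(b * block_size + p * stripe)   # jump into this block
                data = f.read(stripe)                 # read one stripe of it ...
                accesses += 1                         # ... then skip to the next block
                # process_stripe(b, p, data)
    return accesses          # e.g. 2000 blocks * 3 passes = 6000 seek+read pairs
```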

Though this isn't a serious problem for PAR1/PAR2 (the blocks are not so many), it may be one for PAR3. As a solution to the problem, RSC32 uses temporary files to re-arrange the many blocks of the source files. While this is simple and fast with many blocks, it requires additional disk space (the same size as the source files).
Nyan
« Reply #1 on: 2015 Aug 29, 06:05:58 pm »

I'm very new to understanding erasure codes, in fact I only started reading about them a few weeks ago, so please excuse my newbishness.

1) How important is this?
A smart program could probably recover a PAR2 file if the packet headers are not corrupted. Even if not, using the same parameters, a different program could still make PAR2 volumes that are compatible (although not bit-for-bit exact).

Allowing some differences may be useful, as different programs can have different needs.


2) The PAR2 specification requires that the file IDs be sorted. If we remove that restriction, then you could rename files without having to re-create the recovery data.

Does it need to be ordered in a certain way? If #1 above is important, then maybe so.
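For reference, as I read the PAR2 specification, the File ID that gets sorted is itself derived from the filename: it is the MD5 of the MD5-16k hash, the file length and the filename as they appear in the File Description packet (name padding details are ignored in this sketch, so it is illustrative rather than byte-exact). That is why a rename changes the ID and the sort order even when the data is untouched:

```python
# Sketch of how a PAR2 File ID depends on the filename (MD5 over the MD5-16k
# hash, the 8-byte length, and the name, as in the File Description packet;
# name padding is ignored here, so this is illustrative rather than byte-exact).
import hashlib
import os
import struct

def par2_file_id(path):
    with open(path, "rb") as f:
        md5_16k = hashlib.md5(f.read(16384)).digest()    # hash of the first 16 kB
    length = struct.pack("<Q", os.path.getsize(path))    # little-endian 64-bit length
    name = os.path.basename(path).encode()
    return hashlib.md5(md5_16k + length + name).digest()

# Renaming a file changes `name`, hence the File ID, hence the sorted order,
# even though the contents (md5_16k, length) are identical.
```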


7) The obvious solution is to break the compression into pieces. So when you update something, you only need to recompress the piece. If the piece happens to grow larger, you may need to rewrite the file to push everything down though.


8) It can still be a problem on SSD because SSDs are optimal with page-size reads (e.g. 4KB at a time). Even if an SSD doesn't suffer from bad seek times, there are still overheads in the SATA protocol which make random reads (and writes, since the recovery data needs to be written somewhere) slower than sequential read/write.

If the cost of seeking is high, a program can just do a sequential read and generate fewer recovery blocks at a time. The downside of this is that multiple read passes are needed.
A hybrid approach could make it less bad, for example if the program uses an optimal block size for reading.

Somehow rearranging (transposing?) sounds good, but I don't like the idea of temporary files.
Yutaka Sawada
« Reply #2 on: 2015 Aug 31, 09:52:25 pm »

Quote from: Nyan
so please excuse my newbishness.

No problem. The listed items are challenges for the next generation of Parchive. (Some of them won't be solved even in PAR3.) You can make your own PAR2 client without considering them.


Quote
1) How important is this?

The problem isn't so serious, because the user can recreate the PAR2 files again. The only difference is that recreating a whole damaged PAR2 file takes longer than recreating a damaged or lost packet. Some users requested repair of PAR2 files before. Technically, it's possible to reconstruct a damaged PAR2 file if its construction (the order of packets) is known. But I was too lazy to research and implement it.


Quote
Does it need to be ordered in a certain way?

Yes, the files must be ordered strictly. This is important for compatibility. When a set of PAR files is created from the same set of source files with the same settings, the resulting PAR files must be compatible.

For example, suppose you got files and PAR files from a friend, but they were damaged and there was not enough redundancy to repair them. You can ask for more PAR files to reach enough redundancy, because you know that the incoming PAR files will be compatible with the current PAR2 files. It's important that recreated (or additional) PAR files are compatible with the original PAR files.
Nyan
« Reply #3 on: 2015 Sep 04, 03:14:08 am »

Quote from: Yutaka Sawada
The problem isn't so serious, because the user can recreate the PAR2 files again. The only difference is that recreating a whole damaged PAR2 file takes longer than recreating a damaged or lost packet. Some users requested repair of PAR2 files before. Technically, it's possible to reconstruct a damaged PAR2 file if its construction (the order of packets) is known.
I didn't think that'd be a common use case, but fair point.
It should already be possible to repair a PAR2 file if we assume the packet headers are not sufficiently damaged (check MD5 of packet).
It may even be possible to repair a file with damaged packet headers, if you're willing to scan for the signature, but I can see that having a strict structure makes things easier and more reliable.
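Scanning for the signature could look roughly like this (a minimal sketch, not a full implementation; it relies on the packet layout described in the PAR2 spec, where the packet MD5 covers everything from the Recovery Set ID at offset 32 to the end of the packet):

```python
# Sketch: salvage intact packets from a damaged PAR2 file by scanning for the
# packet signature and verifying each packet's MD5. Per the PAR2 spec, the MD5
# field covers the packet from the Recovery Set ID (offset 32) to its end.
import hashlib

MAGIC = b"PAR2\x00PKT"

def salvage_packets(data):
    """Return (offset, length) of every packet whose MD5 verifies."""
    found, pos = [], data.find(MAGIC)
    while pos != -1:
        if pos + 64 <= len(data):                              # full 64-byte header present
            length = int.from_bytes(data[pos + 8:pos + 16], "little")
            stored_md5 = data[pos + 16:pos + 32]
            if 64 <= length <= len(data) - pos:
                if hashlib.md5(data[pos + 32:pos + length]).digest() == stored_md5:
                    found.append((pos, length))
        pos = data.find(MAGIC, pos + 1)
    return found
```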

Quote from: Yutaka Sawada
For example, suppose you got files and PAR files from a friend, but they were damaged and there was not enough redundancy to repair them. You can ask for more PAR files to reach enough redundancy, because you know that the incoming PAR files will be compatible with the current PAR2 files. It's important that recreated (or additional) PAR files are compatible with the original PAR files.
Good point. It sounds like the implementations just need to be compatible though, not necessarily sorted.
Sorting brings in some problems:
- filename-based ordering doesn't like renaming
- hash-based ordering can be expensive
- neither works well with adding/removing files in an already-created Parchive

Since all you have for a file is its file name and contents, you can't sort by anything else.

Generating additional parity is a challenge, but it should be possible if you have at least an index of sorts (detailing the internal ordering used). I can't think of a way to do this without an index, however.

---

In my view, a feature that would be nice to have is streaming: you could create parity by streaming the source data through, provided you have enough RAM. It's difficult to do, though, and it does require dealing with your point 8 above.

Thank you for your reply. :)