Live Business Chat (LBC)
2016 Aug 28, 07:59:13 am *
Welcome, Guest. Please login or register.
Did you miss your activation email?

Login with username, password and session length
News: The MCC list here at LBC has moved to badon's blog at the CC Forum (CCF).
   Home   Help Search Calendar Login Register  
Pages: [1]   Go Down
Author Topic: possibility of new PAR3 file  (Read 4080 times)
0 Members and 1 Guest are viewing this topic.
Yutaka Sawada
Capitalist Pig
Offline Offline

Posts: 830

« on: 2012 Jun 27, 06:10:45 pm »

 I designed a basic packet construction. I may change hash/checksum algorithm, and modify format of some data later. For example, I will update file hash from SHA-2 to SHA-3, when SHA-3 is selected.

The number of Input file slices is limited upto 268,435,456 slices. (2^28)
The number of Total(source & parity) blocks is limited upto 65,536 blocks. (2^16)
The number of Parity blocks is limited upto 32,767 blocks. (2^15-1)
The number of slices in a block is limited upto 16,384 slices. (2^14)
There are max 128 packets for same type data.

 Basically, PAR3 file is smaller (around 80%) than PAR2 file. This is because hash size of slice is different, 16-bytes in PAR3 vs 20-bytes in PAR2. As PAR3 can compress filenames, PAR3 file will be smaller when there are many source files. When there are dupe files or dupe slices, PAR3 detects them and omit some redundant data. When a slice contains 1-3 byte pattern, PAR3 detects it and use the information at recovery.

 Dupe files/slices are detected at creation time. Input files are distinguished (sorted) by 256-bit hash and 96-bit checksums. Input file slices are compared by 96-bit checksum and 160-bit hash. (127-bit is saved in PAR3 file.)

 I made a sample application to write PAR2 & PAR3 index files. To run on 32-bit OS, it supports upto 4,194,304 input file slices. (2^22) This requires memory of "100 * number of slices" bytes or more. If someone is interested in how PAR3 file will become, test with it.

 The command to cretae PAR3 index file is like below.
par3j.exe c [options] <par file> [input files]

If you don't want to create PAR3 index file (see size only), set "-in" option. If you want to create both PAR2 & PAR3 index files, set "-in" and "-p2" options.

 Some questions to testers:

When you use Multi-Core/CPU, is calculation of hash/checksum balanced ?
 It is good to be less difference in "main-thread ?sec" and "total of sub-thread ?sec". At this time, main-thread calculate checksum of slices, and sub-thread read data and calculate hash of files. While MD5 is very fast and balanced with CRC-32 at PAR2, SHA-2 is much slower than CRC-32 & CRC-64. This may be improved later, when hash/checksum algorithm will be changed.

Is "dupe detection" slow ?
 When there are more than 1,000,000 slices, it is too slow on my PC. Because this feature is useless in most case like below, I may remove this.

Is it worth to detect "dupe files" ?
 This question is same as, are there same files in a set ? Normally I don't put dupe files in a folder, so dupe file detection is useless. I'm not sure this is worth to do by using 1-bit in slice hash. I may remove this, if there is no dupe files in users' real files.

Is it worth to detect "dupe slices" ?
 This is a little more useful than "dupe file detection", as partially same data is possible. Even though randam data, when there are many files, last padded slices in each files happen to be same data. For example, when there are 257 files of 4097-bytes and slice size is 4096-bytes, there is at least one dupe slice. But, the chance of dupe slices is very little, almost ignorable in real case. I'm not sure this is worth to do by using 1-bit in slice hash. I may remove this, if there is no dupe slices in users' real files.

Is it worth to detect "byte pattern" ?
 When slice size is very small, some slices may contain same data only like, all 0x00 or 0xFF. When slice size is large or source files are compressed, this feature is almost nothing. 2-bytes or 3-bytes pattern are very rare case except sample test data. I'm not sure this is worth to do by using 2-bit in slice hash. Without byte pattern detection, possible number of slices will increased to 2^29 or 2^30. I may remove this, if pattern detection doesn't work for users' real data.

 Sample result and description of each section:
In the sample below, there are 10 mpeg video files of 145MB.

Input File count        : 10
Input File total size   : 152119336
Input File Slice size   : 1024
Input File Slice count  : 148564
Slice count per block   : 3
Source Block count      : 49522
Recovery Block max      : 16014
Recovery Block count    : 0
Redundancy rate         : 0.0%
Recovery File count     : 0
Slice distribution      : 0, uniform
Packet Repetition limit : 0
Replace range           : 512

100% : calculate checksum
main-thread             3.568 sec
sub-thread (read)       0.750 sec
sub-thread (hash)       7.860 sec
total time              8.781 sec

file sizes compressed from 80 to 31
File Description data: compressed size = 391, rate = 88.8%
split num = 1, split size = 391

dupe file = 0, dupe slice = 949
1-byte pattern = 2, 2-byte pattern = 0, 3-byte pattern = 0

Slice Checksum data: compressed size = 2365636, rate = 99.5%
split num = 128, split size = 18482

size of non-compressed filenames = 120
size of compressed filenames = 34
Filename data: compressed size = 36, rate = 28.1%
split num = 1, split size = 36

dupe detection          1.156 sec
filename compression    0.000 sec
PAR3 index file size = 2370933

1st section:
By setting 1KB slice size, there are 148,564 slices. By putting 3 slices in a block, there are 49,522 blocks.

2nd section:
In calculating hash/checksum, file hash (SHA-2) is double slower than slice checksum (CRC-32&64). This multi-threading doesn't use CPU's full power. (main-thread need to wait the end of sub-thread.)

3rd section:
Files sizes (8-byte integer * 10) are compressed, but it is small data already.

4th section:
Though the mpeg files are all different, there are same 1KB data somewhere. The rate of dupe is 949/148,564 = 0.639%.

5th section:
Because dupe slices are few, compression doesn't work so much. But this is still better than normal compression algorithm like LZMA. Hash/Checksum is almost random data and is very hard to compress.

6th section:
Filenames are compressed by PPMd, and become very small. But this doesn't make much difference, when file names are short.

7th section:
PAR3's index file size is roughly predictable as "number of slices * 16". 148,564 * 16 = 2,377,024 and this is close to actual file size.

* par3j_new_2012-06-25.7z (61.42 KB - downloaded 216 times.)
Yutaka Sawada
Capitalist Pig
Offline Offline

Posts: 830

« Reply #1 on: 2012 Jul 15, 10:06:55 pm »

 I modified new PAR3 specification a bit to smaller and simpler. I omitted detection of "byte pattern" from new PAR3 idea. It will detect all the same bytes in a slice as "uniform slice". In a file format, the theoretical limit number of input file slices is removed. (the max will be extended by keeping compatibility, when future PC has several TB ram.) As a real application over current PC, there is a limit still. (2^22 maybe max on 32-bit OS) I joined two packet types "File Description packet" and "Slice Checksum packet", because they will change synchronously. When file data is changed, the file's hash and checksums of the slices in the file are affected in same time. This may let a number of packets to be one less.

* par3j_new_2012-07-14.7z (61.24 KB - downloaded 175 times.)
Yutaka Sawada
Capitalist Pig
Offline Offline

Posts: 830

« Reply #2 on: 2012 Aug 25, 06:03:17 pm »

 I modified new PAR3 specification again. Generator polynomial of CRC-128 is changed. Max split number at containing in multiple packets becomes 64 instead of 128 ago. (this change makes less packets for large data.) Checksum and size of File Description data are saved in Main packet instead of each File Description packets. (this change makes smaller File Description packets.) Recovery Set ID is calculated as a CRC-32 of the body of Main packet. (this is simple and easy to check integrity of packets.)

 I need to design how to make parity packet group. When File Description data is large and there are many Recovery packets, 192 parity (File Description) packets will be made from 64 original File Description packets. These parity packets are used as repeated packets. Though PAR2 repeats same packets in a PAR2 file, PAR3 will repeat parity packets in a PAR3 file. I may use simple 8-bit Reed-Solomon Erasure Codes like PAR1 for those parity packets.

* par3j_new_2012-08-24.7z (61.31 KB - downloaded 128 times.)
Yutaka Sawada
Capitalist Pig
Offline Offline

Posts: 830

« Reply #3 on: 2012 Sep 19, 10:24:21 pm »

 I modified new PAR3 specification a bit. The number of split is saved instead of total size. This change will make slightly smaller PAR files. (only several bytes) I wrote a description of how to repeat some groups of packets. The packet repetition becomes less than PAR2, because stability of packets is improved. As a container of packets, PAR3 avoids too large packets and too many packets.

 For debug, setting /lp2 or /lp3 option creates additional index file with extra vital packet group.

 Now, I need to design the most important part. It is the construction of Recovery Slice packet. I have some interesting idea and want to test them.

* par3j_new_2012-09-20.7z (63.05 KB - downloaded 146 times.)
Serious Business
Offline Offline

Posts: 144

« Reply #4 on: 2012 Sep 20, 07:16:34 am »

I believe that PAR3 is a myth...
Pages: [1]   Go Up
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!
Page created in 0.11 seconds with 18 queries.