I have designed a basic packet construction. I may change the hash/checksum algorithms and modify the format of some data later. For example, I will update the file hash from SHA-2 to SHA-3 when the SHA-3 winner is selected.
The number of input file slices is limited to 268,435,456 (2^28).
The number of total (source & parity) blocks is limited to 65,536 (2^16).
The number of parity blocks is limited to 32,767 (2^15 - 1).
The number of slices in a block is limited to 16,384 (2^14).
There are at most 128 packets for the same type of data.
Basically, a PAR3 file is smaller (around 80% of the size) than a PAR2 file. This is mainly because the per-slice hash size differs: 16 bytes in PAR3 vs. 20 bytes in PAR2. Since PAR3 can compress filenames, a PAR3 file will be smaller still when there are many source files. When there are duplicate files or duplicate slices, PAR3 detects them and omits the redundant data. When a slice consists of a repeating 1-3 byte pattern, PAR3 detects it and uses that information at recovery time.
Duplicate files/slices are detected at creation time. Input files are distinguished (sorted) by a 256-bit hash and 96-bit checksums. Input file slices are compared by a 96-bit checksum and a 160-bit hash. (127 bits of the hash are saved in the PAR3 file.)
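The two-level comparison above (cheap checksum first, full hash to confirm) can be sketched like this. This is a simplified illustration, not par3j's actual code: CRC-32 stands in for the 96-bit checksum and SHA-256 for the 160-bit hash, and the widths differ from the real format.

```python
import hashlib
import zlib

def find_dupe_slices(slices):
    """Group identical slices: fast checksum pass, then hash to confirm.

    'slices' is a list of bytes objects. Returns a dict mapping a slice
    hash to the list of slice indices holding identical data.
    """
    by_checksum = {}
    for i, data in enumerate(slices):
        # Cheap first pass (CRC-32 here; PAR3 uses 96-bit checksums).
        by_checksum.setdefault(zlib.crc32(data), []).append(i)

    dupes = {}
    for group in by_checksum.values():
        if len(group) < 2:
            continue  # a unique checksum cannot hide a duplicate
        # Expensive second pass: full hash rules out checksum collisions.
        by_hash = {}
        for i in group:
            by_hash.setdefault(hashlib.sha256(slices[i]).digest(), []).append(i)
        for digest, idxs in by_hash.items():
            if len(idxs) > 1:
                dupes[digest] = idxs
    return dupes

slices = [b"A" * 1024, b"B" * 1024, b"A" * 1024]
print(find_dupe_slices(slices))  # slices 0 and 2 are duplicates
```

The point of the two passes is that the hash is only computed for slices whose checksums already collide, so unique slices stay cheap.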
I made a sample application that writes PAR2 & PAR3 index files. To run on a 32-bit OS, it supports up to 4,194,304 input file slices (2^22). It requires memory of "100 * number of slices" bytes or more. If you are interested in how a PAR3 file turns out, test with it.
The command to create a PAR3 index file is like below.
par3j.exe c [options] <par file> [input files]
If you don't want to create a PAR3 index file (to see the size only), set the "-in" option. If you want to create both PAR2 & PAR3 index files, set the "-in" and "-p2" options.
Some questions for testers:
When you use a multi-core CPU, is the hash/checksum calculation balanced?
It is good when there is little difference between "main-thread ?sec" and the total of the "sub-thread ?sec" values. Currently, the main thread calculates the checksums of slices, and the sub-threads read data and calculate the hashes of files. While MD5 is very fast and balanced with CRC-32 in PAR2, SHA-2 is much slower than CRC-32 & CRC-64. This may improve later, when the hash/checksum algorithms are changed.
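The division of labor described above can be sketched as follows. This is an illustrative Python sketch, not par3j's actual code: the main thread computes per-slice checksums (CRC-32 here) while one sub-thread hashes the whole file (SHA-256 standing in for SHA-2). The imbalance shows up as the main thread finishing its checksums early and then idling in `future.result()` while the slower hash is still running.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

def file_hash(data):
    # Sub-thread job: whole-file hash (SHA-2 family in the sketch).
    return hashlib.sha256(data).hexdigest()

def slice_checksums(data, slice_size):
    # Main-thread job: per-slice checksums (CRC-32 stands in for
    # the CRC-32 & CRC-64 pair used by par3j).
    return [zlib.crc32(data[i:i + slice_size])
            for i in range(0, len(data), slice_size)]

data = bytes(range(256)) * 4096  # 1 MiB of sample data

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(file_hash, data)    # runs in the sub-thread
    checksums = slice_checksums(data, 4096)  # runs in the main thread
    digest = future.result()                 # main thread waits here

print(len(checksums), digest[:16])  # 256 slice checksums, digest prefix
```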
Is "dupe detection" slow?
When there are more than 1,000,000 slices, it is too slow on my PC. Because this feature is useless in most cases, as described below, I may remove it.
Is it worth detecting "dupe files"?
This question is the same as: are there identical files in a set? Normally I don't put duplicate files in a folder, so duplicate file detection is useless. I'm not sure this is worth doing at the cost of 1 bit in the slice hash. I may remove this if there are no duplicate files in users' real files.
Is it worth detecting "dupe slices"?
This is a little more useful than duplicate file detection, since partially identical data is possible. Even with random data, when there are many files, the last padded slices of the files can happen to contain the same data. For example, when there are 257 files of 4097 bytes each and the slice size is 4096 bytes, there is at least one duplicate slice: each file's last slice holds 1 byte of data plus padding, and there are only 256 possible byte values, so by the pigeonhole principle two of the 257 tails must match. But the chance of duplicate slices is very small, almost ignorable in real cases. I'm not sure this is worth doing at the cost of 1 bit in the slice hash. I may remove this if there are no duplicate slices in users' real files.
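The 257-file example can be checked with a quick sketch. Zero padding of the final partial slice is an assumption here (the argument only needs the padding to be deterministic), and the helper name is mine, not par3j's.

```python
import random

SLICE_SIZE = 4096

def padded_last_slice(data):
    """Final partial slice of a file, zero-padded up to the slice size."""
    remainder = len(data) % SLICE_SIZE
    tail = data[-remainder:] if remainder else data[-SLICE_SIZE:]
    return tail + b"\x00" * (SLICE_SIZE - len(tail))

# 257 random files of 4097 bytes each: each last slice is 1 data byte
# + 4095 padding bytes, so only 256 distinct tails exist (pigeonhole).
random.seed(1)
files = [random.randbytes(4097) for _ in range(257)]
tails = {padded_last_slice(f) for f in files}
print(len(tails) < 257)  # True: at least two tails collide
```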
Is it worth detecting "byte patterns"?
When the slice size is very small, some slices may contain only repeated data, such as all 0x00 or all 0xFF. When the slice size is large or the source files are compressed, this feature finds almost nothing. 2-byte or 3-byte patterns are very rare except in sample test data. I'm not sure this is worth doing at the cost of 2 bits in the slice hash. Without byte pattern detection, the possible number of slices would increase to 2^29 or 2^30. I may remove this if pattern detection doesn't work for users' real data.
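Recognizing a slice that is a pure repetition of a 1-3 byte pattern is cheap. A minimal sketch (my own illustration, not par3j's actual implementation):

```python
def detect_pattern(slice_data, max_len=3):
    """Return the shortest repeating byte pattern (up to max_len bytes)
    that fills the whole slice, or None if there is no such pattern."""
    for n in range(1, max_len + 1):
        pattern = slice_data[:n]
        if len(slice_data) % n == 0 and pattern * (len(slice_data) // n) == slice_data:
            return pattern
    return None

print(detect_pattern(b"\x00" * 4096))      # b'\x00' (all-zero slice)
print(detect_pattern(b"\xab\xcd" * 2048))  # b'\xab\xcd' (2-byte pattern)
print(detect_pattern(b"random-ish data"))  # None
```

Storing just the pattern instead of the slice's hash is what makes the 2 bits in the slice hash pay off, when such slices actually occur.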
Sample result and description of each section:
In the sample below, there are 10 MPEG video files totaling 145 MB.
Input File count : 10
Input File total size : 152119336
Input File Slice size : 1024
Input File Slice count : 148564
Slice count per block : 3
Source Block count : 49522
Recovery Block max : 16014
Recovery Block count : 0
Redundancy rate : 0.0%
Recovery File count : 0
Slice distribution : 0, uniform
Packet Repetition limit : 0
Replace range : 512
100% : calculate checksum
main-thread 3.568 sec
sub-thread (read) 0.750 sec
sub-thread (hash) 7.860 sec
total time 8.781 sec
file sizes compressed from 80 to 31
File Description data: compressed size = 391, rate = 88.8%
split num = 1, split size = 391
dupe file = 0, dupe slice = 949
1-byte pattern = 2, 2-byte pattern = 0, 3-byte pattern = 0
Slice Checksum data: compressed size = 2365636, rate = 99.5%
split num = 128, split size = 18482
size of non-compressed filenames = 120
size of compressed filenames = 34
Filename data: compressed size = 36, rate = 28.1%
split num = 1, split size = 36
dupe detection 1.156 sec
filename compression 0.000 sec
PAR3 index file size = 2370933
With a 1 KB slice size, there are 148,564 slices. By putting 3 slices in a block, there are 49,522 blocks.
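The slice and block counts follow simple ceiling arithmetic. Slices are counted per file, since each file's last slice may be partial; the helper names are mine, and the 10 individual file sizes are not listed in the sample output, so only the block math is reproduced exactly here.

```python
import math

def count_slices(file_sizes, slice_size):
    # Each file is sliced independently; its last slice may be partial.
    return sum(math.ceil(size / slice_size) for size in file_sizes)

def count_blocks(slice_count, slices_per_block):
    return math.ceil(slice_count / slices_per_block)

# Two 4097-byte files at 4096-byte slices: 2 full + 2 partial slices.
print(count_slices([4097, 4097], 4096))  # 4
# The sample run: 148,564 slices grouped 3 per block.
print(count_blocks(148564, 3))  # 49522
```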
In calculating hash/checksum, the file hash (SHA-2) is about twice as slow as the slice checksums (CRC-32 & CRC-64). This multi-threading doesn't use the CPU's full power. (The main thread needs to wait for the sub-threads to finish.)
File sizes (an 8-byte integer * 10 files) are compressed, but the data is already small.
Though the MPEG files are all different, identical 1 KB data appears somewhere. The dupe rate is 949 / 148,564 = 0.639%.
Because duplicate slices are few, the compression doesn't gain much. But it is still better than a general compression algorithm like LZMA; hash/checksum data is almost random and very hard to compress.
Filenames are compressed by PPMd and become very small. But this doesn't make much difference when the filenames are short.
A PAR3 index file's size is roughly predictable as "number of slices * 16". 148,564 * 16 = 2,377,024, which is close to the actual file size.
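The rule of thumb above can be checked against the sample run. The function name and the constant 16 (bytes of checksum/hash data per slice, dominating the index size) are taken from the text, not from the PAR3 format itself.

```python
def estimated_index_size(slice_count, bytes_per_slice=16):
    # Rough estimate: ~16 bytes of per-slice checksum/hash data
    # dominates the PAR3 index file size.
    return slice_count * bytes_per_slice

estimate = estimated_index_size(148564)
actual = 2370933  # index file size from the sample run
print(estimate, f"{estimate / actual:.3f}")  # 2377024, ratio 1.003
```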