When I tested my old samples on my new PC, I found that I had misunderstood the algorithm's complexity, and I may have written wrong information somewhere. I correct my statement here and post the results for others who are thinking about ECC algorithms.
When the data size is the same, the encoding/decoding time of PAR2 depends on the block size and the number of blocks. The complexity is "block size" * "source block count" * "parity (or lost) block count". Because the data size and the redundancy are fixed, this is proportional to the number of source blocks: if you set a block size two times larger (half as many blocks), encoding becomes about two times faster, and if you set half the block size (twice as many blocks), it takes about twice as long. I had misunderstood this calculation cost before.
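As a rough illustration of this cost model (my own sketch, not code from any of the tested programs), the relative cost for the block counts used in the tables below can be estimated like this:

/* Rough cost model sketch for PAR2-style Reed-Solomon encoding:
 * cost ~ block_size * source_block_count * parity_block_count.
 * With a fixed data size and fixed redundancy, this is proportional
 * to the number of source blocks (inversely proportional to block size).
 * Illustrative only; it is not taken from any tested implementation. */
#include <stdio.h>

int main(void)
{
    const double data_size  = 468.75 * 1024 * 1024;  /* bytes */
    const double redundancy = 0.20;                  /* 20% parity */
    const long counts[] = { 100, 300, 1000, 3000, 10000, 30000 };

    for (int i = 0; i < 6; i++) {
        double block_size   = data_size / counts[i];
        double parity_count = redundancy * counts[i];
        double cost = block_size * counts[i] * parity_count;
        /* normalize so that the 100-block case is 1 */
        printf("%6ld blocks: relative cost = %.0f\n",
               counts[i], cost / (data_size * redundancy * 100));
    }
    return 0;
}

The printed ratios (1, 3, 10, 30, 100, 300) roughly match the measured times in the first table below.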
Now, using many "Cohorts" improves speed by a factor of the number of cohorts. If there are 100 cohorts, the speed becomes 100 times faster. Even though the recovering capability may be less, the calculation speed would be a merit in some cases. So, the "Cohort" idea of JDlugosz is reasonable for a PAR2 style algorithm.
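For example, applying the cost model above to the largest case in my test (30000 source blocks, 6000 parity blocks at 20% redundancy): a single cohort needs about 30000 * 6000 = 180,000,000 source-parity pairs, while 100 cohorts of 300 source blocks need about 100 * (300 * 60) = 1,800,000 pairs, which is 1/100 of the work. But each cohort can then recover at most 60 lost blocks of its own.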
I tested 4 ECC algorithms. James Plank style Reed-Solomon Erasure Codes is the algorithm used in PAR2. Because I could not test SSE2 on my old PC, I compiled the FWT version on the new PC. Because I don't understand the FWT very well, I didn't optimize the code. (It should be about twice as fast with AVX.) The FFT and LDPC versions were compiled on my old PC and don't use SSE. I didn't test 32-bit RS with FFT, as mine is much slower than the one in RSC32. (The speed of RSC32 would be at a similar level to 16-bit RS with FFT.)
Calculation time with a single core of Celeron G1620 (2.7 GHz)
Data size = 468.75 MB (block size in bytes * number of source blocks)
= 16384 * 30000
= 49152 * 10000
= 163840 * 3000
= 491520 * 1000
= 1638400 * 300
= 4915200 * 100
Redundancy = 20%
James Plank style 16-bit Reed-Solomon Erasure Codes (used in PAR2)
Data |4.69MB * 100|1.56MB * 300|480KB * 1000|160KB * 3000|48KB * 10000|16KB * 30000|
no-SIMD| 4.9 sec | 15.1 sec | 50.5 sec | 152.1 sec | 511.1 sec | 1559.2 sec |
MMX | 4.5 sec | 13.7 sec | 45.9 sec | 138.2 sec | 464.9 sec | 1424.7 sec |
SSSE3 | 2.7 sec | 8.3 sec | 27.6 sec | 81.4 sec | 265.6 sec | 803.5 sec |
Frederic Didier's 16-bit Reed-Solomon Erasure Codes with FWT
Data |4.69MB * 100|1.56MB * 300|480KB * 1000|160KB * 3000|48KB * 10000|16KB * 30000|
no-SIMD| 170.4 sec | 242.9 sec | 312.8 sec | 222.5 sec | 283.4 sec | 397.9 sec |
SSE2 | 24.9 sec | 35.7 sec | 47.9 sec | 40.1 sec | 51.4 sec | 70.4 sec |
Michael Niedermayer style 16-bit Reed-Solomon Codes with FFT (Error Correction requires double or more time.)
Data |4.69MB * 100|1.56MB * 300|480KB * 1000|160KB * 3000|48KB * 10000|16KB * 30000|
no-SIMD| 8.7 sec | 15.0 sec | 21.0 sec | 20.2 sec | 23.2 sec | 36.4 sec |
LDPC-Staircase style LDGM (MMX; requires 24% redundancy to recover 20% loss.)
Data |4.69MB * 100|1.56MB * 300|480KB * 1000|160KB * 3000|48KB * 10000|16KB * 30000|
Encode | 0.4 sec | 0.4 sec | 0.4 sec | 0.3 sec | 0.4 sec | 0.4 sec |
Decode | 0.4 sec | 0.5 sec | 0.5 sec | 0.7 sec | 1.7 sec | 3.9 sec |
I confirmed the speed merit of the "Cohort" system for James Plank style Reed-Solomon Erasure Codes. Putting 3000 blocks in each cohort (10 cohorts total) is 10 times faster than putting 30000 blocks in a single cohort. Putting 300 blocks in each cohort (100 cohorts total) is 100 times faster than putting 30000 blocks in a single cohort. On the other hand, the minimum recovering capability is 10 (or 100) times less.
But, "Cohort" system doesn't work well for other faster algorithms. Though the speed of FFT or FWT depends on the number of blocks, the difference is small. Using 10 or 100 cohorts will improve speed by only 2 times. The effect of "Cohort" becomes bad trade off, when the algorithm has new technology. Then, Mr. persicum and Michael Niedermayer might state that interleaving (Cohort) was worthless.
By the way, the speed of LDPC on the new PC is unbelievably quick (almost instant). It seems that LDPC's speed depends largely on memory speed. Basically, the LDPC algorithm is similar to the "Cohort" idea. In the sample implementation, there are 4 nodes per source block. 24% redundancy for 30000 source blocks gives 7200 parity blocks (Cohorts), and each parity block has 30000 * 4 / 7200 = about 17 nodes (Stripes). The recovering capability of LDPC depends on the Generator Matrix and the possible damage model.
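To show the structure, here is a minimal sketch of LDGM staircase-style encoding (my own illustration with hypothetical names, not the sample implementation I tested). Each source block is XORed into a few parity blocks, and the parity blocks are chained like a staircase; recovery needs an iterative decoder, which is not shown.

/* Minimal sketch of LDGM staircase-style encoding. Each source block is
   XORed into NODES_PER_SOURCE parity blocks chosen pseudo-randomly, and
   each parity block is then chained to the previous one (the staircase).
   A real implementation would avoid picking the same parity block twice
   for one source block, because the two XORs would cancel out. */
#include <stdlib.h>
#include <string.h>

#define NODES_PER_SOURCE 4   /* 4 nodes per source block, as described above */

void ldgm_encode(unsigned char **src, int n_src,
                 unsigned char **par, int n_par,
                 size_t block_size, unsigned int seed)
{
    srand(seed);                          /* the decoder must reuse this seed */
    for (int p = 0; p < n_par; p++)
        memset(par[p], 0, block_size);

    /* sparse part: XOR each source block into a few parity blocks */
    for (int s = 0; s < n_src; s++)
        for (int e = 0; e < NODES_PER_SOURCE; e++) {
            int p = rand() % n_par;
            for (size_t i = 0; i < block_size; i++)
                par[p][i] ^= src[s][i];
        }

    /* staircase part: chain each parity block to the previous one */
    for (int p = 1; p < n_par; p++)
        for (size_t i = 0; i < block_size; i++)
            par[p][i] ^= par[p - 1][i];
}

With 30000 source blocks and 7200 parity blocks, this structure gives on average 30000 * 4 / 7200 = about 17 source nodes per parity block, as counted above, and the work is only XOR, which is why the speed is limited mostly by memory.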
I'm interested in the new Reed-Solomon Codes by Sian-Jheng Lin. If the algorithm really performs at the speed written in his paper, it will become the fastest (long code size) Reed-Solomon Codes in the world. But I haven't tested it yet, because I could not implement it, hehe. I should ask him to give me some sample code...