compression tests 2021+

I collect a few GB/day of log files from my ADS-B antennae. Storing and processing them is mostly automated, so gzip has been adequate.

Looking to improve on that, I've tested a few newer methods: LZMA (xz), lz4, and zstd. xz and zstd both support multithreading via -T0, so they should be able to outperform the baseline gzip.

2024 update

lz4 and zstd have benchmark modes (-b)! Both can compress and decompress at gigabytes per second. lz4 is particularly fast, especially at decompression, but reaches much lower compression ratios. zstd with threading is also quite fast, which makes lz4's single-threaded numbers even more impressive.

zstd -T0 benchmark:

 1#gcat-20210314.log :2713597522 -> 240069090 (x11.30), 9057.1 MB/s, 2839.3 MB/s
 2#gcat-20210314.log :2713597522 -> 241187356 (x11.25), 8020.8 MB/s, 2682.4 MB/s
 3#gcat-20210314.log :2713597522 -> 238182761 (x11.39), 6223.0 MB/s, 2824.2 MB/s
 4#gcat-20210314.log :2713597522 -> 238218499 (x11.39), 5486.5 MB/s, 2826.7 MB/s
 5#gcat-20210314.log :2713597522 -> 212888534 (x12.75), 2089.2 MB/s, 2944.4 MB/s
 6#gcat-20210314.log :2713597522 -> 199991356 (x13.57), 1674.5 MB/s, 3405.0 MB/s
 7#gcat-20210314.log :2713597522 -> 196728704 (x13.79), 1472.0 MB/s, 3527.5 MB/s
 8#gcat-20210314.log :2713597522 -> 190718403 (x14.23), 1227.1 MB/s, 3585.0 MB/s
 9#gcat-20210314.log :2713597522 -> 191394435 (x14.18), 1159.9 MB/s, 3509.9 MB/s
10#gcat-20210314.log :2713597522 -> 188619770 (x14.39),  837.7 MB/s, 3630.1 MB/s
11#gcat-20210314.log :2713597522 -> 184256306 (x14.73),  577.4 MB/s, 3703.4 MB/s
12#gcat-20210314.log :2713597522 -> 184277158 (x14.73),  560.9 MB/s, 3693.2 MB/s
13#gcat-20210314.log :2713597522 -> 183289229 (x14.81),  339.8 MB/s, 3707.2 MB/s
14#gcat-20210314.log :2713597522 -> 180666285 (x15.02),  218.3 MB/s, 3837.3 MB/s
15#gcat-20210314.log :2713597522 -> 179701035 (x15.10),  110.8 MB/s, 3883.5 MB/s
16#gcat-20210314.log :2713597522 -> 174521836 (x15.55),   80.2 MB/s, 4042.0 MB/s
17#gcat-20210314.log :2713597522 -> 159367407 (x17.03),   59.2 MB/s, 3838.0 MB/s
18#gcat-20210314.log :2713597522 -> 152623233 (x17.78),   33.9 MB/s, 3839.5 MB/s
19#gcat-20210314.log :2713597522 -> 142108593 (x19.10),   18.4 MB/s, 3774.2 MB/s

zstd benchmark (threading off)

 1#gcat-20210314.log :2713597522 -> 240115907 (x11.30), 1161.7 MB/s, 2803.9 MB/s
 2#gcat-20210314.log :2713597522 -> 241293542 (x11.25), 1151.8 MB/s, 2684.4 MB/s
 3#gcat-20210314.log :2713597522 -> 238254896 (x11.39), 1078.6 MB/s, 2785.5 MB/s
 4#gcat-20210314.log :2713597522 -> 238282005 (x11.39), 1031.4 MB/s, 2783.6 MB/s
 5#gcat-20210314.log :2713597522 -> 212869438 (x12.75),  293.8 MB/s, 2958.8 MB/s
 6#gcat-20210314.log :2713597522 -> 200145596 (x13.56),  230.2 MB/s, 3380.1 MB/s
 7#gcat-20210314.log :2713597522 -> 197033080 (x13.77),  205.4 MB/s, 3500.4 MB/s
 8#gcat-20210314.log :2713597522 -> 191047715 (x14.20),  168.4 MB/s, 3546.9 MB/s
 9#gcat-20210314.log :2713597522 -> 191700892 (x14.16),  168.4 MB/s, 3553.8 MB/s
10#gcat-20210314.log :2713597522 -> 188884663 (x14.37),  123.6 MB/s, 3657.4 MB/s
11#gcat-20210314.log :2713597522 -> 184457109 (x14.71),   89.6 MB/s, 3741.3 MB/s
12#gcat-20210314.log :2713597522 -> 184516421 (x14.71),   88.2 MB/s, 3717.3 MB/s
13#gcat-20210314.log :2713597522 -> 183534342 (x14.79),   63.8 MB/s, 3715.5 MB/s
14#gcat-20210314.log :2713597522 -> 180909457 (x15.00),   42.0 MB/s, 3848.3 MB/s
15#gcat-20210314.log :2713597522 -> 179848184 (x15.09),   26.3 MB/s, 3908.1 MB/s
16#gcat-20210314.log :2713597522 -> 175286553 (x15.48),   14.6 MB/s, 4117.1 MB/s
17#gcat-20210314.log :2713597522 -> 158725989 (x17.10),   11.8 MB/s, 3942.7 MB/s
18#gcat-20210314.log :2713597522 -> 152608351 (x17.78),   6.65 MB/s, 3934.4 MB/s

lz4 benchmark:

➜  /tmp lz4 -b1 -e9 flugcat-20210314.log
Benchmarking levels from 1 to 9
File(s) bigger than LZ4's max input size; testing 2016 MB only...
 1#gcat-20210314.log :2113929216 -> 336994834 (6.273),2217.7 MB/s ,6415.4 MB/s
 2#gcat-20210314.log :2113929216 -> 336994834 (6.273),2217.6 MB/s ,6412.2 MB/s
 3#gcat-20210314.log :2113929216 -> 225376163 (9.380), 360.4 MB/s ,8621.7 MB/s
 4#gcat-20210314.log :2113929216 -> 218477309 (9.676), 268.8 MB/s ,8725.0 MB/s
 5#gcat-20210314.log :2113929216 -> 214023797 (9.877), 188.9 MB/s ,8781.8 MB/s
 6#gcat-20210314.log :2113929216 -> 211747950 (9.983), 132.0 MB/s ,8869.8 MB/s
 7#gcat-20210314.log :2113929216 -> 210429674 (10.046),  92.3 MB/s ,8944.9 MB/s
 8#gcat-20210314.log :2113929216 -> 209446711 (10.093),  64.2 MB/s ,8960.4 MB/s
 9#gcat-20210314.log :2113929216 -> 208953745 (10.117),  46.8 MB/s ,8987.7 MB/s
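
The benchmark lines from both tools follow nearly the same layout, so they are easy to pull apart for comparison. A minimal Python sketch, assuming the line format stays exactly as printed above; BenchRow and parse_bench_line are names made up for this example and are not part of zstd or lz4:

```python
import re
from typing import NamedTuple, Optional


class BenchRow(NamedTuple):
    """One benchmark result line (hypothetical helper type for this example)."""
    level: int
    original_bytes: int
    compressed_bytes: int
    ratio: float
    compress_mb_s: float
    decompress_mb_s: float


# Assumed layout, taken from the output above. Matches both the zstd form
# "(x11.30), 9057.1 MB/s, 2839.3 MB/s" and the lz4 form
# "(6.273),2217.7 MB/s ,6415.4 MB/s".
BENCH_LINE = re.compile(
    r"^\s*(\d+)#.*?:\s*(\d+)\s*->\s*(\d+)\s*"
    r"\(x?([\d.]+)\)\s*,\s*([\d.]+)\s*MB/s\s*,\s*([\d.]+)\s*MB/s"
)


def parse_bench_line(line: str) -> Optional[BenchRow]:
    """Return a BenchRow for a result line, or None for any other line."""
    m = BENCH_LINE.match(line)
    if m is None:
        return None
    level, orig, comp, ratio, comp_speed, decomp_speed = m.groups()
    return BenchRow(int(level), int(orig), int(comp),
                    float(ratio), float(comp_speed), float(decomp_speed))


# Level 9 lines copied from the two zstd runs above.
threaded = parse_bench_line(
    " 9#gcat-20210314.log :2713597522 -> 191394435 (x14.18), 1159.9 MB/s, 3509.9 MB/s")
single = parse_bench_line(
    " 9#gcat-20210314.log :2713597522 -> 191700892 (x14.16),  168.4 MB/s, 3553.8 MB/s")
speedup = threaded.compress_mb_s / single.compress_mb_s
print(f"level 9 compression speedup from -T0: {speedup:.1f}x")
```

Feeding every line of a saved benchmark run through parse_bench_line and dropping the None results gives a list that can be sorted by ratio or speed.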

compression algorithms

raw info

All tests were done on an Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz while the machine was also in use for other things.

Results would be more repeatable if the page cache were dropped before each run (echo 1 > /proc/sys/vm/drop_caches, as root) and there were less background I/O.

flugcat-20210314.log:

  • 2713597522 bytes

zstd:

  • zstd flugcat-20210314.log -o flugcat-20210314.log.3.zst 6.09s user 0.66s system 114% cpu 5.881 total
  • 8.95% 243135780 bytes
  • unzstd flugcat-20210314.log.3.zst 2.24s user 1.58s system 99% cpu 3.833 total

zstd - -T0 -9:

  • zstd -T0 -9 flugcat-20210314.log -o flugcat-20210314.log.9.zst 39.54s user 0.99s system 339% cpu 11.948 total
  • 7.00% 190202201 bytes
  • unzstd flugcat-20210314.log.9.zst 1.87s user 1.52s system 80% cpu 4.201 total

zstd - -T0 -19 (max without --ultra):

  • zstd -T0 -19 flugcat-20210314.log -o flugcat-20210314.log.19.zst 2442.36s user 3.44s system 352% cpu 11:34.70 total
  • 5.25% 142548948 bytes
  • unzstd flugcat-20210314.log.19.zst 1.81s user 1.49s system 99% cpu 3.306 total

xz:

  • xz flugcat-20210314.log 971.35s user 2.50s system 99% cpu 16:15.41 total
  • 5.20% 141154572 bytes
  • unxz flugcat-20210314.log.xz 11.77s user 1.52s system 41% cpu 31.900 total

xz - -T0:

  • xz -T0 flugcat-20210314.log 1107.33s user 2.52s system 343% cpu 5:23.36 total
  • 5.24% 142435176 bytes
  • unxz flugcat-20210314.log.xz 11.58s user 1.71s system 62% cpu 21.346 total

gzip:

  • gzip flugcat-20210314.log 37.42s user 0.98s system 99% cpu 38.495 total
  • 8.03% 217945296 bytes
  • gunzip flugcat-20210314.log.gz 8.84s user 1.26s system 58% cpu 17.378 total
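
The numbers above come from the shell's time on the command-line tools. For a quick sanity check without them, the same kind of measurement can be sketched in Python. This swaps in the standard-library gzip and lzma modules rather than the CLI programs, works in memory on made-up sample data, and leaves zstd out because older Python versions have no standard-library module for it, so the results are not comparable to the figures above. The names measure, compare, and CODECS are invented for this example:

```python
import gzip
import lzma
import time
from typing import Callable, Dict, Tuple

Result = Tuple[float, float, float]


def measure(compress: Callable[[bytes], bytes],
            decompress: Callable[[bytes], bytes],
            data: bytes) -> Result:
    """Return (size as % of original, compress seconds, decompress seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_s = time.perf_counter() - start

    if unpacked != data:
        raise ValueError("round trip did not reproduce the input")
    return 100.0 * len(packed) / len(data), compress_s, decompress_s


# Library defaults, which are not necessarily the CLI defaults:
# gzip.compress uses level 9, lzma.compress writes .xz at preset 6.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> Dict[str, Result]:
    """Run every codec in CODECS over the same in-memory data."""
    return {name: measure(c, d, data) for name, (c, d) in CODECS.items()}


if __name__ == "__main__":
    # Synthetic, highly repetitive placeholder data, not a real log.
    sample = b"MSG,1,placeholder,line\n" * 50_000
    for name, (pct, c_s, d_s) in compare(sample).items():
        print(f"{name}: {pct:.2f}% of original, "
              f"{c_s:.3f}s compress, {d_s:.3f}s decompress")
```

Because everything happens in memory, this avoids disk and page-cache effects entirely, which also means it says nothing about I/O behaviour on multi-gigabyte files.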

flugcat log processing

Grep through the messages, pull out the ones with flight info (MSG,1), and reduce them to the unique pairs of fields 5 and 11.

uncompressed:

✗ time grep ^MSG,1 flugcat-20210314.log | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} ^MSG,1   2.55s user 0.42s system 87% cpu 3.404 total
awk -F , '{ print $5","$11 }'  0.88s user 0.03s system 26% cpu 3.400 total
sort  0.53s user 0.04s system 16% cpu 3.518 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 3.516 total

zstd:

✗ time zstdgrep ^MSG,1 flugcat-20210314.log.zst | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zstdgrep ^MSG,1 flugcat-20210314.log.zst  4.60s user 1.06s system 156% cpu 3.630 total
awk -F , '{ print $5","$11 }'  0.90s user 0.10s system 27% cpu 3.629 total
sort  0.54s user 0.03s system 15% cpu 3.749 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 3.748 total

xz:

✗ time xzgrep ^MSG,1 flugcat-20210314.log.xz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
xzgrep ^MSG,1 flugcat-20210314.log.xz  15.11s user 2.13s system 128% cpu 13.392 total
awk -F , '{ print $5","$11 }'  0.97s user 0.10s system 7% cpu 13.391 total
sort  0.53s user 0.05s system 4% cpu 13.522 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 13.522 total

gzip:

✗ time zgrep ^MSG,1 flugcat-20210314.log.gz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zgrep ^MSG,1 flugcat-20210314.log.gz  12.64s user 1.88s system 140% cpu 10.363 total
awk -F , '{ print $5","$11 }'  1.03s user 0.06s system 10% cpu 10.363 total
sort  0.56s user 0.02s system 5% cpu 10.484 total
uniq > /dev/null  0.02s user 0.01s system 0% cpu 10.483 total
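
For comparison, the same grep | awk | sort | uniq pipeline can be sketched in Python. This is an illustration under assumptions, not what was timed above: open_log and unique_msg1_pairs are hypothetical names, the compression is guessed from the file extension, and only plain, .gz, and .xz inputs are handled because older Python versions have no standard-library zstd module.

```python
import gzip
import lzma
from typing import Set, Tuple


def open_log(path: str):
    """Open a plain, .gz, or .xz log as text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def unique_msg1_pairs(path: str) -> Set[Tuple[str, str]]:
    """Collect distinct (field 5, field 11) pairs from lines starting with MSG,1."""
    pairs: Set[Tuple[str, str]] = set()
    with open_log(path) as log:
        for line in log:
            # Same prefix test as grep ^MSG,1 (so it would also match MSG,10).
            if not line.startswith("MSG,1"):
                continue
            fields = line.rstrip("\n").split(",")
            # awk fields are 1-based, so $5 and $11 are indexes 4 and 10.
            # awk gives an empty string for a missing field; mirror that here.
            field5 = fields[4] if len(fields) > 4 else ""
            field11 = fields[10] if len(fields) > 10 else ""
            pairs.add((field5, field11))
    return pairs


def print_pairs(path: str) -> None:
    """Print the pairs sorted, in the same "a,b" form the awk command emits."""
    for field5, field11 in sorted(unique_msg1_pairs(path)):
        print(f"{field5},{field11}")
```

Using a set replaces the sort | uniq step, and sorting happens only once at print time over the already deduplicated pairs.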

conclusions

zstd works really well, and I agree with Arch Linux's decision to use it for their packages.

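Personally, I'll be using -T0 -9. Compared with gzip it saves about 1 percentage point of the original size (7.00% vs 8.03%, so the compressed files are roughly 13% smaller) while still being reasonably quick to compress (about 12s vs 38s wall time on this file).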
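
The space saving can be checked against the byte counts listed under raw info. A small Python sketch; the dictionary labels and helper names are made up for this example, and the printed percentages may differ in the last digit from the ones quoted above depending on rounding:

```python
ORIGINAL = 2_713_597_522  # size of flugcat-20210314.log in bytes

# Compressed sizes in bytes, copied from the raw info section.
SIZES = {
    "gzip": 217_945_296,
    "zstd (default)": 243_135_780,
    "zstd -T0 -9": 190_202_201,
    "zstd -T0 -19": 142_548_948,
    "xz": 141_154_572,
    "xz -T0": 142_435_176,
}


def percent_of_original(size: int) -> float:
    """Compressed size as a percentage of the uncompressed log."""
    return 100.0 * size / ORIGINAL


def saving_vs(baseline: str, candidate: str) -> float:
    """Percent reduction in output size of candidate relative to baseline."""
    return 100.0 * (1 - SIZES[candidate] / SIZES[baseline])


for name, size in SIZES.items():
    print(f"{name:>15}: {percent_of_original(size):5.2f}% of original")

print(f"zstd -T0 -9 vs gzip: {saving_vs('gzip', 'zstd -T0 -9'):.1f}% smaller")
```

The two helpers separate the two ways of stating the result: percentage points of the original size, and relative reduction against the gzip output.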