compression tests 2021

I collect a few GB/day of logfiles from my ADSB antennae. Storing and processing these is mostly automated so gzip has been adequate.

Looking to improve on that, I've tested two newer methods: LZMA (xz) and ZSTD. Both support multithreading via -T0, so they should be able to outperform the baseline gzip.
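
For reference, the kind of invocation being compared looks roughly like the sketch below. It assumes gzip, xz and zstd are installed (any that are missing are skipped) and that the installed xz is new enough to accept -T0. The compress_all helper and the sample data are made up for illustration and are not the real flugcat log.

```shell
#!/bin/sh
# Hypothetical helper: compress one file with each tool that is installed
# and print "bytes filename" for the original and every compressed copy.
compress_all() {
    f=$1
    # gzip has no thread option; this is the single-threaded baseline.
    command -v gzip >/dev/null && gzip -c "$f" > "$f.gz"
    # -T0 asks xz and zstd to pick the thread count from the number of cores.
    command -v xz   >/dev/null && xz -T0 -c "$f" > "$f.xz"
    command -v zstd >/dev/null && zstd -q -T0 -9 -c "$f" > "$f.zst"
    for out in "$f" "$f.gz" "$f.xz" "$f.zst"; do
        [ -f "$out" ] && echo "$(wc -c < "$out") $out"
    done
    return 0
}

# Made-up, highly repetitive sample data just to exercise the helper.
tmp=$(mktemp -d)
yes 'MSG,3,1,1,4CA2D6,1,2021/03/14,12:00:00.000,2021/03/14,12:00:00.000,,37000' \
    | head -n 5000 > "$tmp/sample.log"
compress_all "$tmp/sample.log"
```

Writing to stdout with -c leaves the input file untouched, which makes it easy to compare several formats side by side.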

compression algorithms

raw info

All tests were run on an Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz while the machine was also in use for other things.

Results would be more accurate after dropping the page cache with echo 1 > /proc/sys/vm/drop_caches before each run, and with less background I/O.
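
A cold-cache run could be wrapped along these lines. This is only a sketch: it assumes Linux, the cold_run name is hypothetical, and writing to drop_caches needs root, so the helper falls back to a warning and runs the command with a warm cache when the write fails.

```shell
#!/bin/sh
# Hypothetical helper: flush dirty pages, drop the page cache, then run a command.
cold_run() {
    sync    # write dirty pages out first, otherwise they cannot be dropped
    # Writing 1 frees the page cache only (2 = dentries and inodes, 3 = both).
    if ! { echo 1 > /proc/sys/vm/drop_caches; } 2>/dev/null; then
        echo "cold_run: could not drop caches (not root?), cache left warm" >&2
    fi
    "$@"
}

cold_run echo "command ran after the cache drop attempt"
```

The timing should cover only the command itself, not the sync and cache drop, for example cold_run bash -c 'time zstdgrep ^MSG,1 flugcat-20210314.log.zst > /dev/null'.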

flugcat-20210314.log:

zstd:

zstd - -T0 -9:

zstd - -T0 -19 (max):

xz:

xz - -T0:

gzip:

flugcat log processing

Grep through the messages and pull out the ones with flight info, i.e. the lines starting with MSG,1.
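
To show what the pipeline used in the timings below extracts, here is a small sketch on made-up input. It assumes the log is in the SBS-1 (BaseStation) CSV style, where field 5 is the aircraft hex ident and field 11 the callsign; the sample lines are invented, not taken from the real log.

```shell
#!/bin/sh
# Invented sample lines: three MSG,1 (identification) and one MSG,3 (position).
sample=$(mktemp)
cat > "$sample" <<'EOF'
MSG,1,1,1,4CA2D6,1,2021/03/14,12:00:00.000,2021/03/14,12:00:00.000,RYR123,,,,,,,,,,,
MSG,3,1,1,4CA2D6,1,2021/03/14,12:00:01.000,2021/03/14,12:00:01.000,,37000,,,51.5,-0.1,,,0,0,0,0
MSG,1,1,1,4CA2D6,1,2021/03/14,12:00:05.000,2021/03/14,12:00:05.000,RYR123,,,,,,,,,,,
MSG,1,1,1,400F01,1,2021/03/14,12:00:06.000,2021/03/14,12:00:06.000,EZY45AB,,,,,,,,,,,
EOF

# Keep MSG,1 lines, print fields 5 and 11, then de-duplicate.
# sort -u gives the same result as sort | uniq; LC_ALL=C fixes the sort order.
result=$(grep '^MSG,1' "$sample" \
    | awk -F, '{ print $5 "," $11 }' \
    | LC_ALL=C sort -u)
echo "$result"
```

On this input the repeated RYR123 line collapses to one entry, leaving one line per aircraft and callsign pair.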

uncompressed:

✗ time grep ^MSG,1 flugcat-20210314.log | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} ^MSG,1   2.55s user 0.42s system 87% cpu 3.404 total
awk -F , '{ print $5","$11 }'  0.88s user 0.03s system 26% cpu 3.400 total
sort  0.53s user 0.04s system 16% cpu 3.518 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 3.516 total

zstd:

✗ time zstdgrep ^MSG,1 flugcat-20210314.log.zst | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zstdgrep ^MSG,1 flugcat-20210314.log.zst  4.60s user 1.06s system 156% cpu 3.630 total
awk -F , '{ print $5","$11 }'  0.90s user 0.10s system 27% cpu 3.629 total
sort  0.54s user 0.03s system 15% cpu 3.749 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 3.748 total

xz:

✗ time xzgrep ^MSG,1 flugcat-20210314.log.xz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
xzgrep ^MSG,1 flugcat-20210314.log.xz  15.11s user 2.13s system 128% cpu 13.392 total
awk -F , '{ print $5","$11 }'  0.97s user 0.10s system 7% cpu 13.391 total
sort  0.53s user 0.05s system 4% cpu 13.522 total
uniq > /dev/null  0.03s user 0.00s system 0% cpu 13.522 total

gzip:

✗ time zgrep ^MSG,1 flugcat-20210314.log.gz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zgrep ^MSG,1 flugcat-20210314.log.gz  12.64s user 1.88s system 140% cpu 10.363 total
awk -F , '{ print $5","$11 }'  1.03s user 0.06s system 10% cpu 10.363 total
sort  0.56s user 0.02s system 5% cpu 10.484 total
uniq > /dev/null  0.02s user 0.01s system 0% cpu 10.483 total

conclusions

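ZSTD works really well: the whole pipeline took about 3.7 s on the .zst file versus 3.5 s on the uncompressed log, compared with 10.5 s for gzip and 13.5 s for xz. I agree with Arch Linux deciding to use it for their packaging.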

Personally, I'll be using -T0 -9, which saves some space (1%) vs gzip while still being reasonably quick to compress.
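
As a rough idea of how that could slot into the daily automation, here is a minimal sketch. The compress_logs helper is hypothetical, and it assumes logs are named flugcat-YYYYMMDD.log as in the tests above and that today's file is still being written.

```shell
#!/bin/sh
# Hypothetical helper: compress every finished daily log in a directory.
compress_logs() {
    dir=$1
    today=$(date +%Y%m%d)
    for f in "$dir"/flugcat-*.log; do
        [ -e "$f" ] || continue                            # glob matched nothing
        [ "$f" = "$dir/flugcat-$today.log" ] && continue   # still being written
        # --rm deletes the source file once compression has succeeded.
        zstd -q -T0 -9 --rm "$f"
    done
}
```

It would be called with the log directory as its only argument, for example from a nightly cron job.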