I collect a few GB/day of logfiles from my ADSB antennae. Storing and processing these is mostly automated so gzip has been adequate.
Looking to improve on that, I've tested a few newer methods: LZMA (xz), lz4, and zstd. xz and zstd both support threading with -T0, so they should be able to outperform the baseline gzip.
lz4 and zstd have benchmark modes! Both can compress and decompress gigabytes per second. lz4 is particularly fast but with much lower compression ratios. Threading in zstd is also quite fast but makes the single thread of lz4 even more impressive.
1#gcat-20210314.log :2713597522 -> 240069090 (x11.30), 9057.1 MB/s, 2839.3 MB/s
2#gcat-20210314.log :2713597522 -> 241187356 (x11.25), 8020.8 MB/s, 2682.4 MB/s
3#gcat-20210314.log :2713597522 -> 238182761 (x11.39), 6223.0 MB/s, 2824.2 MB/s
4#gcat-20210314.log :2713597522 -> 238218499 (x11.39), 5486.5 MB/s, 2826.7 MB/s
5#gcat-20210314.log :2713597522 -> 212888534 (x12.75), 2089.2 MB/s, 2944.4 MB/s
6#gcat-20210314.log :2713597522 -> 199991356 (x13.57), 1674.5 MB/s, 3405.0 MB/s
7#gcat-20210314.log :2713597522 -> 196728704 (x13.79), 1472.0 MB/s, 3527.5 MB/s
8#gcat-20210314.log :2713597522 -> 190718403 (x14.23), 1227.1 MB/s, 3585.0 MB/s
9#gcat-20210314.log :2713597522 -> 191394435 (x14.18), 1159.9 MB/s, 3509.9 MB/s
10#gcat-20210314.log :2713597522 -> 188619770 (x14.39), 837.7 MB/s, 3630.1 MB/s
11#gcat-20210314.log :2713597522 -> 184256306 (x14.73), 577.4 MB/s, 3703.4 MB/s
12#gcat-20210314.log :2713597522 -> 184277158 (x14.73), 560.9 MB/s, 3693.2 MB/s
13#gcat-20210314.log :2713597522 -> 183289229 (x14.81), 339.8 MB/s, 3707.2 MB/s
14#gcat-20210314.log :2713597522 -> 180666285 (x15.02), 218.3 MB/s, 3837.3 MB/s
15#gcat-20210314.log :2713597522 -> 179701035 (x15.10), 110.8 MB/s, 3883.5 MB/s
16#gcat-20210314.log :2713597522 -> 174521836 (x15.55), 80.2 MB/s, 4042.0 MB/s
17#gcat-20210314.log :2713597522 -> 159367407 (x17.03), 59.2 MB/s, 3838.0 MB/s
18#gcat-20210314.log :2713597522 -> 152623233 (x17.78), 33.9 MB/s, 3839.5 MB/s
19#gcat-20210314.log :2713597522 -> 142108593 (x19.10), 18.4 MB/s, 3774.2 MB/s
1#gcat-20210314.log :2713597522 -> 240115907 (x11.30), 1161.7 MB/s, 2803.9 MB/s
2#gcat-20210314.log :2713597522 -> 241293542 (x11.25), 1151.8 MB/s, 2684.4 MB/s
3#gcat-20210314.log :2713597522 -> 238254896 (x11.39), 1078.6 MB/s, 2785.5 MB/s
4#gcat-20210314.log :2713597522 -> 238282005 (x11.39), 1031.4 MB/s, 2783.6 MB/s
5#gcat-20210314.log :2713597522 -> 212869438 (x12.75), 293.8 MB/s, 2958.8 MB/s
6#gcat-20210314.log :2713597522 -> 200145596 (x13.56), 230.2 MB/s, 3380.1 MB/s
7#gcat-20210314.log :2713597522 -> 197033080 (x13.77), 205.4 MB/s, 3500.4 MB/s
8#gcat-20210314.log :2713597522 -> 191047715 (x14.20), 168.4 MB/s, 3546.9 MB/s
9#gcat-20210314.log :2713597522 -> 191700892 (x14.16), 168.4 MB/s, 3553.8 MB/s
10#gcat-20210314.log :2713597522 -> 188884663 (x14.37), 123.6 MB/s, 3657.4 MB/s
11#gcat-20210314.log :2713597522 -> 184457109 (x14.71), 89.6 MB/s, 3741.3 MB/s
12#gcat-20210314.log :2713597522 -> 184516421 (x14.71), 88.2 MB/s, 3717.3 MB/s
13#gcat-20210314.log :2713597522 -> 183534342 (x14.79), 63.8 MB/s, 3715.5 MB/s
14#gcat-20210314.log :2713597522 -> 180909457 (x15.00), 42.0 MB/s, 3848.3 MB/s
15#gcat-20210314.log :2713597522 -> 179848184 (x15.09), 26.3 MB/s, 3908.1 MB/s
16#gcat-20210314.log :2713597522 -> 175286553 (x15.48), 14.6 MB/s, 4117.1 MB/s
17#gcat-20210314.log :2713597522 -> 158725989 (x17.10), 11.8 MB/s, 3942.7 MB/s
18#gcat-20210314.log :2713597522 -> 152608351 (x17.78), 6.65 MB/s, 3934.4 MB/s
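One way to read a table like this is to pick the highest level that still compresses faster than the logs need. The sketch below assumes the zstd benchmark line layout shown above (level before the `#`, compression speed in the sixth whitespace-separated field); `bench_sample` and `best_level` are hypothetical helper names, and the three sample lines are copied from the second zstd run. The lz4 output uses a slightly different layout and would need different field numbers.

```shell
# Three lines copied from the second zstd benchmark run above.
bench_sample() {
cat <<'EOF'
9#gcat-20210314.log :2713597522 -> 191700892 (x14.16), 168.4 MB/s, 3553.8 MB/s
10#gcat-20210314.log :2713597522 -> 188884663 (x14.37), 123.6 MB/s, 3657.4 MB/s
11#gcat-20210314.log :2713597522 -> 184457109 (x14.71), 89.6 MB/s, 3741.3 MB/s
EOF
}

# Read benchmark lines on stdin and print the highest level whose
# compression speed is at least $1 MB/s (prints nothing if none qualifies).
best_level() {
  awk -v min="$1" '
    { split($1, a, "#"); level = a[1] + 0; speed = $6 + 0 }
    speed >= min && level > best { best = level }
    END { if (best) print best }
  '
}

bench_sample | best_level 100
```

With a 100 MB/s floor this prints 10 for the sample lines. Piping the full benchmark output into `best_level` instead of the sample works the same way.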
➜ /tmp lz4 -b1 -e9 flugcat-20210314.log
Benchmarking levels from 1 to 9
File(s) bigger than LZ4's max input size; testing 2016 MB only...
1#gcat-20210314.log :2113929216 -> 336994834 (6.273),2217.7 MB/s ,6415.4 MB/s
2#gcat-20210314.log :2113929216 -> 336994834 (6.273),2217.6 MB/s ,6412.2 MB/s
3#gcat-20210314.log :2113929216 -> 225376163 (9.380), 360.4 MB/s ,8621.7 MB/s
4#gcat-20210314.log :2113929216 -> 218477309 (9.676), 268.8 MB/s ,8725.0 MB/s
5#gcat-20210314.log :2113929216 -> 214023797 (9.877), 188.9 MB/s ,8781.8 MB/s
6#gcat-20210314.log :2113929216 -> 211747950 (9.983), 132.0 MB/s ,8869.8 MB/s
7#gcat-20210314.log :2113929216 -> 210429674 (10.046), 92.3 MB/s ,8944.9 MB/s
8#gcat-20210314.log :2113929216 -> 209446711 (10.093), 64.2 MB/s ,8960.4 MB/s
9#gcat-20210314.log :2113929216 -> 208953745 (10.117), 46.8 MB/s ,8987.7 MB/s
All tests were done on an Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz with other random usage going on.
Results could be made more consistent by running echo 1 > /proc/sys/vm/drop_caches before each test and with less background I/O.
flugcat-20210314.log:
2713597522 bytes
zstd:
zstd flugcat-20210314.log -o flugcat-20210314.log.3.zst 6.09s user 0.66s system 114% cpu 5.881 total
243135780 bytes
unzstd flugcat-20210314.log.3.zst 2.24s user 1.58s system 99% cpu 3.833 total
zstd -T0 -9:
zstd -T0 -9 flugcat-20210314.log -o flugcat-20210314.log.9.zst 39.54s user 0.99s system 339% cpu 11.948 total
190202201 bytes
unzstd flugcat-20210314.log.9.zst 1.87s user 1.52s system 80% cpu 4.201 total
zstd -T0 -19 (max):
zstd -T0 -19 flugcat-20210314.log -o flugcat-20210314.log.19.zst 2442.36s user 3.44s system 352% cpu 11:34.70 total
142548948 bytes
unzstd flugcat-20210314.log.19.zst 1.81s user 1.49s system 99% cpu 3.306 total
xz:
xz flugcat-20210314.log 971.35s user 2.50s system 99% cpu 16:15.41 total
141154572 bytes
unxz flugcat-20210314.log.xz 11.77s user 1.52s system 41% cpu 31.900 total
xz -T0:
xz -T0 flugcat-20210314.log 1107.33s user 2.52s system 343% cpu 5:23.36 total
142435176 bytes
unxz flugcat-20210314.log.xz 11.58s user 1.71s system 62% cpu 21.346 total
gzip:
gzip flugcat-20210314.log 37.42s user 0.98s system 99% cpu 38.495 total
217945296 bytes
gunzip flugcat-20210314.log.gz 8.84s user 1.26s system 58% cpu 17.378 total
Grep through the messages and pull out the ones with flight info: MSG,1.
uncompressed:
✗ time grep ^MSG,1 flugcat-20210314.log | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} ^MSG,1 2.55s user 0.42s system 87% cpu 3.404 total
awk -F , '{ print $5","$11 }' 0.88s user 0.03s system 26% cpu 3.400 total
sort 0.53s user 0.04s system 16% cpu 3.518 total
uniq > /dev/null 0.03s user 0.00s system 0% cpu 3.516 total
zstd:
✗ time zstdgrep ^MSG,1 flugcat-20210314.log.zst | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zstdgrep ^MSG,1 flugcat-20210314.log.zst 4.60s user 1.06s system 156% cpu 3.630 total
awk -F , '{ print $5","$11 }' 0.90s user 0.10s system 27% cpu 3.629 total
sort 0.54s user 0.03s system 15% cpu 3.749 total
uniq > /dev/null 0.03s user 0.00s system 0% cpu 3.748 total
xz:
✗ time xzgrep ^MSG,1 flugcat-20210314.log.xz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
xzgrep ^MSG,1 flugcat-20210314.log.xz 15.11s user 2.13s system 128% cpu 13.392 total
awk -F , '{ print $5","$11 }' 0.97s user 0.10s system 7% cpu 13.391 total
sort 0.53s user 0.05s system 4% cpu 13.522 total
uniq > /dev/null 0.03s user 0.00s system 0% cpu 13.522 total
gzip:
✗ time zgrep ^MSG,1 flugcat-20210314.log.gz | awk -F , '{ print $5","$11 }' |sort |uniq > /dev/null
zgrep ^MSG,1 flugcat-20210314.log.gz 12.64s user 1.88s system 140% cpu 10.363 total
awk -F , '{ print $5","$11 }' 1.03s user 0.06s system 10% cpu 10.363 total
sort 0.56s user 0.02s system 5% cpu 10.484 total
uniq > /dev/null 0.02s user 0.01s system 0% cpu 10.483 total
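To make clearer what the pipeline in these timings extracts, here is a small sketch on made-up data. It assumes the log is SBS-1 (BaseStation) style CSV, where field 5 is the aircraft hex ident and field 11 is the callsign. The sample lines and the helper names `sample_log` and `flights` are invented for illustration and are not taken from the real log.

```shell
# Hypothetical sample lines in SBS-1 style; not from the real logfile.
sample_log() {
cat <<'EOF'
MSG,1,1,1,3C6444,1,2021/03/14,12:00:00.000,2021/03/14,12:00:00.000,DLH123,,,,,,,,,,,0
MSG,3,1,1,3C6444,1,2021/03/14,12:00:01.000,2021/03/14,12:00:01.000,,35000,,,50.1,8.6,,,0,0,0,0
MSG,1,1,1,3C6444,1,2021/03/14,12:00:05.000,2021/03/14,12:00:05.000,DLH123,,,,,,,,,,,0
MSG,1,1,1,4CA2B1,1,2021/03/14,12:00:06.000,2021/03/14,12:00:06.000,RYR45AB,,,,,,,,,,,0
EOF
}

# Same filter as in the timed runs: keep MSG,1 lines, print
# hex ident and callsign, then drop duplicates.
flights() {
  grep '^MSG,1' | awk -F , '{ print $5","$11 }' | sort | uniq
}

sample_log | flights
```

The four sample lines collapse to two unique ident/callsign pairs. For a compressed log, the only change is the first stage: zstdgrep, xzgrep or zgrep takes the place of grep, as in the timings above.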
ZSTD works really well and I agree with Arch Linux deciding to use it in their packaging.
Personally, I'll be using -T0 -9 to save some space vs gzip (output about 13% smaller, which is roughly 1% of the original file size) while still being reasonably quick to compress.
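The space saving can be checked directly from the byte counts reported in the timing section. This is only a sketch: `pct_saved` is a hypothetical helper, and the three sizes are copied from the gzip and zstd -T0 -9 runs above.

```shell
# Byte counts copied from the timing runs above.
orig_size=2713597522     # flugcat-20210314.log
gzip_size=217945296      # gzip, default level
zstd9_size=190202201     # zstd -T0 -9

# Print (old - new) as a percentage of base, with one decimal place.
pct_saved() {
  awk -v old="$1" -v new="$2" -v base="$3" \
    'BEGIN { printf "%.1f\n", (old - new) / base * 100 }'
}

pct_saved "$gzip_size" "$zstd9_size" "$gzip_size"   # relative to the gzip output
pct_saved "$gzip_size" "$zstd9_size" "$orig_size"   # relative to the original log
```

This prints 12.7 and then 1.0: the zstd -9 file is about 12.7% smaller than the gzip file, which amounts to about 1% of the uncompressed log.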