Parallel gzip compression with pigz

The unfortunate thing about working with large volumes of sequencing data is that analysing it can take a long time. I’m okay with computational bottlenecks if I’m going to see exciting results once the job has finished running. But waiting for the raw data to decompress before I can even start? Not so much.

While many tools can now handle gzipped input files, if you do need to gzip or gunzip large files and tend to find yourself with nothing to do (ha!) while this happens, try using pigz.

This tool was brought to my attention by Dave Tang, and I was surprised to learn (though I should have checked) that gzip doesn’t normally run in parallel. pigz, written by Mark Adler, is basically a parallel implementation of gzip that can significantly speed things up. By default, it compresses with as many threads as there are online processors.
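
On a shared server you may not want pigz to grab every core. Its `-p` option sets the number of compression threads. Below is a minimal sketch of a wrapper that caps the thread count; the function name `compress_capped` and the file name in the usage comment are made up for illustration, and the gzip fallback is my own addition for machines without pigz.

```shell
# Compress a file with a capped number of pigz threads.
# Falls back to single-threaded gzip if pigz is not installed.
compress_capped() {
    # $1 = thread count, $2 = file to compress
    if command -v pigz >/dev/null 2>&1; then
        pigz -p "$1" "$2"
    else
        gzip "$2"
    fi
}

# Usage with a hypothetical file name:
# compress_capped 8 reads.fastq
```

As with plain gzip, the original file is replaced by `reads.fastq.gz` once compression finishes.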

How speedy is it? I tested the compression and decompression of eight smaller files (~8.7G each, uncompressed) and one large file (~70G uncompressed) using pigz and gzip with no additional options. Note that this was done on a server with 96 cores and 1TB of RAM that was not heavily in use at the time.

Installing pigz

# In ~/bin, get pigz (replace this URL with the latest version)

# Decompress pigz
tar -xzvf pigz-2.3.4.tar.gz

# Build pigz (the binary is created inside this directory)
cd pigz-2.3.4
make

# Add this line to ~/.bashrc file and refresh
# export PATH=$HOME/bin/pigz-2.3.4:$PATH
source ~/.bashrc
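
To confirm that the shell now finds the new binary, a quick check along these lines may help. This is only a sketch: `check_on_path` is a hypothetical helper name, not part of pigz.

```shell
# Print the resolved path of a command, or a warning if it is not on PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Usage: after the steps above, pigz should resolve to the pigz-2.3.4 directory
# check_on_path pigz && pigz --version
```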

Compression of 8 smaller files

Each file is 8.7 or 8.8G in size:

# With gzip
time gzip /RAW/Lappan/shotgun_raw/original_data/808*

real    128m17.322s
user    127m15.664s
sys     0m53.892s

# With pigz
time pigz /RAW/Lappan/shotgun_raw/original_data/808*

real    3m17.983s
user    132m10.900s
sys     1m38.176s

Decompression of 8 smaller files

Each file is now 2.4 or 2.5G in size:

# With gzip
time gunzip /RAW/Lappan/shotgun_raw/original_data/808*

real    8m15.766s
user    6m53.032s
sys     1m22.136s

# With pigz
time unpigz /RAW/Lappan/shotgun_raw/original_data/808*

real    4m26.587s
user    5m27.408s
sys     2m3.336s

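Note that pigz does not actually parallelise decompression (according to its documentation, an ordinary deflate stream can’t be decompressed in parallel). It does, however, create separate threads for reading, writing, and check calculation, which is probably why unpigz still finished faster than gunzip here.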
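
Since decompression gains much less from pigz, it can be worth skipping the decompress-to-disk step when a tool only needs to read the data. A minimal sketch: `-d` decompresses and `-c` writes to standard output, leaving the `.gz` file untouched. The name `stream_gz` and the file name in the usage comment are placeholders, and the gzip fallback is my own addition.

```shell
# Stream a .gz file to stdout without writing the decompressed copy to disk.
# -d decompresses and -c writes to stdout; both flags work for pigz and gzip.
stream_gz() {
    if command -v pigz >/dev/null 2>&1; then
        pigz -dc "$1"
    else
        gzip -dc "$1"
    fi
}

# Usage: count reads in a FASTQ file (4 lines per read); file name is a placeholder
# stream_gz sample.fastq.gz | awk 'END { print NR / 4 }'
```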

Compression of 1 large file

The large file is 70G uncompressed:

# With gzip
time gzip /RAW/Lappan/pigz_test.fastq

real    128m10.759s
user    127m19.204s
sys     0m43.732s

# With pigz
time pigz /RAW/Lappan/pigz_test.fastq

real    3m19.088s
user    131m45.328s
sys     1m42.384s

Decompression of 1 large file

The large file is now 20G compressed:

# With gzip
time gunzip /RAW/Lappan/pigz_test.fastq.gz

real    7m53.338s
user    6m50.192s
sys     1m2.652s

# With pigz
time unpigz /RAW/Lappan/pigz_test.fastq.gz

real    4m23.894s
user    5m25.096s
sys     2m8.820s

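Looking at the real time (the wall-clock time that actually passed while the command was running), pigz offers a huge improvement for compression (about 128 minutes down to just over 3, roughly a 39-fold speedup) and a more modest one for decompression (about 8 minutes down to about 4.5, a bit under 2-fold), though decompression is quite fast regardless.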
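
For anyone who wants to turn `time` output into ratios, here is a small sketch using the numbers from the eight-file test above. The helper names `to_seconds` and `ratio` are my own. The last call divides pigz’s user time by its real time, which gives a rough idea of how many cores were busy on average during compression.

```shell
# Convert a bash `time` value such as 3m17.983s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two durations, printed to one decimal place.
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

# Compression of the 8 smaller files: gzip real time over pigz real time
ratio 128m17.322s 3m17.983s
# Decompression of the 8 smaller files: gunzip real time over unpigz real time
ratio 8m15.766s 4m26.587s
# pigz compression: user time over real time (rough number of busy cores)
ratio 132m10.900s 3m17.983s
```

The user time for pigz is slightly higher than for gzip, so the total work is about the same; it is just spread over many cores.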

So, use pigz!

If you have any questions or comments, please email me or find me on Twitter.

Written on August 7, 2017