Parallel gzip compression with pigz
The unfortunate thing about dealing with large volumes of sequencing data is that the analysis of this data can take a lot of time. I’m okay with computational bottlenecks if I’m going to see exciting results once it’s finished running. When it’s decompressing the raw data so you can start? Not so much.
While many tools can now handle gzipped input files, if you do need to
gunzip large files and tend to find yourself with nothing to do (ha!) while this happens, try using
This tool was brought to my attention by Dave Tang, and I was surprised to learn (though I should have checked) that
gzip doesn’t normally run in parallel.
pigz, written by Mark Adler, is basically parallel gzip that will significantly speed things up. By default, it uses the same number of threads as there are online processors for compression.
How speedy is it? I tested the compression and decompression of some smaller files (~8G uncompressed) and one large (~70G uncompressed) file using
gzip with no additional options. Note that this was done on a server with 96 cores and 1TB RAM that was not heavily in use at the time.
# In ~/bin, get pigz (replace this URL with the latest version) wget https://zlib.net/pigz/pigz-2.3.4.tar.gz # Decompress pigz tar -xzvf pigz-2.3.4.tar.gz # Install pigz cd pigz-2.3.4 make # Add this line to ~/.bashrc file and refresh # export PATH=$HOME/bin/pigz-2.3.4:$PATH source ~/.bashrc
Compression of 8 smaller files
Each file is 8.7 or 8.8G in size:
# With gzip time gzip /RAW/Lappan/shotgun_raw/original_data/808* real 128m17.322s user 127m15.664s sys 0m53.892s # With pigz time pigz /RAW/Lappan/shotgun_raw/original_data/808* real 3m17.983s user 132m10.900s sys 1m38.176s
Decompression of 8 smaller files
Each file is now 2.4 or 2.5G in size:
# With gzip time gunzip /RAW/Lappan/shotgun_raw/original_data/808* real 8m15.766s user 6m53.032s sys 1m22.136s # With pigz time unpigz /RAW/Lappan/shotgun_raw/original_data/808* real 4m26.587s user 5m27.408s sys 2m3.336s
Note that decompression is not actually parallelised by
pigz (apparently it can’t be), but separate threads are created for different functions of decompression.
Compression of 1 large file
The large file is 70G uncompressed:
# With gzip time gzip /RAW/Lappan/pigz_test.fastq real 128m10.759s user 127m19.204s sys 0m43.732s # With pigz time pigz /RAW/Lappan/pigz_test.fastq real 3m19.088s user 131m45.328s sys 1m42.384s
Decompression of 1 large file
The large file is now 20G compressed:
# With gzip time gunzip /RAW/Lappan/pigz_test.fastq.gz real 7m53.338s user 6m50.192s sys 1m2.652s # With pigz time unpigz /RAW/Lappan/pigz_test.fastq.gz real 4m23.894s user 5m25.096s sys 2m8.820s
Looking at the
real time (how many minutes actually passed while the command was running),
pigz offers a huge improvement for both compression and decompression, though decompression is quite fast regardless.