Parallel gzip compression with pigz
The unfortunate thing about dealing with large volumes of sequencing data is that the analysis of this data can take a lot of time. I’m okay with computational bottlenecks if I’m going to see exciting results once it’s finished running. When it’s decompressing the raw data so you can start? Not so much.
While many tools can now handle gzipped input files, if you do need to gzip or gunzip large files and tend to find yourself with nothing to do (ha!) while this happens, try using pigz.
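In practice, pigz is essentially a drop-in replacement for gzip and gunzip, and it also works anywhere gzip does on a pipe. A minimal sketch (the filenames here are just placeholders):
# Compress and decompress, just like gzip/gunzip
pigz reads.fastq
unpigz reads.fastq.gz
# It also works on a pipe, e.g. to create a compressed tarball
tar -cf - my_data/ | pigz > my_data.tar.gz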
This tool was brought to my attention by Dave Tang, and I was surprised to learn (though I should have checked) that gzip doesn’t normally run in parallel. pigz, written by Mark Adler, is basically a parallel gzip that will significantly speed things up. By default, it uses the same number of threads as there are online processors for compression.
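If you’re on a shared server and don’t want pigz grabbing every core, the thread count can be capped with the -p/--processes option. For example, a hypothetical 8-thread run on a placeholder file:
# Compress using only 8 threads instead of all available cores
pigz -p 8 reads.fastq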
How speedy is it? I tested the compression and decompression of some smaller files (~8G uncompressed each) and one large (~70G uncompressed) file using pigz and gzip with no additional options. Note that this was done on a server with 96 cores and 1TB of RAM that was not heavily in use at the time.
Installing pigz
# In ~/bin, get pigz (replace this URL with the latest version)
cd ~/bin
wget https://zlib.net/pigz/pigz-2.3.4.tar.gz
# Decompress the pigz tarball
tar -xzvf pigz-2.3.4.tar.gz
# Build pigz
cd pigz-2.3.4
make
# Add this line to your ~/.bashrc file, then refresh it:
# export PATH=$HOME/bin/pigz-2.3.4:$PATH
source ~/.bashrc
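To confirm that the shell now picks up the new binary, something like:
# Check which pigz is found on the PATH and print its version
which pigz
pigz --version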
Compression of 8 smaller files
Each file is 8.7 or 8.8G in size:
# With gzip
time gzip /RAW/Lappan/shotgun_raw/original_data/808*
real 128m17.322s
user 127m15.664s
sys 0m53.892s
# With pigz
time pigz /RAW/Lappan/shotgun_raw/original_data/808*
real 3m17.983s
user 132m10.900s
sys 1m38.176s
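A side note if you want to repeat a comparison like this yourself: both tools delete the input file on success, so to avoid re-copying the raw data between runs you can keep the originals with -k (supported by pigz, and by gzip from version 1.6 onwards):
# Compress but keep the original files
time pigz -k /RAW/Lappan/shotgun_raw/original_data/808*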
Decompression of 8 smaller files
Each file is now 2.4 or 2.5G in size:
# With gzip
time gunzip /RAW/Lappan/shotgun_raw/original_data/808*
real 8m15.766s
user 6m53.032s
sys 1m22.136s
# With pigz
time unpigz /RAW/Lappan/shotgun_raw/original_data/808*
real 4m26.587s
user 5m27.408s
sys 2m3.336s
Note that decompression is not actually parallelised by pigz (apparently it can’t be for standard gzip streams), but separate threads are created for different parts of the job, such as reading, writing and check calculation, which still helps.
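For reference, unpigz is simply shorthand for pigz -d, so the two commands below are equivalent (placeholder filename):
# Two equivalent ways to decompress with pigz
unpigz reads.fastq.gz
pigz -d reads.fastq.gz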
Compression of 1 large file
The large file is 70G uncompressed:
# With gzip
time gzip /RAW/Lappan/pigz_test.fastq
real 128m10.759s
user 127m19.204s
sys 0m43.732s
# With pigz
time pigz /RAW/Lappan/pigz_test.fastq
real 3m19.088s
user 131m45.328s
sys 1m42.384s
Decompression of 1 large file
The large file is now 20G compressed:
# With gzip
time gunzip /RAW/Lappan/pigz_test.fastq.gz
real 7m53.338s
user 6m50.192s
sys 1m2.652s
# With pigz
time unpigz /RAW/Lappan/pigz_test.fastq.gz
real 4m23.894s
user 5m25.096s
sys 2m8.820s
Looking at the real time (how many minutes actually passed while the command was running), pigz offers a huge improvement for compression (roughly 40x faster here) and roughly halves the decompression time, though decompression is quite fast regardless.
So, use pigz!
If you have any questions or comments, please email me or find me on Twitter.