Announcing Bito’s free open-source sponsorship program. Apply now

Get high quality AI code reviews

Unzip Gz File Python: A Step-By-Step Guide

Table of Contents

Gzip compressed (gz) files are commonly used for reducing file sizes for transmission and storage, especially in Linux and Unix environments. Python contains built-in functionality for handling gzip compression, making it easy to read and write gzipped files. This comprehensive guide will walk through the key steps and techniques for unzipping gz files using Python.Gzip is a popular compression algorithm that produces files ending with a “.gz” extension. The compression reduces file size at the cost of non-human readability. Gzipped files must be uncompressed before the original contents can be accessed and interpreted.

Python has a gzip module included in its standard library for working with gzipped files. The module provides functions for opening, reading, writing, and unzipping files compressed with gzip. There are also alternative methods using Python’s shutil, tarfile, pandas, and sh modules.

We will cover the core approaches for unzipping gz files using Python:

  • Reading the contents of gzipped files
  • Extracting gzipped files to access the original uncompressed data
  • Uncompressing multiple gzipped files in a directory
  • Alternative modules for handling gz files

Understanding these key techniques will equip you with the skills needed to work with gzip compression in Python.

Reading Gz File Contents

The simplest way to access the contents of a gzipped file in Python is to use the gzip.open() method. This will return a file-like object that can be read from or written to just like a normal file.

Here is an example:

import gzip

with gzip.open('file.gz', 'rb') as f:
  file_content = f.read()

print(file_content)

The 'rb' mode opens the gzipped file for reading bytes. The decompressed file content can then be accessed using f.read() or read line-by-line within a for loop.

This allows you to access the uncompressed data as if the file was never compressed.

Unzipping Gz Files

While reading the contents provides access to the decompressed data, the actual gz file remains compressed. To fully unzip and extract the contents, you need to write the decompressed data to a new file.

The following example demonstrates opening a gz file, decompressing the data, and saving it to a new uncompressed file:

import gzip
import shutil

with gzip.open('file.gz', 'rb') as f_in:
  with open('file.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

The shutil.copyfileobj() method efficiently copies data between two file objects. By passing the uncompressed gzip file object to the new uncompressed file object, it writes the decompressed data as it reads with no temporary buffers.

This results in a new file.txt containing the original uncompressed data. The file.gz remains intact.

Unzipping Multiple Gz Files

Often you will need to decompress many gzipped files in a directory rather than a single file. This involves looping through each file, checking for a .gz extension, and extracting the contents.

Here is some sample code to unzip all gz files in a directory:

import os
import gzip
import shutil

for filename in os.listdir('.'):
  if filename.endswith('.gz'): 
     with gzip.open(filename, 'rb') as f_in:
       with open(filename[:-3], 'wb') as f_out:
         shutil.copyfileobj(f_in, f_out)

Python’s os.listdir() method retrieves all files in the current directory. The loop iterates through each file name and checks for a .gz extension using filename.endswith('.gz').

If the file is gzipped, it opens the file, decompresses it using the gzip module, and writes the output to a file without the .gz extension.

This efficiently extracts all gzipped files in a directory while preserving the original compressed versions.

Handling Large Gzipped Files

When working with large gzipped files, you may want to stream the data in chunks rather than loading the entire decompressed file into memory.

This can be accomplished by reading chunks from the gzipped file object and writing them to the output file in a loop:

import gzip

chunk_size = 4096
with gzip.open('large_file.gz', 'rb') as f_in:
  with open('large_file.txt', 'wb') as f_out:
    chunk = f_in.read(chunk_size)
    while chunk:
      f_out.write(chunk)
      chunk = f_in.read(chunk_size) 

The read() method is passed a chunk size in bytes. It will return compressed chunks of that size that can then be written to the output file. This minimizes memory usage while decompressing large gzipped files.

Alternative Methods

While Python’s gzip module is the primary tool for working with gzipped files, there are a few alternative options:

  • tarfile – This module can open and extract .tar.gz archives. The 'r:gz' mode decompresses the data.
  • pandas – The read_json() and read_csv() methods support on-the-fly decompression of gzipped JSON and CSV files.
  • sh – Provides a gunzip command through the sh module as a subprocess call.

For example:

import sh
sh.gunzip('file.gz')

These alternatives can be useful for specific gzipped file types or if you need to interface with external command line tools.

Conclusion

Python provides great native support for handling gzip compression, making it easy to read and decompress gzipped files.

The key steps are:

  • Use gzip.open() to open and read gzipped files
  • Write decompressed data to a new file object to extract the original uncompressed content
  • Loop through a directory to unzip all .gz files
  • Use utility modules like tarfile and pandas for special file types
  • Stream decompressed data in chunks to handle large gzipped files efficiently

Following this guide, you should now have the knowledge to proficiently unzip gzip files in Python and integrate gzip compression into your applications.

Picture of Nisha Kumari

Nisha Kumari

Nisha Kumari, a Founding Engineer at Bito, brings a comprehensive background in software engineering, specializing in Java/J2EE, PHP, HTML, CSS, JavaScript, and web development. Her career highlights include significant roles at Accenture, where she led end-to-end project deliveries and application maintenance, and at PubMatic, where she honed her skills in online advertising and optimization. Nisha's expertise spans across SAP HANA development, project management, and technical specification, making her a versatile and skilled contributor to the tech industry.

Written by developers for developers

This article was handcrafted with by the Bito team.

Latest posts

Mastering Python’s writelines() Function for Efficient File Writing | A Comprehensive Guide

Understanding the Difference Between == and === in JavaScript – A Comprehensive Guide

Compare Two Strings in JavaScript: A Detailed Guide for Efficient String Comparison

Exploring the Distinctions: == vs equals() in Java Programming

Understanding Matplotlib Inline in Python: A Comprehensive Guide for Visualizations

Top posts

Mastering Python’s writelines() Function for Efficient File Writing | A Comprehensive Guide

Understanding the Difference Between == and === in JavaScript – A Comprehensive Guide

Compare Two Strings in JavaScript: A Detailed Guide for Efficient String Comparison

Exploring the Distinctions: == vs equals() in Java Programming

Understanding Matplotlib Inline in Python: A Comprehensive Guide for Visualizations

Get Bito for IDE of your choice