Sometimes I need to uncompress large archives stored in remote locations - for instance, in Amazon S3. Most common approaches to uncompressing an archive are resource-intensive: they involve downloading the entire file to disk or into memory first, which isn't tenable for multi-gigabyte files.
The solution is to stream the archive, performing the decompression and file extraction as you go - similar to piping the output of one command into another. If you write out each file as you uncompress it, you only ever need to hold the current file's data in memory. Much more efficient!
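To make the idea concrete, here's a minimal sketch using only the standard library, with a small in-memory zip standing in for the remote archive. The key move is copying fixed-size chunks from the decompressor to the output file, so memory use is bounded by the chunk size rather than by the member's uncompressed size:

```python
import io
import shutil
import zipfile

# Build a small zip archive in memory to stand in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('hello.txt', 'hello, streaming world\n')
buf.seek(0)

# Extract each member by streaming fixed-size chunks; copyfileobj never
# holds more than one chunk (here 64 KiB) in memory at a time.
with zipfile.ZipFile(buf) as z:
    for info in z.infolist():
        with z.open(info.filename) as src, open(info.filename, 'wb') as dst:
            shutil.copyfileobj(src, dst, length=64 * 1024)
```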
Enter the smart_open library - a wrapper around many different cloud storage providers (and more) that handles the gory details of opening those files in a streamable manner. With this library, I was able to quickly put together a script that opens my zip file (located in S3) and uploads the resulting files back to S3 again. No local disk, no huge memory overhead!
awscli must be installed and configured with S3 credentials.
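For reference, the setup might look like this - a sketch assuming you use pip and the AWS CLI's interactive `aws configure` command (smart_open's S3 support is an optional extra):

```shell
pip install 'smart_open[s3]' awscli
aws configure   # prompts for access key ID, secret key, and default region
```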
from smart_open import open
import zipfile

source = 'local file path / s3:// URI'
dest = 'local file path / s3:// URI'

# Iterate over all entries in the zip file
with open(source, 'rb') as file_data:
    with zipfile.ZipFile(file_data) as z:
        for file_info in z.infolist():
            new_filename = dest + file_info.filename
            # Skip directories - prefixes aren't explicitly created in S3
            if not file_info.is_dir():
                # Stream the uncompressed file directly to the destination
                with z.open(file_info.filename) as zip_file_data:
                    with open(new_filename, 'wb') as dest_file_data:
                        dest_file_data.write(zip_file_data.read())