Compress excel file in python-CodePudding

Right now my final output is in excel format. I wanted to compressed my excel file using gzip. Is there a way to do it ?

import pandas as pd
import gzip
import re

def renaming_ad_unit():
    with gzip.open('weekly_direct_house.xlsx.gz') as f:
        df = pd.read_excel(f)
        result = df['Ad unit'].to_list()
        for index, a_string in enumerate(result):
            modified_string = re.sub(r"\([^()]*\)", "", a_string)
            df.at[index,'Ad unit'] = modified_string

    return df.to_excel('weekly_direct_house.xlsx',index=False)

CodePudding user response：

Yes, this is possible.

To create a gzip file, you can open the file like this:

with gzip.open('filename.xlsx.gz', 'wb') as f:
    ...

Unfortunately, when I tried this, I found that I get the error OSError: Negative seek in write mode. This is because the Pandas excel writer moves backwards in the file when writing, and uses multiple passes to write the file. This is not allowed by the gzip module.

To fix this, I created a temporary file, and wrote the excel file there. Then, I read the file back, and write it to the compressed archive.

I wrote a short program to demonstrate this. It reads an excel file from a gzip archive, prints it out, and writes it back to another gzip file.

import pandas as pd
import gzip
import tempfile

def main():
    with gzip.open('apportionment-2020-table02.xlsx.gz') as f:
        df = pd.read_excel(f)
        print(df)

    with tempfile.TemporaryFile() as excel_f:
        df.to_excel(excel_f, index=False)
        with gzip.open('output.xlsx.gz', 'wb') as gzip_f:
            excel_f.seek(0)
            gzip_f.write(excel_f.read())

if __name__ == '__main__':
    main()

Here's the file I'm using to demonstrate this: Link

CodePudding user response：

You could also use io.BytesIO to create file in memory and write excel in this file and next write this file as gzip on disk.

I used link to excel file from Nick ODell answer.

import pandas as pd
import gzip
import io

df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')

buf = io.BytesIO()

df.to_excel(buf)

buf.seek(0)  # move to the beginning of file

with gzip.open('output.xlsx.gz', 'wb') as f:
    f.write(buf.read())