Pandas read_csv throws ValueError while reading gzip file

I am trying to read a gzip file using pandas.read_csv like so:

import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

But it throws this error:

ValueError: Passed header names mismatches usecols

However, if I manually extract the zip file from the gz file, read_csv is able to read the data without errors:

df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)

Since I have to read a lot of these files I don't want to manually extract them. So, how can I fix this error?

CodePudding user response:

Use the gzip module to decompress all your files, something like this:

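import gzip
import shutil

# list_file_names: your list of paths to the .gz files
for file in list_file_names:
    # Strip only the trailing ".gz" (Python 3.9+): "data.ZIP.gz" -> "data.ZIP"
    file_name = file.removesuffix(".gz")
    # Stream the decompressed bytes to disk instead of loading the whole file into memory
    with gzip.open(file, "rb") as f, open(file_name, "wb") as r:
        shutil.copyfileobj(f, r)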

CodePudding user response:

You can use the zipfile module, for example:

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
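
The error most likely occurs because pandas infers the compression from the ".gz" extension, removes only the gzip layer, and then tries to parse the raw zip bytes as CSV. If you would rather not write extracted files to disk, the two modules above can be combined in memory. This is only a minimal sketch: the helper name open_csv_in_gzipped_zip is hypothetical, and it assumes each zip holds a single CSV (the first entry is used unless you pass a member name) that is small enough to fit in memory.

```python
import gzip
import io
import zipfile


def open_csv_in_gzipped_zip(gz_path, member=None):
    """Return a binary file object for a CSV stored in a zip that is itself gzipped.

    gz_path: path to a file such as "data.ZIP.gz".
    member:  name of the file inside the zip; defaults to the first entry
             (an assumption: each archive holds a single CSV).
    """
    # Layer 1: undo the gzip compression and keep the zip bytes in memory.
    with gzip.open(gz_path, "rb") as gz:
        zip_bytes = gz.read()

    # Layer 2: open the in-memory zip and read the chosen member.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = member if member is not None else zf.namelist()[0]
        return io.BytesIO(zf.read(name))


# Hypothetical usage with the call from the question (requires pandas):
# df = pd.read_csv(open_csv_in_gzipped_zip("data.ZIP.gz"), usecols=[*range(0, 39)],
#                  encoding="latin1", skipinitialspace=True)
```

Since read_csv accepts file-like objects, the returned buffer can be passed in place of the file name, and the same call can be made in a loop over all your .gz files.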