I am trying to read a gzip file using pandas.read_csv
like so:
import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
But it throws this error:
ValueError: Passed header names mismatches usecols
However, if I manually extract the zip file from gz file, then read_csv
if able to read the data without errors:
df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
Since I have to read a lot of these files I don't want to manually extract them. So, how can I fix this error?
CodePudding user response:
use the gzip
module to unzip all your files somethings like this
for file in list_file_names:
file_name=file.replace(".gz","")
with gzip.open(file, 'rb') as f:
file_content = f.read()
with open(file_name,"wb") as r:
r.write(file_content)
CodePudding user response:
You can use zipfile
module, such as :
import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
zip_ref.extractall(directory_to_extract_to)