I have a locally stored zip file containing multiple folders, each holding a few CSV files. I need only one particular CSV from each folder. The CSVs I want all share the same name, but I cannot figure out how to pick that file out of each folder and then concatenate the results into a pandas DataFrame.
I have tried the code below (initially just trying to read all the CSVs):
import glob
import os
import pandas as pd

path = r"C:\Users\...\Downloads\folder.zip"
all_files = glob.glob(os.path.join(path , "/*.csv"))
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
But I get: ValueError: No objects to concatenate. The CSVs are definitely present and not empty.
I am currently doing this in a SageMaker notebook, and I am not sure whether that is also causing problems. Any help would be great.
CodePudding user response:
After some digging and advice from Umar.H and mad, I figured out a solution to my original question and to the code example I was originally working with.
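My original code could not read from the zip file directly, because glob only matches paths on the filesystem and does not look inside an archive, so I unzipped the file and ran the code on a regular folder. The list of DataFrames, li, was still coming back empty. Changing the pattern in all_files from "/*file.csv" to "*/*file.csv" fixed that: the leading slash makes os.path.join treat the pattern as a path of its own, and the extra "*/" descends into each subfolder.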
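For the unzipped-folder route, the pattern fix can be sketched roughly as follows. This is only an illustration: the function name read_matching_csvs and the default file name "file.csv" are placeholders I made up, and it assumes the wanted CSVs sit exactly one folder level below the extracted root and should be matched by exact name.

```python
import glob
import os

import pandas as pd


def read_matching_csvs(root, filename="file.csv"):
    """Concatenate every <root>/<subfolder>/<filename> into one DataFrame.

    `root` is the extracted folder; `filename` is a placeholder name.
    """
    # No leading slash in any later component: os.path.join drops the
    # earlier components when a later one is an absolute path.
    # glob.escape protects against special characters in `root` itself.
    pattern = os.path.join(glob.escape(root), "*", filename)
    paths = sorted(glob.glob(pattern))
    if not paths:
        # Fail with a clearer message than pd.concat's
        # "No objects to concatenate".
        raise FileNotFoundError(f"no files match {pattern}")
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, axis=0, ignore_index=True)
```

Checking for an empty match list before calling pd.concat makes a wrong pattern show up as a path problem rather than as a concatenation error.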
To solve the main issue, which was to access all the required CSVs without unzipping the archive, I got the following to work:
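import zipfile
import pandas as pd

PATH = "C:/Users/.../Downloads/folder.zip"
li = []
with zipfile.ZipFile(PATH, "r") as f:
    for name in f.namelist():
        # namelist() returns full member paths, e.g. "subfolder/file.csv"
        if name.endswith("file.csv"):
            # open the member as a file-like object without extracting it
            with f.open(name) as data:
                df = pd.read_csv(data, header=None, low_memory=False)
            li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)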
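If it helps, the same idea can be wrapped in a small function. This is a sketch under a few assumptions of mine, not part of the original solution: the names concat_csvs_from_zip and source_folder are invented, the file is matched by its exact base name instead of endswith (so something like "other_file.csv" is not picked up by accident), and each row is tagged with the folder it came from.

```python
import posixpath
import zipfile

import pandas as pd


def concat_csvs_from_zip(zip_path, filename="file.csv", **read_csv_kwargs):
    """Read every member named `filename` from the zip and concatenate them.

    Extra keyword arguments are passed straight to pd.read_csv.
    """
    frames = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Member names normally use "/" as the separator,
            # whatever the operating system, hence posixpath.
            if posixpath.basename(name) != filename:
                continue
            with zf.open(name) as member:
                df = pd.read_csv(member, **read_csv_kwargs)
            # Hypothetical extra column recording the member's folder.
            df["source_folder"] = posixpath.dirname(name)
            frames.append(df)
    if not frames:
        raise ValueError(f"no member named {filename!r} in {zip_path}")
    return pd.concat(frames, axis=0, ignore_index=True)
```

Calling concat_csvs_from_zip(PATH, header=None, low_memory=False) should read the same files as the loop above, provided the file is literally called "file.csv"; leaving out header=None lets pandas use the first row of each CSV as column names.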
Hope this can be helpful for anyone else with large zip files.