I am writing Python code that automatically processes and combines CSV datasets. Each folder I access contains two datasets, for example
- download/2019/201901/dragon.csv
- download/2019/201901/kingdom.csv
The file names are the same across all folders; in other words, every folder has two datasets with the names above. The 'download' folder contains four folders (2019, 2020, 2021, 2022), and each year folder contains one folder per month, e.g., 2019/201901, 2019/201902, and so on. In this situation, I want to process only the 'dragon.csv' files. How can I do that? My current code is
import os
import pandas as pd
import numpy as np

path = 'download/2019'
save_path = 'download'

class Preprocess:
    def __init__(self, path, save_path):
        self.path = path
        self.save_path = save_path
After the processing is finished, I save the datasets:
def save_dataset(path, save_path):
    for dir in os.listdir(path):
        for file in os.listdir(os.path.join(path, dir)):
            if file[-3:] == 'csv':
                df = pd.read_csv(os.path.join(path, dir, file))
                print(f'Reading data from {os.path.join(path, dir, file)}')
                print('Start Preprocessing...')
                df = preprocessing(df)
                print('Finished!')
                if not os.path.exists(os.path.join(save_path, dir)):
                    os.makedirs(os.path.join(save_path, dir))
                df.to_csv(os.path.join(save_path, dir, file), index=False)

save_dataset(path, save_path)
CodePudding user response:
You can use pathlib's glob method:
from pathlib import Path
p = Path()  # current directory; point this at the folder that contains `download` if you are elsewhere
dragons_paths = p.glob("download/**/dragon.csv")

dragons_paths is a generator that yields the path of every dragon.csv file under the download folder.
PS. You should avoid shadowing the built-in dir; consider naming your variable dir_ or d instead.
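To make the glob approach a bit more concrete, here is a minimal sketch. The helper names find_named_files and output_path_for are made up for illustration (they are not part of pathlib), and the sketch assumes you want the output tree to mirror the input tree under some separate save folder.

```python
from pathlib import Path


def find_named_files(root, filename="dragon.csv"):
    """Return a sorted list of every file called `filename` below `root`.

    Hypothetical helper: rglob searches recursively, so it reaches
    nested folders such as download/2019/201901/.
    """
    return sorted(p for p in Path(root).rglob(filename) if p.is_file())


def output_path_for(src, root, save_root):
    """Mirror the position of `src` under `root` inside `save_root`.

    Hypothetical helper: only computes the destination path,
    it does not create any folders or files.
    """
    return Path(save_root) / Path(src).relative_to(root)
```

For each path returned by find_named_files you could then call pd.read_csv(path), run your preprocessing, create the destination folder with dest.parent.mkdir(parents=True, exist_ok=True), and finish with df.to_csv(dest, index=False).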
CodePudding user response:
If I understand your question, you only want to process files whose names include the substring "dragon". You could do this by adding a condition to your if statement. Instead of writing if file[-3:] == 'csv', simply write if file[-3:] == 'csv' and 'dragon' in file.
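One caveat: a substring test would also match a file such as dragonfly.csv (a made-up name, just to illustrate). If the file name is always exactly dragon.csv, comparing the whole name may be safer. A rough sketch of that variant, where iter_target_files is a hypothetical helper and path is assumed to be a year folder containing the month folders:

```python
import os


def iter_target_files(path, target="dragon.csv"):
    """Yield (subfolder, filename) pairs for files named exactly `target`."""
    for d in sorted(os.listdir(path)):
        folder = os.path.join(path, d)
        if not os.path.isdir(folder):
            continue  # skip stray files sitting next to the month folders
        for file in sorted(os.listdir(folder)):
            if file == target:  # exact match instead of a substring test
                yield d, file
```

Inside save_dataset you could loop over these pairs and keep the read, preprocess, and save steps unchanged.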