How can I process only files with a certain name?

Time:01-22

I am writing Python code that automatically processes and combines CSV datasets. Each folder I access contains two datasets, for example:

  • download/2019/201901/dragon.csv
  • download/2019/201901/kingdom.csv

The file names are the same across all folders; in other words, every folder contains two datasets with the names above. The 'download' folder contains four folders (2019, 2020, 2021, 2022), and each year folder contains one folder per month, e.g. 2019/201901, 2019/201902, and so on. In this situation, I want to process only the 'dragon.csv' files. How can I do that? My current code is:

import os
import pandas as pd
import numpy as np

path = 'download/2019'
save_path = 'download'

class Preprocess:
    
    def __init__(self, path, save_path):  
        self.path = path
        self.save_path = save_path

after finishing processing,

def save_dataset(path, save_path):

    for dir in os.listdir(path):
        for file in os.listdir(os.path.join(path, dir)):
            if file[-3:] == 'csv':
                df = pd.read_csv(os.path.join(path, dir, file))
                print(f'Reading data from {os.path.join(path, dir, file)}')

                print('Start Preprocessing...')
                df = preprocessing(df)
                print('Finished!')
                
                if not os.path.exists(os.path.join(save_path, dir)):
                    os.makedirs(os.path.join(save_path, dir))
                df.to_csv(os.path.join(save_path, dir, file), index=False)

save_dataset(path, save_path)

CodePudding user response:

You can use pathlib's glob method:

from pathlib import Path

p = Path()  # current directory; pass the folder that contains `download` if you run from elsewhere

dragons_paths = p.glob("download/**/dragon.csv")

dragons_paths is a generator that yields the path of every dragon.csv file under the download folder, at any depth.
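
As a rough sketch of how the glob approach could slot into the question's read/process/save loop: the helpers below collect the matching files and work out where each result would be saved. The names find_named_files and output_path_for are made up for illustration, and the file name dragon.csv is taken from the question.

```python
from pathlib import Path


def find_named_files(root, name="dragon.csv"):
    """Return every file called `name` anywhere below `root`, sorted by path."""
    # rglob(name) searches recursively, like glob("**/" + name).
    return sorted(p for p in Path(root).rglob(name) if p.is_file())


def output_path_for(src, root, save_root):
    """Mirror the location of `src` relative to `root` under `save_root`."""
    return Path(save_root) / Path(src).relative_to(root)
```

Each returned path can be passed straight to pd.read_csv. Before calling df.to_csv on the mirrored path, output_path_for(...).parent.mkdir(parents=True, exist_ok=True) creates the target folder if it is missing.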

PS. You should avoid shadowing the built-in dir; call your variable dir_ or d instead.

CodePudding user response:

If I understand your question correctly, you only want to process files whose names include the substring "dragon". You can do this by adding a second condition to your if statement: instead of if file[-3:] == 'csv', write if file[-3:] == 'csv' and 'dragon' in file.
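
Note that a substring test also accepts any other name containing "dragon" (a hypothetical dragonfly.csv, say). If only the exact file is wanted, comparing the whole name is stricter. A minimal sketch that keeps the question's os.listdir structure, with list_target_files as an illustrative name:

```python
import os


def list_target_files(path, target="dragon.csv"):
    """Collect <path>/<subfolder>/<target> for every subfolder of `path`."""
    matches = []
    for month_dir in sorted(os.listdir(path)):
        month_path = os.path.join(path, month_dir)
        # Skip stray files that sit next to the month folders.
        if not os.path.isdir(month_path):
            continue
        for file_name in sorted(os.listdir(month_path)):
            # Exact comparison: kingdom.csv and look-alike names are ignored.
            if file_name == target:
                matches.append(os.path.join(month_path, file_name))
    return matches
```

The returned paths can then be read, preprocessed and saved exactly as in the original loop.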
