How to read csv files with specific pattern from nested file directories in pandas?-CodePudding

I intend to read csv file with specific pattern from nested file directories where each subdirectories has multiple csv files and I only want to read the one end with specific pattern. I already figured out the way of doing this in R, but wanted to do this in pandas. I found couple of useful post but can't able to read the file that I wanted to read in pandas.

current attempt

here is file structure that I have and wanted to read file start with Z_19_xx.csv. for instance:

import pandas as pd

dir1="demo2020/p1 pop/csv/Z_19_master.csv"
f1=pd.read_csv(dir1)

this is hard coded and I want to avoid of doing this. Below is the file structure:

demo2020
    - p1 pop
        -csv
            - A_17_master.csv
            - A_18_master.csv
            - B_18_master.csv
            - C_19_master.csv
            - Z_19_master.csv
    - p2 cop
        -csv
            - A_17_cop.csv
            - A_18_cop.csv
            - B_18_cop.csv
            - C_19_cop.csv
            - Z_19_cop.csv
    - p3 res
        -csv
            - A_17_res.csv
            - A_18_res.csv
            - B_18_res.csv
            - C_19_res.csv
            - Z_19_res.csv
    - p4 nac
        -csv
            - A_17_nac.csv
            - A_18_nac.csv
            - B_18_nac.csv
            - C_19_nac.csv
            - Z_19_nac.csv

my current attempt in R:

here is my R code to do this in handy:

yr=19
dir="demo2020/"
files <-c(f1  = paste0("p1 pop/csv/Z_", yr, "_master.csv") , 
                    f2 = paste0('p2 cop/csv/Z_', yr,'_cop.csv') ,
                    f3 = paste0('p3 res/csv/Z_', yr,'_res.csv') , 
                    f4  = paste0('p4 nac/csv/Z_', yr,'_nac.csv') 
)

path=(paste0(dir,files))
> path
[1] "demo2020/p1 pop/csv/Z_19_master.csv"
[2] "demo2020/p2 cop/csv/Z_19_cop.csv"   
[3] "demo2020/p3 res/csv/Z_19_res.csv"   
[4] "demo2020/p4 nac/csv/Z_19_nac.csv"

# read them

for(i in 1:length(files))
{
    f <- assign(names(files[i]), read.csv(paste0(dir, files[i]),stringsAsFactors = FALSE,skip = 1))
}

python objective - pandas

I want to do this in python without hard coding, and simply want to use above R code logic in python and use pandas to read csv files. So far, here is my attempt:

import pandas
import os

parent_dir = 'demo2020/'
subject_dirs = [os.path.join(parent_dir, dir) for dir in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, dir))]

filelist = []
for dir in subject_dirs:
    csv_files = [os.path.join(dir, csv) for csv in os.listdir(dir) if os.path.isfile(os.path.join(dir, csv)) and and csv.startswith('Z_') and csv.endswith('.csv')]
    for file in csv_files:
        df=pd.read_csv(file)
        filelist.append(df)

but still not getting this right, I only want to read Z_19_xx.csv from each subfolder and concatenate them. How can we do this nicely in python? Can anyone point me out of making this right in python? Any idea?

CodePudding user response：

You can use glob pattern to match the files: demo2020/p*/csv/Z_*.csv

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)

CodePudding user response：

You can use Glob() function to find files recursively in Python. With glob, we can also use wildcards ("*, ?, [ranges]) apart from exact string search to make path retrieval more simple and convenient.

If you want to match the files: Z_19_xx.csv instead of Z_xx.csv, you can use the following:

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_19_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)