How to read and manipulate multiple CSV files using pandas and for-loop?-CodePudding

I want to read a list of CSV files, for example exon_kipan.00001.csv, exon_kipan.00002.csv, exon_kipan.00003.csv, and exon_kipan.00004.csv (24 files in total), and then perform a series of operations using pandas before concatenating the dataframes.

For a single file, I would do:

df= pd.read_csv("exon_kipan.csv", sep="\t", index_col=0, low_memory=False)
df= df[df.columns[::3]]
df= df.T 
del df[df.columns[0]]
df.index = df.index.str.upper()
df= df.sort_index()
df.index = ['-'.join( s.split('-')[:4]) for s in df.index.tolist() ]
df.rename_axis(None, axis=1, inplace=True)

However, now I want to read, manipulate, and concatenate multiple files.

filename = '/work/exon_kipan.{}.csv'
df_dict = {}
exon_clin_list = []
for i in range(1, 25):
    df_dict[i] = pd.read_csv(filename, sep="\t", index_col=0, low_memory=False)
    df_dict[i] = df_dict[i][df_dict[i].columns[::3]]
    df_dict[i] = df_dict[i].T
    del df_dict[i][df_dict[i].columns[0]]
    df_dict[i].index = df_dict[i].index.str.upper()
    df_dict[i] = df_dict[i].sort_index()
    df_dict[i].index = ['-'.join( s.split('-')[:4]) for s in df_dict[i].index.tolist() ]
    df_dict[i].rename_axis(None, axis=1, inplace=True)

    exon_clin_list.append(df_dict[i])

exon_clin = pd.concat(df_list)

My code raised:

FileNotFoundError: [Errno 2] No such file or directory: '/work/exon_kipan.{}.csv'

CodePudding user response：

You have to use format method of str:

filename = '/work/exon_kipan.{:05}.csv'  # <- don't forget to modify here
...
for i in range(1, 25):
    df_dict[i] = pd.read_csv(filename.format(i), ...)

Test:

filename = '/work/exon_kipan.{:05}.csv'
for i in range(1, 25):
    print(filename.format(i))

# Output
/work/exon_kipan.00001.csv
/work/exon_kipan.00002.csv
/work/exon_kipan.00003.csv
/work/exon_kipan.00004.csv
/work/exon_kipan.00005.csv
/work/exon_kipan.00006.csv
/work/exon_kipan.00007.csv
/work/exon_kipan.00008.csv
/work/exon_kipan.00009.csv
/work/exon_kipan.00010.csv
/work/exon_kipan.00011.csv
/work/exon_kipan.00012.csv
/work/exon_kipan.00013.csv
/work/exon_kipan.00014.csv
/work/exon_kipan.00015.csv
/work/exon_kipan.00016.csv
/work/exon_kipan.00017.csv
/work/exon_kipan.00018.csv
/work/exon_kipan.00019.csv
/work/exon_kipan.00020.csv
/work/exon_kipan.00021.csv
/work/exon_kipan.00022.csv
/work/exon_kipan.00023.csv
/work/exon_kipan.00024.csv

CodePudding user response：

The issue is with the way you are constructing the file path. In the following line:

filename = '/work/exon_kipan.{}.csv'

The curly brackets are being treated as literal characters and not being replaced with the value of i.

To fix this, you can use the format method of the str class, like this:

filename = '/work/exon_kipan.{}.csv'.format(i)

This will replace the curly brackets with the value of i.

So, your final code should look like this:

df_dict = {}
exon_clin_list = []
for i in range(1, 25):
    filename = '/work/exon_kipan.{}.csv'.format(i)
    df_dict[i] = pd.read_csv(filename, sep="\t", index_col=0, low_memory=False)
    df_dict[i] = df_dict[i][df_dict[i].columns[::3]]
    df_dict[i] = df_dict[i].T
    del df_dict[i][df_dict[i].columns[0]]
    df_dict[i].index = df_dict[i].index.str.upper()
    df_dict[i] = df_dict[i].sort_index()
    df_dict[i].index = ['-'.join( s.split('-')[:4]) for s in df_dict[i].index.tolist() ]
    df_dict[i].rename_axis(None, axis=1, inplace=True)

    exon_clin_list.append(df_dict[i])

exon_clin = pd.concat(df_list)

CodePudding user response：

may be something like this will work

#write a function to read file do some processing and return a dataframe
def read_file_and_do_some_actions(filename):
    df = pd.read_csv(filename, index_col=None, header=0)
    #############################
    #do some processing
    #############################
    return df


path = r'/home/tester/inputdata/exon_kipan'
all_files = glob.glob(os.path.join(path, "/work/exon_kipan.*.csv"))


#for each file in all_files list, call function read_file_and_do_some_actions and then concatenate all the dataframes into one dataframe
df = pd.concat((read_file_and_do_some_actions(f) for f in all_files), ignore_index=True)