Pandas - import CSV files in folder, change column name if it contains a string, concat into one dat-CodePudding

I have a folder with about 40 CSV files containing data by month. I want to combine this all together, however I have one column in these CSV files that are either denoted as 'implementationstatus' or 'implementation'. When I try to concat using Pandas, obviously this is a problem. I want to basically change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could run a loop for each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected things happening.

Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.

import pandas as pd
import os
import glob

path = 'c:/mydata'

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths),join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation') # I know this doesn't work, but I am trying to demonstrate what I want to do

CodePudding user response：

You can use a combination of header and names parameters from the pd.read_csv function to solve it.

You must pass to names a list containing the names for all columns on the csv files. This will allow you to standardize all names.

From pandas docs: names: array-like, optional List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

CodePudding user response：

If you want to change the column name, please try this:

import glob
import pandas as pd

filenames = glob.glob('c:/mydata/*.csv')
all_data = []

for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
      df = df.rename(columns={'implementationstatus':'implementation'})

    all_data.append(df)
df_all = pd.concat(all_data, axis=0)