Home > Back-end >  How to merge multiple CSV files into one data frame with datetime index
How to merge multiple CSV files into one data frame with datetime index

Time:10-12

I have multiple CSV files of time series data - each file is for each day of the month. Every file has one datetime column followed by multiple other columns. I want to merge (not pd.concat) all the files into one data frame. How can I loop to select the files from the folder and merge it?

Basically, what I am trying to do is, read the first 2 CSV files and merge it into a dataframe ('df_merged'), then pick the third csv file and merge it with the dataframe created and then pick the fourth csv file and merge it to the dataframe. Could you please tell me how I can do it using a loop? Note : The first merge should not be a outer merge.

What I have so far is without the for loop. Code for just 4 files

## Reading the csv files
df_file_one = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-01.csv")
df_file_two = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-02.csv")
df_file_three = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-03.csv")
df_file_four = pd.read_csv("C:/Users/Desktop/2022/prices_Jan_2022-01-04.csv")

#Merging the files
df_merged = pd.merge(df_file_one, df_file_two, left_index=True, right_index=True) # The first merge should not be outer merge.
df_merged = pd.merge(df_merged, df_file_three, left_index=True, right_index=True,how='outer')
df_merged = pd.merge(df_merged, df_file_four, left_index=True, right_index=True, how = 'outer')

CodePudding user response:

If the order of files doesn't matter then we can loop through the files like this:

from glob import glob
import pandas as pd

files = glob("C:/Users/Desktop/2022/prices_*.csv")
df = pd.merge(
    pd.read_csv(files.pop()),
    pd.read_csv(files.pop()),
    left_index=True, 
    right_index=True
)

while files:
    df = pd.merge(
        df, 
        pd.read_csv(files.pop()), 
        left_index=True, 
        right_index=True,
        how='outer'
    )

If the order is meaningful but all you have is the files for a single month named like prices_Jan_2022-01-02.csv then we can use the natural lexicographic order like this:

files = sorted(glob(...), reverse=True)

If more then one month is presented, but the name pattern is still the same, then we can use the yyyy-mm-dd pattern at the end of names as a key to sort files:

files = sorted(glob(...), key=lambda f: f[-14, -4], reverse=True)

All the other code stay the same.

  • Related