Create unique dataframe-CodePudding

For each city (here : NY, Chicago) I have 3 csv files with 2 columns like this :

file 1 : ID, 20101201

file 2 : ID, 20101202

file 3 : ID, 20101203

Each file name is like this : "Chicago_ID_20101201.csv"

The 2nd column name is representing a date in this format YYYYMMDD.

I want to create a unique file for each city with a dataframe containing 4 columns: ID and the 3 other columns referring to each date in these files.

cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]

for city in cities:
    df = pd.DataFrame()
    for date in dates:

        file_name = f'{city}_ID_{date}.csv' 

        df[date] = pd.read_csv('[...]')

        print(df[date])

Plus i would like to know if there is a way to avoid giving the list of dates in the case that i would want to do it for an entire month.

Thanks

CodePudding user response：

You can read each dataframe, store them in a list, set the ID as index and concatenate them to get one ID column and three other date columns:

cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]

for city in cities:
    df_list=[]
    for date in dates:
        file_name = f'{city}_ID_{date}.csv' 
        df_list.append(pd.read_csv(file_name, index_col='ID'))
    df = pd.concat(df_list, axis=1)
    print(f'This is the dataframe for {city}', df)

For your second question, you can create any date range, using pandas daterange:

pd.date_range(start="20101101", end="20101201", freq='D').strftime('%Y%m%d')

Output:

Index(['20101101', '20101102', '20101103', '20101104', '20101105', '20101106',
       '20101107', '20101108', '20101109', '20101110', '20101111', '20101112',
       '20101113', '20101114', '20101115', '20101116', '20101117', '20101118',
       '20101119', '20101120', '20101121', '20101122', '20101123', '20101124',
       '20101125', '20101126', '20101127', '20101128', '20101129', '20101130',
       '20101201'],
      dtype='object')

CodePudding user response：

Use pathlib:

import pandas as pd
import pathlib
import collections

DATA_DIR = pathlib.Path('.')  # the path to your csv files
cities = collections.defaultdict(list)

# Collect data
for file in DATA_DIR.glob('*_ID_*.csv'):
    city = file.stem.split('_')[0]
    df = pd.read_csv(file, dtype=object).set_index('ID')
    cities[city].append(df)

# Build city files
for city in cities:
    df = pd.concat(cities[city], axis=1).reset_index()
    df.to_excel(f'{city}.xlsx', index=False)

Now you have two files Chicago.xlsx and NY.xlsx.