For each city (here : NY, Chicago) I have 3 csv files with 2 columns like this :
file 1 : ID, 20101201
file 2 : ID, 20101202
file 3 : ID, 20101203
Each file name is like this : "Chicago_ID_20101201.csv"
The 2nd column name is representing a date in this format YYYYMMDD.
I want to create a unique file for each city with a dataframe containing 4 columns: ID and the 3 other columns referring to each date in these files.
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
df = pd.DataFrame()
for date in dates:
file_name = f'{city}_ID_{date}.csv'
df[date] = pd.read_csv('[...]')
print(df[date])
Plus i would like to know if there is a way to avoid giving the list of dates in the case that i would want to do it for an entire month.
Thanks
CodePudding user response:
You can read each dataframe, store them in a list, set the ID as index and concatenate them to get one ID column and three other date columns:
cities = ["NY","Chicago"]
dates = ["20101201", "20101202","20101203"]
for city in cities:
df_list=[]
for date in dates:
file_name = f'{city}_ID_{date}.csv'
df_list.append(pd.read_csv(file_name, index_col='ID'))
df = pd.concat(df_list, axis=1)
print(f'This is the dataframe for {city}', df)
For your second question, you can create any date range, using pandas daterange:
pd.date_range(start="20101101", end="20101201", freq='D').strftime('%Y%m%d')
Output:
Index(['20101101', '20101102', '20101103', '20101104', '20101105', '20101106',
'20101107', '20101108', '20101109', '20101110', '20101111', '20101112',
'20101113', '20101114', '20101115', '20101116', '20101117', '20101118',
'20101119', '20101120', '20101121', '20101122', '20101123', '20101124',
'20101125', '20101126', '20101127', '20101128', '20101129', '20101130',
'20101201'],
dtype='object')
CodePudding user response:
Use pathlib
:
import pandas as pd
import pathlib
import collections
DATA_DIR = pathlib.Path('.') # the path to your csv files
cities = collections.defaultdict(list)
# Collect data
for file in DATA_DIR.glob('*_ID_*.csv'):
city = file.stem.split('_')[0]
df = pd.read_csv(file, dtype=object).set_index('ID')
cities[city].append(df)
# Build city files
for city in cities:
df = pd.concat(cities[city], axis=1).reset_index()
df.to_excel(f'{city}.xlsx', index=False)
Now you have two files Chicago.xlsx and NY.xlsx.