I want my loop to go through 3 dfs with the same format, and do 3 things:
- Transform "Date" object column to datetime.
- Use loc to filter only the dates after year 2021.
- Set date as index.
Say I have 3 dataframes named df1, df2 and df3, they are all shaped like this:
Index | Date | Information |
---|---|---|
1 | 2020-01-01 | Blablabla |
2 | 2021-01-01 | Blablabla |
3 | 2022-01-01 | Blablabla |
After running my code, I want all dfs to be shaped like this:
Date | Information |
---|---|
2021-01-01 | Blablabla |
2022-01-01 | Blablabla |
The code I'm running:
dfs = [df1, df2, df3]
for i in dfs:
i['Date'] = pd.to_datetime(i['Date'])
i = i.loc[i['Date'].dt.year >= 2021]
i.set_index('Date', inplace=True)
I can't seem to get it to work. The first part of the loop is working, it transformms to datetime, but the filter is not working neither the indexing.
CodePudding user response:
Let's this using a few tricks of pandas and python:
df1, df2, df3 = [i.assign(Date=pd.to_datetime(i['Date'])).set_index('Date').loc['2021':]
for i in [df1, df2, df3]]
- Unpacking
pd.Dataframe.assign
- With datetime in index you can use
partial string indexing
with loc - List comprehension
Output:
print(f'{df1=}\n\n{df2=}\n\n{df3=}')
df1= Index Information
Date
2021-01-01 2 Blablabla
2022-01-01 3 Blablabla
df2= Index Information
Date
2021-01-01 2 Blablabla
2022-01-01 3 Blablabla
df3= Index Information
Date
2021-01-01 2 Blablabla
2022-01-01 3 Blablabla
Fixing your for loop can be done like this:
dfs = [df1,df2,df3]
for i in dfs:
i.assign(Date=pd.to_datetime(i['Date']), inplace=True)
i.query('Date >= "2021-01-01"', inplace=True)
i.set_index('Date', inplace=True)
df1
CodePudding user response:
step1: transform date column to 'real date'
d1['date'] = pd.to_datetime(d1['date'])
d2['date'] = pd.to_datetime(d2['date'])
d3['date'] = pd.to_datetime(d3['date'])
step2: select >= 2021 (year) rows
d1_s = d1.loc[d1['date'].dt.year >= 2021,:]
d2_s = d2.loc[d2['date'].dt.year >= 2021,:]
d3_s = d2.loc[d2['date'].dt.year >= 2021,:]
step3: pd.concat(d1_s,d2_s,d3_s) named a new dataframe 'data_all'
data_all = pd.concat([d1_s,d2_s,d3_s],ignore_index=True)
step4:set 'date' column as index
data_all.set_index(['date'], inplace=True)