I am very new at this, so bear with me please.
I do this:
example=
index Date Column_1 Column_2
1 2019-06-17 Car Red
2 2019-08-10 Car Yellow
3 2019-08-15 Truck Yellow
4 2020-08-12 Truck Yellow
data = example.groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
df1=pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1','Date'])
df2 = df2.rename(columns={'Date':'interval_year','Column_2':'Sum'})
In order to get this:
df2=
index interval_year Column_1 Sum
1 2019-12-31 Car 2
2 2019-12-31 Truck 1
3 2020-12-31 Car 1
I get the expected result but my code gives me a lot of headache. I create 2 additional DataFrames and sometimes, when I get 2 columns with same name (one as index), the code becomes even more complicated.
Any solution how to make this more efficient?
Thank you
CodePudding user response:
To reduce visible noise and to make your code more performant, I suggest you to do method chaining.
Try this :
df2 = (
example
.assign(Date= pd.to_datetime(df["Date"]))
.groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
.reset_index()
.rename(columns={'Date':'interval_year','Column_2':'Sum'})
)
# Output :
print(df2)
interval_year Column_1 Sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1
CodePudding user response:
You can use pd.NamedAgg to do some renaming for you in the groupby like this:
example.groupby([pd.Grouper(key='Date', freq='Y'),'Column_1']).agg(sum=('Date','nunique')).reset_index()
Output:
Date Column_1 sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1