Optimize groupby->pd.DataFrame->.reset_index->.rename(columns)-CodePudding

I am very new at this, so bear with me please.

I do this:

example=
index  Date         Column_1   Column_2
1      2019-06-17   Car        Red
2      2019-08-10   Car        Yellow
3      2019-08-15   Truck      Yellow
4      2020-08-12   Truck      Yellow


data = example.groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
df1=pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1','Date'])
df2 = df2.rename(columns={'Date':'interval_year','Column_2':'Sum'})

In order to get this:

df2=
index  interval_year   Column_1   Sum
1      2019-12-31      Car        2
2      2019-12-31      Truck      1
3      2020-12-31      Car        1

I get the expected result but my code gives me a lot of headache. I create 2 additional DataFrames and sometimes, when I get 2 columns with same name (one as index), the code becomes even more complicated.

Any solution how to make this more efficient?

Thank you

CodePudding user response：

To reduce visible noise and to make your code more performant, I suggest you to do method chaining.

Try this :

df2 = (
        example
           .assign(Date= pd.to_datetime(df["Date"]))
           .groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
           .reset_index()
           .rename(columns={'Date':'interval_year','Column_2':'Sum'})
      )

# Output :

print(df2)

  interval_year Column_1  Sum
0    2019-12-31      Car    2
1    2019-12-31    Truck    1
2    2020-12-31    Truck    1

CodePudding user response：

You can use pd.NamedAgg to do some renaming for you in the groupby like this:

example.groupby([pd.Grouper(key='Date', freq='Y'),'Column_1']).agg(sum=('Date','nunique')).reset_index()

Output:

        Date Column_1  sum
0 2019-12-31      Car    2
1 2019-12-31    Truck    1
2 2020-12-31    Truck    1