Home > OS >  Optimize groupby->pd.DataFrame->.reset_index->.rename(columns)
Optimize groupby->pd.DataFrame->.reset_index->.rename(columns)

Time:11-23

I am very new at this, so bear with me please.

I do this:

example=
index  Date         Column_1   Column_2
1      2019-06-17   Car        Red
2      2019-08-10   Car        Yellow
3      2019-08-15   Truck      Yellow
4      2020-08-12   Truck      Yellow


data = example.groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
df1=pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1','Date'])
df2 = df2.rename(columns={'Date':'interval_year','Column_2':'Sum'})

In order to get this:

df2=
index  interval_year   Column_1   Sum
1      2019-12-31      Car        2
2      2019-12-31      Truck      1
3      2020-12-31      Car        1

I get the expected result but my code gives me a lot of headache. I create 2 additional DataFrames and sometimes, when I get 2 columns with same name (one as index), the code becomes even more complicated.

Any solution how to make this more efficient?

Thank you

CodePudding user response:

To reduce visible noise and to make your code more performant, I suggest you to do method chaining.

Try this :

df2 = (
        example
           .assign(Date= pd.to_datetime(df["Date"]))
           .groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
           .reset_index()
           .rename(columns={'Date':'interval_year','Column_2':'Sum'})
      )

# Output :

print(df2)

  interval_year Column_1  Sum
0    2019-12-31      Car    2
1    2019-12-31    Truck    1
2    2020-12-31    Truck    1

CodePudding user response:

You can use pd.NamedAgg to do some renaming for you in the groupby like this:

example.groupby([pd.Grouper(key='Date', freq='Y'),'Column_1']).agg(sum=('Date','nunique')).reset_index()

Output:

        Date Column_1  sum
0 2019-12-31      Car    2
1 2019-12-31    Truck    1
2 2020-12-31    Truck    1
  • Related