Home > Software engineering >  How to aggregate DataFrame to stay rows with the highest date and add new column in Python Pandas?
How to aggregate DataFrame to stay rows with the highest date and add new column in Python Pandas?

Time:06-02

I have DataFrame in Python Pandas like below ("date_col" is in "datetime64" format):

ID  | date_col   | purchase
----|------------|-------
111 | 2019-01-05 | apple
111 | 2019-05-22 | onion
222 | 2020-11-04 | banana
333 | 2020-04-19 | orange

I need to aggregate above table in the following way:

  1. add column "col1" with number of purchases which was made by client ("ID")
  2. If some client ("ID") is duplicated - stay only one row with the highest date

So as a result I need something like below:

ID  | date_col   | purchase | col1
----|------------|----------|-----
111 | 2019-05-22 | onion    | 2
222 | 2020-11-04 | banana   | 1
333 | 2020-04-19 | orange   | 1

CodePudding user response:

Assuming the dataframe is sorted on date_col column, you can use groupby:

g = df.groupby('ID', as_index=False)
g.last().merge(g.size())

    ID    date_col purchase  size
0  111  2019-05-22    onion     2
1  222  2020-11-04   banana     1
2  333  2020-04-19   orange     1

CodePudding user response:

here is one way:

df['col1'] = df.groupby('ID')['ID'].transform('count')
df = df.sort_values('date_col').groupby('ID').tail(1)

output:

>>
    ID    date_col purchase  col1
1  111  2019-05-22    onion     2
3  333  2020-04-19   orange     1
2  222  2020-11-04   banana     1

CodePudding user response:

You can try create a new count column using groupby.transform and get the max date by selecting with groupby.idmax

df['date_col'] = pd.to_datetime(df['date_col'])
df = (df.assign(col1=df.groupby('ID')['purchase'].transform('count'))
      .loc[lambda df: df.groupby('ID')['date_col'].idxmax()])
print(df)

    ID   date_col purchase  col1
1  111 2019-05-22    onion     2
2  222 2020-11-04   banana     1
3  333 2020-04-19   orange     1

CodePudding user response:

df['col1'] = df.groupby('ID')['ID'].transform('count')
df.sort_values('date_col').drop_duplicates('ID',keep='last')
  • Related