Home > Software engineering >  Groupby and filter by max value in pandas
Groupby and filter by max value in pandas

Time:03-24

I am working on a dataframe that looks like the following:

 ------- ---- ------ -- 
| Value | ID | Date |  |
 ------- ---- ------ -- 
|     1 |  5 | 2012 |  |
|     1 |  5 | 2013 |  |
|     0 | 12 | 2017 |  |
|     0 | 12 | 2022 |  |
|     1 | 27 | 2005 |  |
|     1 | 27 | 2011 |  |
 ------- ---- ------ -- 

Only using rows with "Value" == "1" ("value is boolean), I would like to group the dataframe by ID and input the string "latest" to new (blank) column, giving the following output:

 ------- ---- ------ -------- 
| Value | ID | Date | Latest |
 ------- ---- ------ -------- 
|     1 |  5 | 2012 |        |
|     1 |  5 | 2013 | Latest |
|     0 | 12 | 2017 |        |
|     0 | 12 | 2022 |        |
|     1 | 27 | 2005 |        |
|     1 | 27 | 2011 | Latest |
 ------- ---- ------ -------- 

The syntax of pandas is throwing me off as I am fairly new to Python.

In R I suppose I would be trying something like

df %>% select(Value == "1") %>% group_by(ID) %>% select(max(Date)

but I am not sure of the syntax in Pandas...I am trying to first select the subset of rows that meets the criteria "value == 1" by using

q = df.query('Value == 1')
my_query_index = q.index
my_query_index

This returns the index of all the rows but I am not sure how to incorporate this into the dataframe before grouping and filtering by max(date).

All help appreciated. Thank you.

CodePudding user response:

You can do this:

latest = df.query('Value==1').groupby("ID").max("year").assign(Latest = "Latest")
pd.merge(df,latest,how="outer")

   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest

CodePudding user response:

  1. Sort by 'ID' then by 'Date'
  2. Use duplicated(keep='last') to identify the last item in each group
  3. loc to assign in the right spot

df = df.sort_values(['ID', 'Date'])
mask1 = df.Value.eq(1)
mask2 = ~df.ID.duplicated(keep='last')

df.loc[mask1 & mask2, 'Latest'] = 'Latest'

df

   Value  ID  Date  Latest
0      1   5  2012     NaN
1      1   5  2013  Latest
2      0  12  2017     NaN
3      0  12  2022     NaN
4      1  27  2005     NaN
5      1  27  2011  Latest

CodePudding user response:

One option is to groupby, using transform to get the max, then use a conditional statement with np.where to get the output:

max_values = df.groupby("ID").Date.transform("max")
df.assign(Latest=np.where(df.Date.eq(max_values) & df.Value.eq(1), "Latest", ""))

   Value  ID  Date  Latest
0      1   5  2012        
1      1   5  2013  Latest
2      0  12  2017        
3      0  12  2022        
4      1  27  2005        
5      1  27  2011  Latest
  • Related