I am working on a dataframe that looks like the following:
------- ---- ------ --
| Value | ID | Date | |
------- ---- ------ --
| 1 | 5 | 2012 | |
| 1 | 5 | 2013 | |
| 0 | 12 | 2017 | |
| 0 | 12 | 2022 | |
| 1 | 27 | 2005 | |
| 1 | 27 | 2011 | |
------- ---- ------ --
Only using rows with "Value" == "1" ("value is boolean), I would like to group the dataframe by ID and input the string "latest" to new (blank) column, giving the following output:
------- ---- ------ --------
| Value | ID | Date | Latest |
------- ---- ------ --------
| 1 | 5 | 2012 | |
| 1 | 5 | 2013 | Latest |
| 0 | 12 | 2017 | |
| 0 | 12 | 2022 | |
| 1 | 27 | 2005 | |
| 1 | 27 | 2011 | Latest |
------- ---- ------ --------
The syntax of pandas is throwing me off as I am fairly new to Python.
In R I suppose I would be trying something like
df %>% select(Value == "1") %>% group_by(ID) %>% select(max(Date)
but I am not sure of the syntax in Pandas...I am trying to first select the subset of rows that meets the criteria "value == 1" by using
q = df.query('Value == 1')
my_query_index = q.index
my_query_index
This returns the index of all the rows but I am not sure how to incorporate this into the dataframe before grouping and filtering by max(date).
All help appreciated. Thank you.
CodePudding user response:
You can do this:
latest = df.query('Value==1').groupby("ID").max("year").assign(Latest = "Latest")
pd.merge(df,latest,how="outer")
Value ID Date Latest
0 1 5 2012 NaN
1 1 5 2013 Latest
2 0 12 2017 NaN
3 0 12 2022 NaN
4 1 27 2005 NaN
5 1 27 2011 Latest
CodePudding user response:
- Sort by
'ID'
then by'Date'
- Use
duplicated(keep='last')
to identify the last item in each group loc
to assign in the right spot
df = df.sort_values(['ID', 'Date'])
mask1 = df.Value.eq(1)
mask2 = ~df.ID.duplicated(keep='last')
df.loc[mask1 & mask2, 'Latest'] = 'Latest'
df
Value ID Date Latest
0 1 5 2012 NaN
1 1 5 2013 Latest
2 0 12 2017 NaN
3 0 12 2022 NaN
4 1 27 2005 NaN
5 1 27 2011 Latest
CodePudding user response:
One option is to groupby, using transform
to get the max, then use a conditional statement with np.where to get the output:
max_values = df.groupby("ID").Date.transform("max")
df.assign(Latest=np.where(df.Date.eq(max_values) & df.Value.eq(1), "Latest", ""))
Value ID Date Latest
0 1 5 2012
1 1 5 2013 Latest
2 0 12 2017
3 0 12 2022
4 1 27 2005
5 1 27 2011 Latest