Updating grouped data based on date-CodePudding

I have a dataframe that looks like the following:

df = pd.DataFrame({'ID': ['1', '1', '1', '2', '2', '2'], 
                   'ID2': ['25', '29', '56', '562', '92', '170'],
                   'Origin': ['2005', '2010', '2020', '1995', '1999', '2007'],
                   'Status' : ['Done', 'Unfinished', 'Done', 'Done', 'Done', 'Unfinished']
                  })
df


 ---- ----- -------- ------------ 
| ID | ID2 | Origin |   Status   |
 ---- ----- -------- ------------ 
|  1 |  25 |   2005 | Done       |
|  1 |  29 |   2010 | Unfinished |
|  1 |  56 |   2020 | Done       |
|  2 | 562 |   1995 | Done       |
|  2 |  92 |   1999 | Done       |
|  2 | 170 |   2007 | Unfinished |
 ---- ----- -------- ------------

I am trying to separate into groups based on ID and update the Status column from Unfinished to Done if there is another row in the same group with a newer Done value. Essentially, if the Unfinished row is the most recent within the group, then leave the same. If Unfinished is not the most recent, change to Done.

Sample output below:


 ---- ----- -------- ------------ ---------------- 
| ID | ID2 | Origin |   Status   | Updated_Status |
 ---- ----- -------- ------------ ---------------- 
|  1 |  25 |   2005 | Done       |                |
|  1 |  29 |   2010 | Unfinished | Done           |
|  1 |  56 |   2020 | Done       |                |
|  2 | 562 |   1995 | Done       |                |
|  2 |  92 |   1999 | Done       |                |
|  2 | 170 |   2007 | Unfinished |                |
 ---- ----- -------- ------------ ----------------

After I finish df.groupby('ID') I know I have to try some variation of df.loc[(df['Status'] == 'Unfinished') & (df['Origin'] > today.year)] but I can't think of the proper logic or syntax to properly adjust this dataframe.

All help appreciated.

CodePudding user response：

I think you can test if shifted values per groups by

Must add this to the dataframe,,

CodePudding user response：

It is as simple as

mask = (df.groupby('ID')['Status'].transform('last') == 'Done') & (df.Status == 'Unfinished')
df['Updated_Status'] = np.where(mask, 'Done', np.nan)

The last row in each group corresponds to the latest entry. Use that to transform for each Status. Then, use the condition for updating the status.