Grouping columns, then deleting values based on null column values-CodePudding

I am reposting since my last question was poorly worded. I have a table that looks like the following:

 ------ ------ ------ ------ ------------ ---------- 
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
 ------ ------ ------ ------ ------------ ---------- 
|  100 |   10 | A    | 2018 |         12 |          |
|  100 |   11 | B    | 2019 |            | multiple |
|  100 |   12 | C    | 2019 |            | multiple |
|  100 |   13 | D    | 2019 |            |          |
|  200 |   10 | A    | 2018 |            |          |
|  200 |   11 | B    | 2019 |            | multiple |
|  200 |   12 | C    | 2019 |            | multiple |
|  200 |   13 | D    | 2019 |            |          |
 ------ ------ ------ ------ ------------ ----------

I am trying to delete the "multiple" strings inside the "Multiple" column where the ID group does not have a Identified value. For example, the first group of ID 1 == 100 contains a not-null Identified value so we can leave the "multiple" values. However, the ID 1 == 200 group has no Identified values, so I would like to remove the "multiple" values that appear in this group, giving us the following dataframe.

 ------ ------ ------ ------ ------------ ---------- 
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
 ------ ------ ------ ------ ------------ ---------- 
|  100 |   10 | A    | 2018 |         12 |          |
|  100 |   11 | B    | 2019 |            | multiple |
|  100 |   12 | C    | 2019 |            | multiple |
|  100 |   13 | D    | 2019 |            |          |
|  200 |   10 | A    | 2018 |            |          |
|  200 |   11 | B    | 2019 |            |          |
|  200 |   12 | C    | 2019 |            |          |
|  200 |   13 | D    | 2019 |            |          |
 ------ ------ ------ ------ ------------ ----------

Please let me know if I can rephrase my question.

EDIT: if both Identified and Multiple columns are blank, then leave blank.

CodePudding user response：

Assuming your dataframe can be built with the same types than this one (note for next time how you can provide a sample dataframe as one can't directly copy yours)

df= pd.DataFrame({
    'ID 1': [100]*4 [200]*4,
    'other_cols':range(8),
    'Identified':['12'] ['']*7,
    'Multiple':['','multiple','multiple','']*2
})
print(df)
#    ID 1  other_cols  Identified  Multiple
# 0   100           0          12          
# 1   100           1              multiple
# 2   100           2              multiple
# 3   100           3                      
# 4   200           4                      
# 5   200           5              multiple
# 6   200           6              multiple
# 7   200           7

So to do the job, check where Identified if not equal (ne) to blank. groupby.transform with any to get True for all the same ID if one is not blank. Use the reverse (~) of this mask to select the IDs with only blank and assign blank in the other column.

# greate a mask with True where at least one non blank value per ID
mask = df['Identified'].ne('').groupby(df['ID 1']).transform(any)

# reverse the mask and assign blank 
df.loc[~mask, 'Multiple'] = ''

print(df)
#    ID 1  other_cols  Identified  Multiple
# 0   100           0          12          
# 1   100           1              multiple
# 2   100           2              multiple
# 3   100           3                      
# 4   200           4                      
# 5   200           5                      
# 6   200           6                      
# 7   200           7