I am reposting since my last question was poorly worded. I have a table that looks like the following:
------ ------ ------ ------ ------------ ----------
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
------ ------ ------ ------ ------------ ----------
| 100 | 10 | A | 2018 | 12 | |
| 100 | 11 | B | 2019 | | multiple |
| 100 | 12 | C | 2019 | | multiple |
| 100 | 13 | D | 2019 | | |
| 200 | 10 | A | 2018 | | |
| 200 | 11 | B | 2019 | | multiple |
| 200 | 12 | C | 2019 | | multiple |
| 200 | 13 | D | 2019 | | |
------ ------ ------ ------ ------------ ----------
I am trying to delete the "multiple" strings inside the "Multiple" column where the ID
group does not have a Identified
value. For example, the first group of ID 1 == 100
contains a not-null Identified
value so we can leave the "multiple" values. However, the ID 1 == 200
group has no Identified
values, so I would like to remove the "multiple" values that appear in this group, giving us the following dataframe.
------ ------ ------ ------ ------------ ----------
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
------ ------ ------ ------ ------------ ----------
| 100 | 10 | A | 2018 | 12 | |
| 100 | 11 | B | 2019 | | multiple |
| 100 | 12 | C | 2019 | | multiple |
| 100 | 13 | D | 2019 | | |
| 200 | 10 | A | 2018 | | |
| 200 | 11 | B | 2019 | | |
| 200 | 12 | C | 2019 | | |
| 200 | 13 | D | 2019 | | |
------ ------ ------ ------ ------------ ----------
Please let me know if I can rephrase my question.
EDIT: if both Identified
and Multiple
columns are blank, then leave blank.
CodePudding user response:
Assuming your dataframe can be built with the same types than this one (note for next time how you can provide a sample dataframe as one can't directly copy yours)
df= pd.DataFrame({
'ID 1': [100]*4 [200]*4,
'other_cols':range(8),
'Identified':['12'] ['']*7,
'Multiple':['','multiple','multiple','']*2
})
print(df)
# ID 1 other_cols Identified Multiple
# 0 100 0 12
# 1 100 1 multiple
# 2 100 2 multiple
# 3 100 3
# 4 200 4
# 5 200 5 multiple
# 6 200 6 multiple
# 7 200 7
So to do the job, check where Identified if not equal (ne
) to blank. groupby.transform
with any
to get True for all the same ID if one is not blank. Use the reverse (~
) of this mask to select the IDs with only blank and assign blank in the other column.
# greate a mask with True where at least one non blank value per ID
mask = df['Identified'].ne('').groupby(df['ID 1']).transform(any)
# reverse the mask and assign blank
df.loc[~mask, 'Multiple'] = ''
print(df)
# ID 1 other_cols Identified Multiple
# 0 100 0 12
# 1 100 1 multiple
# 2 100 2 multiple
# 3 100 3
# 4 200 4
# 5 200 5
# 6 200 6
# 7 200 7