Home > Back-end >  Grouping columns, then deleting values based on null column values
Grouping columns, then deleting values based on null column values

Time:04-01

I am reposting since my last question was poorly worded. I have a table that looks like the following:

 ------ ------ ------ ------ ------------ ---------- 
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
 ------ ------ ------ ------ ------------ ---------- 
|  100 |   10 | A    | 2018 |         12 |          |
|  100 |   11 | B    | 2019 |            | multiple |
|  100 |   12 | C    | 2019 |            | multiple |
|  100 |   13 | D    | 2019 |            |          |
|  200 |   10 | A    | 2018 |            |          |
|  200 |   11 | B    | 2019 |            | multiple |
|  200 |   12 | C    | 2019 |            | multiple |
|  200 |   13 | D    | 2019 |            |          |
 ------ ------ ------ ------ ------------ ---------- 

I am trying to delete the "multiple" strings inside the "Multiple" column where the ID group does not have a Identified value. For example, the first group of ID 1 == 100 contains a not-null Identified value so we can leave the "multiple" values. However, the ID 1 == 200 group has no Identified values, so I would like to remove the "multiple" values that appear in this group, giving us the following dataframe.

 ------ ------ ------ ------ ------------ ---------- 
| ID 1 | ID 2 | Type | Year | Identified | Multiple |
 ------ ------ ------ ------ ------------ ---------- 
|  100 |   10 | A    | 2018 |         12 |          |
|  100 |   11 | B    | 2019 |            | multiple |
|  100 |   12 | C    | 2019 |            | multiple |
|  100 |   13 | D    | 2019 |            |          |
|  200 |   10 | A    | 2018 |            |          |
|  200 |   11 | B    | 2019 |            |          |
|  200 |   12 | C    | 2019 |            |          |
|  200 |   13 | D    | 2019 |            |          |
 ------ ------ ------ ------ ------------ ---------- 

Please let me know if I can rephrase my question.

EDIT: if both Identified and Multiple columns are blank, then leave blank.

CodePudding user response:

Assuming your dataframe can be built with the same types than this one (note for next time how you can provide a sample dataframe as one can't directly copy yours)

df= pd.DataFrame({
    'ID 1': [100]*4 [200]*4,
    'other_cols':range(8),
    'Identified':['12'] ['']*7,
    'Multiple':['','multiple','multiple','']*2
})
print(df)
#    ID 1  other_cols  Identified  Multiple
# 0   100           0          12          
# 1   100           1              multiple
# 2   100           2              multiple
# 3   100           3                      
# 4   200           4                      
# 5   200           5              multiple
# 6   200           6              multiple
# 7   200           7                      

So to do the job, check where Identified if not equal (ne) to blank. groupby.transform with any to get True for all the same ID if one is not blank. Use the reverse (~) of this mask to select the IDs with only blank and assign blank in the other column.

# greate a mask with True where at least one non blank value per ID
mask = df['Identified'].ne('').groupby(df['ID 1']).transform(any)

# reverse the mask and assign blank 
df.loc[~mask, 'Multiple'] = ''

print(df)
#    ID 1  other_cols  Identified  Multiple
# 0   100           0          12          
# 1   100           1              multiple
# 2   100           2              multiple
# 3   100           3                      
# 4   200           4                      
# 5   200           5                      
# 6   200           6                      
# 7   200           7                      
  • Related