Suppose I have the following dataframe:
Id title field parent_field label
1 Hardwareentwickler (m/w) 4036 2172 Hardware-Entwicklung, Engineering
1 Hardwareentwickler (m/w) 2172 NaN IT, EDV
1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
1 Hardwareentwickler (m/w) 1000 NaN Consultant
1 Hardwareentwickler (m/w) 5000 1000 IT Consultant
2 Accountant 4321 NaN Finanz
2 Accountant 1234 NaN Kostenrechner
What i wanted to do is to delete all records, where field
is present in column parent_field
(grouped by ID/title).
At the end I want to get the following result:
Id title field parent_field label
1 Hardwareentwickler (m/w) 4036 2172 Hardware-Entwicklung, Engineering
1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
1 Hardwareentwickler (m/w) 5000 1000 IT Consultant
2 Accountant 4321 NaN Finanz
2 Accountant 1234 NaN Kostenrechner
where rows
Id title field parent_field label
1 Hardwareentwickler (m/w) 2172 NaN IT, EDV
1 Hardwareentwickler (m/w) 1000 NaN Consultant
are deleted, because field
values are presented in parent_field
of other rows(grouped by id)
CodePudding user response:
Easy solution
Group the dataframe by id
and title
and apply
a lambda function which checks the occurrence of field
in parent_field
to create a boolean mask
.
mask = df.groupby(['Id', 'title'], group_keys=False)\
.apply(lambda x: x['field'].isin(x['parent_field']))
>>> df[~mask]
Id title field parent_field label
0 1 Hardwareentwickler (m/w) 4036 2172.0 Hardware-Entwicklung, Engineering
2 1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
4 1 Hardwareentwickler (m/w) 5000 1000.0 IT Consultant
5 2 Accountant 4321 NaN Finanz
6 2 Accountant 1234 NaN Kostenrechner
CodePudding user response:
A less one liner but with a classical join approach:
df_copy = df.copy()
df_copy ['marker'] = 1
# left join to add a marker where the macthed condition.
joined = pd.merge(df, df_copy[['parent_field','marker']], left_on='field', rigth_on='parent_field', how='left')
# drop where is marker == 1 so that field and parent field are matched.
joined[joined.marker.isna()]
the advantage here is you have a marker, so you can do an analysis based on the discarded on the same df.
CodePudding user response:
If I understand correctly, let's suppose your dataframe is called 'df'
df[~df['field'].isin(list(df['parent_field']))]