Remove rows based on value in other column: pandas groupby-CodePudding

Suppose I have the following dataframe:

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    4036     2172           Hardware-Entwicklung, Engineering
1   Hardwareentwickler (m/w)    2172     NaN            IT, EDV
1   Hardwareentwickler (m/w)    3081     NaN            Technik, Ingenieurwesen
1   Hardwareentwickler (m/w)    1000     NaN            Consultant
1   Hardwareentwickler (m/w)    5000    1000            IT Consultant
2   Accountant                  4321     NaN            Finanz
2   Accountant                  1234     NaN            Kostenrechner

What i wanted to do is to delete all records, where field is present in column parent_field(grouped by ID/title).

At the end I want to get the following result:

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    4036     2172           Hardware-Entwicklung, Engineering
1   Hardwareentwickler (m/w)    3081     NaN            Technik, Ingenieurwesen
1   Hardwareentwickler (m/w)    5000    1000            IT Consultant
2   Accountant                  4321     NaN            Finanz
2   Accountant                  1234     NaN            Kostenrechner

where rows

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    2172     NaN            IT, EDV
1   Hardwareentwickler (m/w)    1000     NaN            Consultant

are deleted, because field values are presented in parent_field of other rows(grouped by id)

CodePudding user response：

Easy solution

Group the dataframe by id and title and apply a lambda function which checks the occurrence of field in parent_field to create a boolean mask.

mask = df.groupby(['Id', 'title'], group_keys=False)\
         .apply(lambda x: x['field'].isin(x['parent_field']))

>>> df[~mask]

   Id                     title  field  parent_field                              label
0   1  Hardwareentwickler (m/w)   4036        2172.0  Hardware-Entwicklung, Engineering
2   1  Hardwareentwickler (m/w)   3081           NaN            Technik, Ingenieurwesen
4   1  Hardwareentwickler (m/w)   5000        1000.0                      IT Consultant
5   2                Accountant   4321           NaN                             Finanz
6   2                Accountant   1234           NaN                      Kostenrechner

CodePudding user response：

A less one liner but with a classical join approach:

df_copy = df.copy()
df_copy ['marker'] = 1

# left join to add a marker where the macthed condition.
joined = pd.merge(df, df_copy[['parent_field','marker']], left_on='field', rigth_on='parent_field', how='left')

# drop where is marker == 1 so that field and parent field are matched.
joined[joined.marker.isna()]

the advantage here is you have a marker, so you can do an analysis based on the discarded on the same df.

CodePudding user response：

If I understand correctly, let's suppose your dataframe is called 'df'

df[~df['field'].isin(list(df['parent_field']))]