Home > Software engineering >  Create an indicator column if a column contains many string values in pandas
Create an indicator column if a column contains many string values in pandas

Time:02-02

I have a pandas dataframe that looks like this:

import pandas as pd
pd.DataFrame({'id': [1,1,1,2,2,3,3,3],
              'col': ['a','a','a','a','b','c','b','a']})

id  col
0   1   a
1   1   a
2   1   a
3   2   a
4   2   b
5   3   c
6   3   b
7   3   a

I would like to create an indicator column which will tell me, if an id has both "a" and "b" in the col

The output should look like this:

pd.DataFrame({'id': [1,1,1,2,2,3,3,3],
                  'col': ['a','a','a','a','b','c','b','a'], 
'indicator': [0,0,0,1,1,1,1,1]})


id  col indicator
0   1   a   0
1   1   a   0
2   1   a   0
3   2   a   1
4   2   b   1
5   3   c   1
6   3   b   1
7   3   a   1

How can I do that in pandas ?

CodePudding user response:

You can compare sets per groups in lambda function in GroupBy.transform:

df['indicator'] = (df.groupby('id')['col']
                     .transform(lambda x: set(x) >= {'a','b'})
                     .astype(int))
print (df)
   id col  indicator
0   1   a          0
1   1   a          0
2   1   a          0
3   2   a          1
4   2   b          1
5   3   c          1
6   3   b          1
7   3   a          1

Or get same values in both filtered id and mapping:

both = np.intersect1d(df.loc[df['col'].eq('a'), 'id'],
                      df.loc[df['col'].eq('b'), 'id'])

df['indicator'] = df['id'].isin(both).astype(int)

Or mapping compared sets:

df['indicator'] = (df['id'].map(df.groupby('id')['col']
                                  .agg(set).ge({'a','b'}).astype(int)))

CodePudding user response:

Use set operations: groupby.transform with set.issubset

df['indicator'] = (df.groupby("id")['col']
                     .transform({'a', 'b'}.issubset)
                     .astype(int)
                   )

Or:

df['indicator'] = (df.groupby("id")['col']
                     .transform(lambda g: int({'a', 'b'}.issubset(g)))
                   )

Alternative approach, use a crosstab with reindex and aggregation with all:

s = pd.crosstab(df['id'], df['col']).reindex(columns=['a', 'b']).all(axis=1)

df['indicator'] = df['id'].isin(s.index[s]).astype(int)

Output:

   id col  indicator
0   1   a          0
1   1   a          0
2   1   a          0
3   2   a          1
4   2   b          1
5   3   c          1
6   3   b          1
7   3   a          1
  • Related