Home > OS >  Select rows for which condition applies for multiple rows in long format with groupby pandas
Select rows for which condition applies for multiple rows in long format with groupby pandas

Time:09-20

I have a pandas dataframe that looks like this:

import pandas as pd
foo = pd.DataFrame({'id_p': [1,1,2,2,3,3,3,4,4], 
              'id_d_b': [True, True, False, True, True, True,False,False,False], 
              'id_d_i': [False, False, True, False,False,False,True,True,True]})
foo

id_p    id_d_b  id_d_i
0   1   True    False
1   1   True    False
2   2   False   True
3   2   True    False
4   3   True    False
5   3   True    False
6   3   False   True
7   4   False   True
8   4   False   True

I would like to select the id_ps which has have at least one True at id_d_b and at least one True at id_d_i

I tried this

foo['id_d_b'] = foo['id_d_b'].astype(int)
foo['id_d_i'] = foo['id_d_i'].astype(int)

foo['has_id_d_b'] = foo.groupby('id_p')['id_d_b'].transform('max')
foo['has_id_d_i'] = foo.groupby('id_p')['id_d_i'].transform('max')

foo['result'] = foo['has_id_d_b']   foo['has_id_d_i'] # if this is >1 then that specific id_p has at least one id_d_b and id_d_i
foo['result'] = foo.eval('result > 1')
foo

id_p    id_d_b  id_d_i  has_id_d_b  has_id_d_i  result
0   1   1   0   1   0   False
1   1   1   0   1   0   False
2   2   0   1   1   1   True
3   2   1   0   1   1   True
4   3   1   0   1   1   True
5   3   1   0   1   1   True
6   3   0   1   1   1   True
7   4   0   1   0   1   False
8   4   0   1   0   1   False

which gives the correct result, but I am looking for a "one-liner" solution

CodePudding user response:

You can use: groupby.any to check if any value per group is True, (eventually on a subset of columns), then ensure all columns are True with all. With isin you can get the relevant indices (for slicing or indexing if needed).

m = foo.groupby('id_p').any().all(1)
# if other columns
# m = foo.groupby('id_p')[['id_d_b', 'id_d_i']].any().all(1)

foo['result'] = foo['id_p'].isin(m[m].index)

For the indices:

print(m[m].index)
# Int64Index([2, 3], dtype='int64', name='id_p')

or, using groupby.transform, which will be more expensive:

foo['result'] = foo.groupby('id_p').transform('any').all(1)
# if other columns
# foo['result'] = this foo.groupby('id_p')[['id_d_b', 'id_d_i']].transform('any').all(1)

output:

   id_p  id_d_b  id_d_i  result
0     1    True   False   False
1     1    True   False   False
2     2   False    True    True
3     2    True   False    True
4     3    True   False    True
5     3    True   False    True
6     3   False    True    True
7     4   False    True   False
8     4   False    True   False

CodePudding user response:

Use Series.map with Series created by aggregate GroupBy.any with DataFrame.all:

foo['result'] = foo['id_p'].map(foo.groupby('id_p')[['id_d_b', 'id_d_i']].any().all(axis=1))
print (foo)
   id_p  id_d_b  id_d_i  result
0     1    True   False   False
1     1    True   False   False
2     2   False    True    True
3     2    True   False    True
4     3    True   False    True
5     3    True   False    True
6     3   False    True    True
7     4   False    True   False
8     4   False    True   False
  • Related