Home > Software design >  flag rows in a dataframe conditionally on values of other rows
flag rows in a dataframe conditionally on values of other rows

Time:06-02

Consider the dataframe df_in = pd.DataFrame({'id': [0,1,2,2,3,4,4,4], 'channel': [0,1,0,1,3,2,3,1]})

> df_in

   id  channel
0   0        0
1   1        1
2   2        0
3   2        1
4   3        3
5   4        2
6   4        3
7   4        1

And a list of 2-tuples allowed_channel_couples = ((0,1),(2,3)).

I'd like to obtain a new dataframe df_out as:

   id  channel  flag
0   0        0     0
1   1        1     0
2   2        0     1
3   2        1     1
4   3        3     0
5   4        2     1
6   4        3     1
7   4        1     0

Each rows in df_out equals the rows in df_in. However, df_out contains a new column, named flag. The flag value of a row in df_out is equal 1 only if another row with same id exists whose channel value 'matches' the channel value of the first row, according to allowed_channel_couples.

Hence consider the row with index 5 of df_in. This row has id equal 4 and channel equal 2. Since the row at index 6 has id equal 4 and channel value equal 3, and (2,3) is in allowed_channel, both the row with index 5 and the row with index 6 are marked with the flag 1. On the other hand, the row with index 7 of df_in has id equal 4 and channel equal 1. The allowed couple for flagging in this case is (0,1). No other rows with id equal 4 and channel equal 0 exists, hence the row with index 7 is marked with a 0.

I'm having troubles in solving this problem avoiding python's slow loops. Any idea on how I can obtain the flags column efficiently?

Ideas:

  • I've tried working with the agg method. Hence I'm able to obtain a table like:
   id    channel
0   0        [0]
1   1        [1]
2   2     [0, 1]
3   3        [3]
4   4  [2, 3, 1]

through df_in.groupby('id', as_index=False).agg({'channel': lambda x: tuple(x)}).

CodePudding user response:

You can map the values from one of the tuple values to the other, then identify the duplicates:

d = dict(allowed_channel_couples)
# {0: 1, 2: 3}

df_in['flag'] = (df_in
 .assign(channel=df_in['channel'].map(d).fillna(df_in['channel']))
 .duplicated(['id', 'channel'], keep=False)
 .astype(int)
)

output:

   id  channel  flag
0   0        0     0
1   1        1     0
2   2        0     1
3   2        1     1
4   3        3     0
5   4        2     1
6   4        3     1
7   4        1     0
  • Related