Consider the dataframe df_in = pd.DataFrame({'id': [0,1,2,2,3,4,4,4], 'channel': [0,1,0,1,3,2,3,1]})
> df_in
id channel
0 0 0
1 1 1
2 2 0
3 2 1
4 3 3
5 4 2
6 4 3
7 4 1
And a list of 2-tuples allowed_channel_couples = ((0,1),(2,3))
.
I'd like to obtain a new dataframe df_out
as:
id channel flag
0 0 0 0
1 1 1 0
2 2 0 1
3 2 1 1
4 3 3 0
5 4 2 1
6 4 3 1
7 4 1 0
Each rows in df_out
equals the rows in df_in
. However, df_out
contains a new column, named flag
. The flag
value of a row in df_out
is equal 1 only if another row with same id
exists whose channel value 'matches' the channel value of the first row, according to allowed_channel_couples
.
Hence consider the row with index 5 of df_in
. This row has id
equal 4 and channel
equal 2. Since the row at index 6 has id
equal 4 and channel
value equal 3, and (2,3)
is in allowed_channel
, both the row with index 5 and the row with index 6 are marked with the flag 1
.
On the other hand, the row with index 7 of df_in
has id
equal 4 and channel
equal 1. The allowed couple for flagging in this case is (0,1)
. No other rows with id
equal 4 and channel equal 0 exists, hence the row with index 7 is marked with a 0.
I'm having troubles in solving this problem avoiding python's slow loops. Any idea on how I can obtain the flags column efficiently?
Ideas:
- I've tried working with the
agg
method. Hence I'm able to obtain a table like:
id channel
0 0 [0]
1 1 [1]
2 2 [0, 1]
3 3 [3]
4 4 [2, 3, 1]
through df_in.groupby('id', as_index=False).agg({'channel': lambda x: tuple(x)})
.
CodePudding user response:
You can map the values from one of the tuple values to the other, then identify the duplicates:
d = dict(allowed_channel_couples)
# {0: 1, 2: 3}
df_in['flag'] = (df_in
.assign(channel=df_in['channel'].map(d).fillna(df_in['channel']))
.duplicated(['id', 'channel'], keep=False)
.astype(int)
)
output:
id channel flag
0 0 0 0
1 1 1 0
2 2 0 1
3 2 1 1
4 3 3 0
5 4 2 1
6 4 3 1
7 4 1 0