I have a dataset with different columns. It looks like that:
ID = [97,97,97,19,19,33,33,33,33,33,11]
DATE = [2018-09,2018-09,2020-02,2021-01,2021-01,2017-04,2017-04,2017-04,2019-09,2019-09,2022-05]
Destination_1 = [BRAZIL,BRAZIL,BRAZIL,ARGENTINA,MOROCCO,INDONESIA,USA,BRAZIL,EGYPT,LEBANON,USA]
Destination_2 = [BRAZIL,URUGUAY,SINGAPORE,VENEZUELA,THAILAND,PERU,ECUADOR,USA,ALGERIA, EGYPT,CANADA]
I would like to create a new column OUTPUT based on the following condition:
for each ID if at the same DATE at least one of the DESTINATION_1 equals DESTINATION_2 then the output sould be TRUE otherwise FALSE.
The desired result for the OUTPUT column should then be:
OUTPUT=[TRUE, TRUE, FALSE , FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE]
meaning that if at the same DATE at least one destination is common to each ID then the condition is satisfied and the output is TRUE.
How could I do so? So far I tried:
for sublist in df["ID"]:
if (df["Destination_1"] == df["Destination_2"]).any():
print("True")
else :
print("False")
but this doesn't work for me and I do not get how to add the second condition concerning the DATE without getting an error.
Could someone help me please?
CodePudding user response:
You can try groupby
and use isin
to check value existence then assign
out = (df.groupby(['ID', 'DATE'])
.apply(lambda g: g.assign(OUTPUT=[g['Destination_1'].isin(g['Destination_2']).any()]*len(g))))
print(out)
ID DATE Destination_1 Destination_2 OUTPUT
0 97 2018-09 BRAZIL BRAZIL True
1 97 2018-09 BRAZIL URUGUAY True
2 97 2020-02 BRAZIL SINGAPORE False
3 19 2021-01 ARGENTINA VENEZUELA False
4 19 2021-01 MOROCCO THAILAND False
5 33 2017-04 INDONESIA PERU True
6 33 2017-04 USA ECUADOR True
7 33 2017-04 BRAZIL USA True
8 33 2019-09 EGYPT ALGERIA True
9 33 2019-09 LEBANON EGYPT True
10 11 2022-05 USA CANADA False