I have a Dataframe df1
(original table) with multiple columns.
I have a filtered DataFrame df2
with columns date
, agent_id
, gps1
, gps2
only 4 columns.
In df1
, I have date
, agent_id
and final_gps
along with other columns.
I want to filter all the data from df1
which exists in df2
, I want to compare based on.
Df1.date == df2.date & df1.agent_id == df2.agent_id & (df1.final_gps == df2.gps1 or df1.final_gps == df2.gps2)
df2 sample
date agent_id gps1 gps2
14-02-2020 12abc (1,2) (7,6)
14-02-2020 12abc (3,4) (7,6)
14-02-2020 33bcd (6,7) (8,9)
20-02-2020 44hgf (1,6) (3,7)
20-02-2020 12abc (3,5) (3,1)
20-02-2020 33bcd (3,4) (3,6)
21-02-2020 12abc (4,5) (5,4)
df1 sample
date agent_id final_gps ….
10-02-2020 12abc (1,2) …
10-02-2020 33bcd (8,9) …
14-02-2020 12abc (1,2) …
14-02-2020 12abc (7,6) …
14-02-2020 12abc (3,4) …
14-02-2020 33bcd (6,7) …
14-02-2020 33bcd (8,9) …
14-02-2020 33bcd (1,1) …
14-02-2020 33bcd (2,2) …
18-02-2020 12abc (1,2) …
19-02-2020 44hgf (6,7) …
20-02-2020 12abc (3,5) …
20-02-2020 12abc (3,1) …
20-02-2020 44hgf (1,6) …
20-02-2020 44hgf (3,7) …
required output:-
date agent_id final_gps ….
14-02-2020 12abc (1,2) …
14-02-2020 12abc (7,6) …
14-02-2020 12abc (3,4) …
14-02-2020 33bcd (6,7) …
14-02-2020 33bcd (8,9) …
20-02-2020 12abc (3,5) …
20-02-2020 12abc (3,1) …
20-02-2020 44hgf (1,6) …
20-02-2020 44hgf (3,7) …
I tried this but it is giving me all the matching records which exists in df2
, but I want strictly data only for those agent_id
on that particular date
and particular gps
matching condition from df1
.
df = df1[df1['date'].isin(df2['date']) &
df1['agent_id'].isin(df2['agent_id']) &
(df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
CodePudding user response:
Use DataFrame.melt
for reshape gps1
and gps2
to final_gps
first, so possible merge by all 3 columns (not necessary define on
), remove duplicates by all columns and last sorting:
df = (df2.melt(id_vars=['date','agent_id'],
value_vars=['gps1','gps2'],
value_name='final_gps')
.drop('variable', axis=1)
.merge(df1)
.drop_duplicates()
.sort_values(by=['date','agent_id'], ignore_index=True))
print (df)
date agent_id final_gps
0 14-02-2020 12abc (1,2)
1 14-02-2020 12abc (3,4)
2 14-02-2020 12abc (7,6)
3 14-02-2020 33bcd (6,7)
4 14-02-2020 33bcd (8,9)
5 20-02-2020 12abc (3,5)
6 20-02-2020 12abc (3,1)
7 20-02-2020 44hgf (1,6)
8 20-02-2020 44hgf (3,7)
Details:
print (df2.melt(id_vars=['date','agent_id'],
value_vars=['gps1','gps2'],
value_name='final_gps'))
date agent_id variable final_gps
0 14-02-2020 12abc gps1 (1,2)
1 14-02-2020 12abc gps1 (3,4)
2 14-02-2020 33bcd gps1 (6,7)
3 20-02-2020 44hgf gps1 (1,6)
4 20-02-2020 12abc gps1 (3,5)
5 20-02-2020 33bcd gps1 (3,4)
6 21-02-2020 12abc gps1 (4,5)
7 14-02-2020 12abc gps2 (7,6)
8 14-02-2020 12abc gps2 (7,6)
9 14-02-2020 33bcd gps2 (8,9)
10 20-02-2020 44hgf gps2 (3,7)
11 20-02-2020 12abc gps2 (3,1)
12 20-02-2020 33bcd gps2 (3,6)
13 21-02-2020 12abc gps2 (5,4)
CodePudding user response:
You could use multiple isin
and chain them using &
operator. Since final_gps
can be either gps1
or gps2
, we use |
operator in brackets:
out = (df1[df1['date'].isin(df2['date']) &
df1['agent_id'].isin(df2['agent_id']) &
(df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
.reset_index(drop=True))
Output:
date agent_id final_gps ….
0 14-02-2020 12abc (1, 2) …
1 14-02-2020 12abc (7, 6) …
2 14-02-2020 12abc (3, 4) …
3 14-02-2020 33bcd (6, 7) …
4 14-02-2020 33bcd (8, 9) …
5 20-02-2020 12abc (3, 5) …
6 20-02-2020 12abc (3, 1) …
7 20-02-2020 44hgf (1, 6) …
8 20-02-2020 44hgf (3, 7) …