Home > Enterprise >  Filter data based on multiple columns in second dataframe python
Filter data based on multiple columns in second dataframe python

Time:03-11

I have a Dataframe df1 (original table) with multiple columns.

I have a filtered DataFrame df2 with columns date, agent_id, gps1, gps2 only 4 columns.

In df1, I have date, agent_id and final_gps along with other columns.

I want to filter all the data from df1 which exists in df2, I want to compare based on.

Df1.date == df2.date & df1.agent_id == df2.agent_id & (df1.final_gps == df2.gps1 or df1.final_gps == df2.gps2)

df2 sample

date       agent_id  gps1    gps2
14-02-2020  12abc   (1,2)   (7,6)
14-02-2020  12abc   (3,4)   (7,6)
14-02-2020  33bcd   (6,7)   (8,9)
20-02-2020  44hgf   (1,6)   (3,7)
20-02-2020  12abc   (3,5)   (3,1)
20-02-2020  33bcd   (3,4)   (3,6)
21-02-2020  12abc   (4,5)   (5,4)

df1 sample

date       agent_id final_gps   ….
10-02-2020  12abc   (1,2)       …
10-02-2020  33bcd   (8,9)       …
14-02-2020  12abc   (1,2)       …
14-02-2020  12abc   (7,6)       …
14-02-2020  12abc   (3,4)       …
14-02-2020  33bcd   (6,7)       …
14-02-2020  33bcd   (8,9)       …
14-02-2020  33bcd   (1,1)       …
14-02-2020  33bcd   (2,2)       …
18-02-2020  12abc   (1,2)       …
19-02-2020  44hgf   (6,7)       …
20-02-2020  12abc   (3,5)       …
20-02-2020  12abc   (3,1)       …
20-02-2020  44hgf   (1,6)       …
20-02-2020  44hgf   (3,7)       …

required output:-

date       agent_id final_gps   ….
14-02-2020  12abc   (1,2)       …
14-02-2020  12abc   (7,6)       …
14-02-2020  12abc   (3,4)       …
14-02-2020  33bcd   (6,7)       …
14-02-2020  33bcd   (8,9)       …
20-02-2020  12abc   (3,5)       …
20-02-2020  12abc   (3,1)       …
20-02-2020  44hgf   (1,6)       …
20-02-2020  44hgf   (3,7)       …

I tried this but it is giving me all the matching records which exists in df2, but I want strictly data only for those agent_id on that particular date and particular gps matching condition from df1.

df = df1[df1['date'].isin(df2['date']) & 
         df1['agent_id'].isin(df2['agent_id']) & 
        (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]

CodePudding user response:

Use DataFrame.melt for reshape gps1 and gps2 to final_gps first, so possible merge by all 3 columns (not necessary define on), remove duplicates by all columns and last sorting:

df = (df2.melt(id_vars=['date','agent_id'],
               value_vars=['gps1','gps2'],
               value_name='final_gps')
        .drop('variable', axis=1)
        .merge(df1)
        .drop_duplicates()
        .sort_values(by=['date','agent_id'], ignore_index=True))
print (df)
         date agent_id final_gps
0  14-02-2020    12abc     (1,2)
1  14-02-2020    12abc     (3,4)
2  14-02-2020    12abc     (7,6)
3  14-02-2020    33bcd     (6,7)
4  14-02-2020    33bcd     (8,9)
5  20-02-2020    12abc     (3,5)
6  20-02-2020    12abc     (3,1)
7  20-02-2020    44hgf     (1,6)
8  20-02-2020    44hgf     (3,7)

Details:

print (df2.melt(id_vars=['date','agent_id'],
               value_vars=['gps1','gps2'],
               value_name='final_gps'))

          date agent_id variable final_gps
0   14-02-2020    12abc     gps1     (1,2)
1   14-02-2020    12abc     gps1     (3,4)
2   14-02-2020    33bcd     gps1     (6,7)
3   20-02-2020    44hgf     gps1     (1,6)
4   20-02-2020    12abc     gps1     (3,5)
5   20-02-2020    33bcd     gps1     (3,4)
6   21-02-2020    12abc     gps1     (4,5)
7   14-02-2020    12abc     gps2     (7,6)
8   14-02-2020    12abc     gps2     (7,6)
9   14-02-2020    33bcd     gps2     (8,9)
10  20-02-2020    44hgf     gps2     (3,7)
11  20-02-2020    12abc     gps2     (3,1)
12  20-02-2020    33bcd     gps2     (3,6)
13  21-02-2020    12abc     gps2     (5,4)

CodePudding user response:

You could use multiple isin and chain them using & operator. Since final_gps can be either gps1 or gps2, we use | operator in brackets:

out = (df1[df1['date'].isin(df2['date']) & 
           df1['agent_id'].isin(df2['agent_id']) & 
           (df1['final_gps'].isin(df2['gps1']) | df1['final_gps'].isin(df2['gps2']))]
       .reset_index(drop=True))

Output:

         date agent_id final_gps ….
0  14-02-2020    12abc    (1, 2)  …
1  14-02-2020    12abc    (7, 6)  …
2  14-02-2020    12abc    (3, 4)  …
3  14-02-2020    33bcd    (6, 7)  …
4  14-02-2020    33bcd    (8, 9)  …
5  20-02-2020    12abc    (3, 5)  …
6  20-02-2020    12abc    (3, 1)  …
7  20-02-2020    44hgf    (1, 6)  …
8  20-02-2020    44hgf    (3, 7)  …
  • Related