Home > other >  Pandas filter by substrings from two columns
Pandas filter by substrings from two columns

Time:10-27

I have two dataframes that are as follows:

Dataframe df1:

    Rep
0   ec21b_AI_154OH
1   m2010_AI_066UW
2   20wh1_DS_416FC

Dataframe df2:

    Address     FirstPart   SecondPart
0   address13   m2010       066UW
1   address22   2020e       999GV
2   address26   2020c       513DT
3   address35   evd18       874GO
4   address36   ep21b       986CG
5   address493  20wh1       416FC
6   address628  ec21b       154OH

I want to add an Address column in dataframe df1 using pandas, so that it looks as follows:

    Rep             Address
0   ec21b_AI_154OH  address628
1   m2010_AI_066UW  address13
2   20wh1_DS_416FC  address493

For every row in df2, I can search for matches in df1, and put the address. However, is there a better way to do it?

My minimum working example is as follows:

for First, Second, Add in zip(list(df2['FirstPart']),list(df2['SecondPart']),list(df2['Address'])):
    condition = df1['Rep'].str.contains(First) & df1['Rep'].str.contains(Second)
    df1.loc[condition,"Address"] = Add

Note: It is not necessary to find _ as a delimiter, and there can be other delimiters as well.

CodePudding user response:

IIUC, you can extract the two part to use for a merge (here with named capturing groups, although optional if you use the left_on/right_on parameters or merge):

out = df1.join(
 df1['Rep']
 .str.extract(r'^(?P<FirstPart>[^_] ).*?(?P<SecondPart>[^_] )$')
 .merge(df2, how='left')['Address']
)

output:

              Rep     Address
0  ec21b_AI_154OH  address628
1  m2010_AI_066UW   address13
2  20wh1_DS_416FC  address493

CodePudding user response:

you can just loop trought the both columns:

Those are you dfs, on the first df just add the new column Address:

 import pandas as pd
df_dict_1 = {
    'Rep':['ec21b_AI_154OH', 'm2010_AI_066UW', '20wh1_DS_416FC'],
    
    'Address': [None, None, None]
        
   
        }

df_dict_2 = {
    'Address':['address13', 'address22', 'address26', 'address35', 'address36',\
                            'address493', 'address628' ],
    
    'FirstPart':['m2010', '2020e', '2020c', 'evd18', 'ep21b',\
                            '20wh1', 'ec21b' ],
    
    'SecondPart':['066UW', '999GV', '513DT', '874GO', '986CG',\
                            '416FC', '154OH' ],
   
        }




df1 = pd.DataFrame(df_dict_1)
df2 = pd.DataFrame(df_dict_2)

And loop trought the second one:

rep_list = list(df1.Rep)


for x in range(df2.shape[0]):
  
    first_part = df2.FirstPart.iloc[x]
    second_part = df2.SecondPart.iloc[x] 

    rep = ''.join([x for x in rep_list if first_part in x and  second_part in x])
    adress = df2.Address.iloc[x]

    for x in range(len(df1.Rep)):
        if rep ==df1.Rep[x]:
            df1.Address.iloc[x] = adress
            

Output:

    Rep             Address
0   ec21b_AI_154OH  address628
1   m2010_AI_066UW  address13
2   20wh1_DS_416FC  address493
    
  • Related