Home > front end >  How to search a substring from one df in another df?
How to search a substring from one df in another df?

Time:05-28

I have read this post and would like to do something similar.

I have 2 dfs:

df1:

file_num city address_line
1 Toronto 123 Fake St
2 Montreal 456 Sample Ave

df2:

DB_Num Address
AB1 Toronto 123 Fake St
AB3 789 Random Drive, Toronto

I want to know which DB_Num in df2 match to addres_line and city in df1, and include which file_num the match was from.

My ideal output is:

file_num city address_line DB_Num Address
1 Toronto 123 Fake St AB1 Toronto 123 Fake St

Based on the above linked post, I have made a look ahead regex, and am searching using the insert and str.extract method.

df1['search_field'] = "(?=.*"   df1['city']   ")(?=.*"   df1['address_line']   ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("("   pat   ')', expand=False))

Since my address in df2 is entered manually, it is sometimes out of order.

Because it is out of order, I am using the look ahead method of regex.

The look ahead method is causing str.extract to not output any value. Although I can still filter out nulls and it will keep only the correct matches.

My main problem is I have no way to join back to df1 to get the file_num.

I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?

Thanks!

CodePudding user response:

Start by creating a new series which is the row each "Address" in df2 corresponds to "address_line" in df1, if such a row exists:

r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df

#output:

0    123 Fake St
1            NaN
Name: Address, dtype: object

Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:

df1.merge(df2, left_on='address_line', right_on=merge_df)
index file_num City address_line DB_num Address
0 1.0 Toronto 123 Fake St AB1 Toronto 123 Fake St
  • Related