I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num | city | address_line |
---|---|---|
1 | Toronto | 123 Fake St |
2 | Montreal | 456 Sample Ave |
df2:
DB_Num | Address |
---|---|
AB1 | Toronto 123 Fake St |
AB3 | 789 Random Drive, Toronto |
I want to know which DB_Num in df2 match to addres_line and city in df1, and include which file_num the match was from.
My ideal output is:
file_num | city | address_line | DB_Num | Address |
---|---|---|---|---|
1 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |
Based on the above linked post, I have made a look ahead regex, and am searching using the insert
and str.extract
method.
df1['search_field'] = "(?=.*" df1['city'] ")(?=.*" df1['address_line'] ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" pat ')', expand=False))
Since my address in df2 is entered manually, it is sometimes out of order.
Because it is out of order, I am using the look ahead method of regex.
The look ahead method is causing str.extract
to not output any value. Although I can still filter out nulls and it will keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
CodePudding user response:
Start by creating a new series which is the row each "Address" in df2 corresponds to "address_line" in df1, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0 123 Fake St
1 NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
index | file_num | City | address_line | DB_num | Address |
---|---|---|---|---|---|
0 | 1.0 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |