I have 2 Dataframes, What I am trying to achieve to to create a list of the IDs that match. With this information I can then perform a merge later on (The merge isn't part of my question, I just want to find out how to match the rows correctly).
DF1
ID |
---|
123 |
123B |
234 |
234B |
456 |
456B |
asfdh5 |
45gh |
DF2
ID |
---|
123456 |
123B456 |
BBBER4 |
234567YT |
B9234BAA |
456XXA32 |
999GHF |
pp9985 |
88AVKVBD |
Desired Result - I want to get a list of string that match
Matches | |
---|---|
0 | 123 |
1 | 123B |
2 | NaN |
3 | 234 |
4 | 234B |
5 | 456 |
6 | NaN |
7 | NaN |
8 | NaN |
The current code I am using is the following
df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
The issue is that this code doesn't match the IDs in DF1 exactly, if you see for index 2 in Desired Results above the result is 123B, However I'm only getting 123.
Below are the results I'm getting with my current code.
Matches | |
---|---|
0 | 123 |
1 | 123 |
2 | NaN |
3 | 234 |
4 | 234 |
5 | 456 |
6 | NaN |
7 | NaN |
8 | NaN |
Any help would be much appreciated. Thanking you, Cian
CodePudding user response:
Your problem is that it checks from the top of DF1 when it looks for a match and returns the first match. If you sort the dataframe by the length of the strings in the column it returns the longest match instead.
This should fix your problem:
# Sort by length of string
new_index = df1.ID.str.len().sort_values().index
df1 = df1.reindex(index = new_index[::-1]).reset_index(drop=True)
# Match strings
df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
CodePudding user response:
This behaviour is expected because "For each subject string in the Series, extract groups from the first match of regular expression pat." (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html). So you're getting '123' instead of '123B' because '123' is the first value in the series in df1
, and hence the first match.
CodePudding user response:
All you need to do is to sort your df1['ID']
values by length in descending order using sorted(df1['ID'].values, key=len, reverse=True)
:
>>> df2['ID'].str.extract(fr"({'|'.join(sorted(df1['ID'].values, key=len, reverse=True))})", expand=False)
0 123
1 123B
2 NaN
3 234
4 234B
5 456
6 NaN
7 NaN
8 NaN
Name: ID, dtype: object
You need this because longer alternative in NFA regex engines (Python's re
is) always "wins" and all the other alternatives in the same group are not even tried by the regex engine. See Remember That The Regex Engine Is Eager. E,g. you hd 123|123B
in your regex, so 123
matched 123
in 123B456
and returned that match, 123B
alternative was not tried, and so on.