Regex - Match substring with Strings-CodePudding

I have 2 Dataframes, What I am trying to achieve to to create a list of the IDs that match. With this information I can then perform a merge later on (The merge isn't part of my question, I just want to find out how to match the rows correctly).

DF1

ID
123
123B
234
234B
456
456B
asfdh5
45gh

DF2

ID
123456
123B456
BBBER4
234567YT
B9234BAA
456XXA32
999GHF
pp9985
88AVKVBD

Desired Result - I want to get a list of string that match

	Matches
0	123
1	123B
2	NaN
3	234
4	234B
5	456
6	NaN
7	NaN
8	NaN

The current code I am using is the following

df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)

The issue is that this code doesn't match the IDs in DF1 exactly, if you see for index 2 in Desired Results above the result is 123B, However I'm only getting 123.

Below are the results I'm getting with my current code.

	Matches
0	123
1	123
2	NaN
3	234
4	234
5	456
6	NaN
7	NaN
8	NaN

Any help would be much appreciated. Thanking you, Cian

CodePudding user response：

Your problem is that it checks from the top of DF1 when it looks for a match and returns the first match. If you sort the dataframe by the length of the strings in the column it returns the longest match instead.

This should fix your problem:

# Sort by length of string
new_index = df1.ID.str.len().sort_values().index
df1 = df1.reindex(index = new_index[::-1]).reset_index(drop=True)

# Match strings
df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)

CodePudding user response：

This behaviour is expected because "For each subject string in the Series, extract groups from the first match of regular expression pat." (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html). So you're getting '123' instead of '123B' because '123' is the first value in the series in df1, and hence the first match.

CodePudding user response：

All you need to do is to sort your df1['ID'] values by length in descending order using sorted(df1['ID'].values, key=len, reverse=True):

>>> df2['ID'].str.extract(fr"({'|'.join(sorted(df1['ID'].values, key=len, reverse=True))})", expand=False)
0     123
1    123B
2     NaN
3     234
4    234B
5     456
6     NaN
7     NaN
8     NaN
Name: ID, dtype: object

You need this because longer alternative in NFA regex engines (Python's re is) always "wins" and all the other alternatives in the same group are not even tried by the regex engine. See Remember That The Regex Engine Is Eager. E,g. you hd 123|123B in your regex, so 123 matched 123 in 123B456 and returned that match, 123B alternative was not tried, and so on.