Select rows that match exactly with string contain-CodePudding

I have a dataset where I try to select only the rows, which exactly match de defined strings in the list.

list  = ['P:34.', 'R:ES.'] 

df = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Code':['P:34. R:ES.', 'R:ESB.', 'K2P:P:341.', 'R:ESZ', 'P:34.', 'R.ES7.', 'P 34 5', 'P:32. R:ES.'], 
    'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})

I used the function str.contains to select the rows accordingly but with that, I get rows with do not match the strings exactly.

sample = df[df.Code.str.contains('|'.join(list),na=False)]

I try to get only the rows that contain exactly the strings (also considering the point at the end of the strings) in the list which would be something like that:

df_exact_match = pd.DataFrame({
    'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
    'Code':['P:34. R:ES.', 'P:34.', 'P:32. R:ES.'], 
    'Ratings':[9.0, 2, 5]})

Thank you very much for your advice :)

CodePudding user response：

You can adjust a bit your code. I would first split the 'Code' column and then use isin in conjunction with any(axis=1), which would allow for any values of that list to be included in your 'Code' split column, which is broken into parts:

l  = ['P:34.', 'R:ES.'] 
df.loc[df['Code'].str.split(expand=True).isin(l).any(1)]

prints:

         Date         Code  Ratings
0  2021-01-01  P:34. R:ES.      9.0
4  2021-01-02        P:34.      2.0
7  2021-01-03  P:32. R:ES.      5.0

Also not a good practise to name your custom list, list. Better to use a different name. I would also advise not to use str.contains as that would return partial matches, as the name suggests, and not exact matcthes.

CodePudding user response：

I get rows with do not match the strings exactly.

That happens because by default Series.str.contains assumes that the first argument is a regex pattern, and in regex the dot . matches any single character. To match the literal . you have to escape it (i.e. \.). There is no need to specify na=False btw.

>>> l  = ['P:34\.', 'R:ES\.'] 
>>> df[df.Code.str.contains('|'.join(l))]

         Date         Code  Ratings
0  2021-01-01  P:34. R:ES.      9.0
4  2021-01-02        P:34.      2.0
7  2021-01-03  P:32. R:ES.      5.0