I have a dataset where I try to select only the rows, which exactly match de defined strings in the list.
list = ['P:34.', 'R:ES.']
df = pd.DataFrame({
'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
'Code':['P:34. R:ES.', 'R:ESB.', 'K2P:P:341.', 'R:ESZ', 'P:34.', 'R.ES7.', 'P 34 5', 'P:32. R:ES.'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})
I used the function str.contains
to select the rows accordingly but with that, I get rows with do not match the strings exactly.
sample = df[df.Code.str.contains('|'.join(list),na=False)]
I try to get only the rows that contain exactly the strings (also considering the point at the end of the strings) in the list which would be something like that:
df_exact_match = pd.DataFrame({
'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
'Code':['P:34. R:ES.', 'P:34.', 'P:32. R:ES.'],
'Ratings':[9.0, 2, 5]})
Thank you very much for your advice :)
CodePudding user response:
You can adjust a bit your code. I would first split the 'Code' column and then use isin
in conjunction with any(axis=1)
, which would allow for any values of that list to be included in your 'Code' split column, which is broken into parts:
l = ['P:34.', 'R:ES.']
df.loc[df['Code'].str.split(expand=True).isin(l).any(1)]
prints:
Date Code Ratings
0 2021-01-01 P:34. R:ES. 9.0
4 2021-01-02 P:34. 2.0
7 2021-01-03 P:32. R:ES. 5.0
Also not a good practise to name your custom list
, list. Better to use a different name. I would also advise not to use str.contains
as that would return partial matches, as the name suggests, and not exact matcthes.
CodePudding user response:
I get rows with do not match the strings exactly.
That happens because by default Series.str.contains
assumes that the first argument is a regex pattern, and in regex the dot .
matches any single character. To match the literal .
you have to escape it (i.e. \.
). There is no need to specify na=False
btw.
>>> l = ['P:34\.', 'R:ES\.']
>>> df[df.Code.str.contains('|'.join(l))]
Date Code Ratings
0 2021-01-01 P:34. R:ES. 9.0
4 2021-01-02 P:34. 2.0
7 2021-01-03 P:32. R:ES. 5.0