I have a dataframe such as:
import pandas as pd
import re
df = pd.DataFrame({"Name": ["D1", "D2", "D3", "D4", "M1", "M2", "M3"],
"Requirements": ["3 meters|2/3 meters|3.5 meters",
"3 meters",
"3/5 meters|3 meters",
"2/3 meters",
"steel|g1_steel",
"steel",
"g1_steel"]})
dataframe df
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 M1 steel|g1_steel
5 M2 steel
6 M3 g1_steel
I have a list of words req_list = ['3 meters', 'steel']
and I am trying to extract rows from df
where the strings in column Requirements
contain standalone words that are from req_list
. This is what I have done:
This one prints just D2 and M2
df[df.Requirements.apply(lambda x: any(len(x.replace(y, '')) == 0 for y in req_list))]
This one prints all rows
df[df['Requirements'].str.contains(fr"\b(?:{'|'.join(req_list)})\b")]
My desired result is as follows:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
In this desired output, D4 and M3 are eliminated because they do not have words from req_list
as standalone strings. Is there any way to achieve this preferably in an one-liner without using custom functions?
EDIT
The strings in the column Requirements
can come in any pattern such as:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
3 D4 2/3 meters
4 D5 3::3 meters # New pattern which needs to be eliminated
5 D6 3.3 meters # New pattern which needs to be eliminated
6 D7 3?3 meters # New pattern which needs to be eliminated
7 M1 steel|g1_steel
8 M2 steel
9 M3 g1_steel
CodePudding user response:
Since you want to make sure you do not match 3 meters
that is preceded with a digit /
, you may add a (?<!\d/)
negative lookbehind after the intial word boundary:
df[df['Requirements'].str.contains(fr"\b(?<!\d/)(?:{'|'.join(req_list)})\b")]
Output:
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel
See the regex demo.
Notes
- Since
req_list
contains phrases (mutiword strings) you might have to sort the items by length in the descending order before joining with the|
OR operator, so you'd better usefr"\b(?<!\d/)(?:{'|'.join(sorted(req_list, key=len, reverse=True))})\b"
as regex - If the
req_list
ever contains items with special chars you should also use adaptive dynamic word boundaries, i.e.fr"(?!\B\w)(?<!\d/)(?:{'|'.join(sorted(map(re.escape, req_list), key=len, reverse=True))})(?<!\w\B)"
.
CodePudding user response:
Here is one more way to do it
def chk(row):
for r in row:
if r.strip() in req_list:
return True
return False
df[df.assign(lst = df['Requirements'].str.split('|'))['lst'].apply(chk) == True]
Name Requirements
0 D1 3 meters|2/3 meters|3.5 meters
1 D2 3 meters
2 D3 3/5 meters|3 meters
4 M1 steel|g1_steel
5 M2 steel