I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({
    'col_1': ['filmeinlage federspeicher anlegen',
              'filmeinlage lm a-kreis',
              'weco-pvb-primerspray ral 3012',
              'tragrolle unten (metall) talent,t3',
              'metallschutzschlauch, spr-va 36',
              'gummi pflege liqui moly 500ml',
              'gummikugel für 5-stellungskippschalter',
              'megaphone er-520 6/10w abs',
              'weco primerspray -lar- 3012'],
    'col_2': ['lm',
              'lm',
              'pvb',
              'metall',
              'metall',
              'gummi',
              'gummi',
              'abs',
              'lar']
})
I would like to check whether the string in Col_2 is present in Col_1, but only when it stands on its own or is surrounded by special characters (spaces, hyphens, parentheses and so on). If it is, the new column should be True; otherwise it should be False, as shown in the example below.
For instance, if Col_2 is 'lm' and Col_1 only contains it inside a word such as 'filmeinlage', the result should be False, but if Col_1 is 'filmeinlage lm a-kreis' it should be True.
Col_1 | Col_2 | Desired_Column |
---|---|---|
filmeinlage federspeicher anlegen | lm | False |
filmeinlage lm a-kreis | lm | True |
weco-pvb-primerspray ral 3012 | pvb | True |
tragrolle unten (metall) talent,t3 | metall | True |
metallschutzschlauch, spr-va 36 | metall | False |
gummi pflege liqui moly 500ml | gummi | True |
gummikugel für 5-stellungskippschalter | gummi | False |
megaphone er-520 6/10w abs | abs | True |
weco primerspray -lar- 3012 | lar | True |
CodePudding user response:
You're looking for "word boundaries", i.e. "\b" in regexes:
import re

df["new"] = [re.search(fr"\b{re.escape(c2)}\b", c1) is not None
             for c1, c2 in zip(df["col_1"], df["col_2"])]
- zip pairs up the two columns col_1 and col_2 row by row
- on each iteration, c1 and c2 hold the current row's values from the respective columns
- for each pair, search c1 for the c2 value surrounded by \b on both sides
- \b matches the position between a word character (letter, digit or underscore) and anything that is not one: the beginning or end of the string, a space, a parenthesis, a hyphen, etc.
- if the search does not return None, there is a match; otherwise there is none
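To make the behaviour of \b concrete, here is a minimal sketch that applies the same pattern to a few single strings. The helper name has_whole_word and the inputs other than the first two are made up for illustration; they are not part of the question's data.

```python
import re

def has_whole_word(needle: str, text: str) -> bool:
    """Hypothetical helper: True if needle occurs in text between word boundaries."""
    return re.search(rf"\b{re.escape(needle)}\b", text) is not None

# Inside a longer word there is no boundary around "lm".
print(has_whole_word("lm", "filmeinlage"))             # False
# Spaces, hyphens and parentheses are non-word characters, so boundaries exist.
print(has_whole_word("lm", "filmeinlage lm a-kreis"))  # True
print(has_whole_word("lm", "weco-lm-spray"))           # True
print(has_whole_word("lm", "(lm)"))                    # True
# Digits and underscores count as word characters, so these do not match.
print(has_whole_word("lm", "lm2"))                     # False
print(has_whole_word("lm", "lm_a"))                    # False
```

The last two cases are worth keeping in mind if a digit or underscore next to the value should also count as a "special character" for your data.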
re.escape is there to neutralize any regex special characters within the col_2 values. For example, if a value contains "(", which is special in regexes, it is replaced with "\(" so that it means a literal parenthesis.
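A small sketch of why the escaping matters. The values "1.5" and "a(b" and the text are invented for illustration; nothing like them appears in the sample data.

```python
import re

# Made-up value containing ".", which normally means "any character".
value = "1.5"
text = "kabel 105 mm"

# Unescaped, the dot matches the "0" in "105": a false positive.
unescaped_hit = re.search(rf"\b{value}\b", text) is not None
# Escaped, the dot only matches a literal ".".
escaped_hit = re.search(rf"\b{re.escape(value)}\b", text) is not None

print(re.escape(value))  # 1\.5
print(unescaped_hit)     # True
print(escaped_hit)       # False

# An unbalanced "(" is worse: the unescaped pattern does not even compile.
try:
    re.compile(r"\b" + "a(b" + r"\b")
    compiles = True
except re.error:
    compiles = False
print(compiles)          # False
```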
The result is:
>>> df
                                    col_1   col_2    new
0       filmeinlage federspeicher anlegen      lm  False
1                  filmeinlage lm a-kreis      lm   True
2           weco-pvb-primerspray ral 3012     pvb   True
3      tragrolle unten (metall) talent,t3  metall   True
4         metallschutzschlauch, spr-va 36  metall  False
5           gummi pflege liqui moly 500ml   gummi   True
6  gummikugel für 5-stellungskippschalter   gummi  False
7              megaphone er-520 6/10w abs     abs   True
8             weco primerspray -lar- 3012     lar   True
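One caveat: \b needs a word character on one side, so it stops working if a col_2 value itself starts or ends with a special character (say "-lar-" instead of "lar"). If that can happen in your data, one possible variant is to use lookarounds that only require "no word character next to the value". This is a sketch under that assumption; the helper name contains_standalone and the "-lar-" value are hypothetical.

```python
import re

def contains_standalone(text: str, value: str) -> bool:
    """Hypothetical helper: value must not touch a word character on either side."""
    pattern = rf"(?<!\w){re.escape(value)}(?!\w)"
    return re.search(pattern, text) is not None

# Plain lists stand in for the two columns; the last value starts and ends with "-".
col_1 = ["filmeinlage lm a-kreis",
         "filmeinlage federspeicher anlegen",
         "weco primerspray -lar- 3012"]
col_2 = ["lm", "lm", "-lar-"]

results = [contains_standalone(c1, c2) for c1, c2 in zip(col_1, col_2)]
print(results)  # [True, False, True]

# For comparison, the \b version misses the "-lar-" row.
with_b = re.search(rf"\b{re.escape('-lar-')}\b", col_1[2]) is not None
print(with_b)   # False
```

The helper can be used in the same list comprehension as above, iterating over zip(df["col_1"], df["col_2"]). For values made up only of word characters, as in the sample data, both versions give the same result.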