Check if a string surrounded by the special characters is present in another string

Time:01-12

I have a DataFrame like this:

    import pandas as pd

    df = pd.DataFrame({
        'col_1': ['filmeinlage federspeicher anlegen',
                  'filmeinlage lm a-kreis',
                  'weco-pvb-primerspray ral 3012',
                  'tragrolle unten (metall) talent,t3',
                  'metallschutzschlauch, spr-va 36',
                  'gummi pflege liqui moly 500ml',
                  'gummikugel für 5-stellungskippschalter',
                  'megaphone er-520 6/10w abs',
                  'weco primerspray -lar- 3012'],
        'col_2': ['lm',
                  'lm',
                  'pvb',
                  'metall',
                  'metall',
                  'gummi',
                  'gummi',
                  'abs',
                  'lar']
    })

I would like to check whether the string in col_2 is present in col_1, but only if it stands on its own or is surrounded by special characters. If so, the new column should contain True, otherwise False, as shown in the example below.
For instance, if col_2 is 'lm' and col_1 contains it only inside a word such as 'filmeinlage', the result should be False, but if col_1 is 'filmeinlage lm a-kreis', the result should be True.

Col_1                                   Col_2   Desired_Column
filmeinlage federspeicher anlegen       lm      False
filmeinlage lm a-kreis                  lm      True
weco-pvb-primerspray ral 3012           pvb     True
tragrolle unten (metall) talent,t3      metall  True
metallschutzschlauch, spr-va 36         metall  False
gummi pflege liqui moly 500ml           gummi   True
gummikugel für 5-stellungskippschalter  gummi   False
megaphone er-520 6/10w abs              abs     True
weco primerspray -lar- 3012             lar     True

CodePudding user response:

You're looking for "word boundaries", i.e. "\b" in regexes:

import re

df["new"] = [re.search(fr"\b{re.escape(c2)}\b", c1) is not None
             for c1, c2 in zip(df["col_1"], df["col_2"])]
  • zip the two columns col_1 and col_2
  • for each pair, look for a \b-surrounded c2 value in c1
    • c1 and c2 take the row values of their respective columns on each iteration
    • \b matches a position, not a character: the boundary between a word character (letter, digit or underscore) and anything else, i.e. the beginning or end of the string, a parenthesis, a space, a hyphen, etc.
  • if the search does not return None, there is a match; otherwise, there is no match
  • re.escape is there to neutralize possible regex special characters within col_2 values
    • e.g., if a value has "(" in it, that character is special to regexes, so it is replaced with "\(" to mean a literal parenthesis.
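
The steps above can also be wrapped in a small helper and tried on a few pairs from the question without pandas. This is only a sketch: the name contains_whole_word is made up for illustration and is not part of re or pandas.

```python
import re

def contains_whole_word(text, word):
    """Return True if `word` occurs in `text` delimited by word boundaries."""
    # re.escape makes any regex metacharacters in `word` literal
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text) is not None

# Sample (col_1, col_2) pairs taken from the question's data
pairs = [
    ("filmeinlage federspeicher anlegen", "lm"),
    ("filmeinlage lm a-kreis", "lm"),
    ("tragrolle unten (metall) talent,t3", "metall"),
    ("metallschutzschlauch, spr-va 36", "metall"),
    ("weco primerspray -lar- 3012", "lar"),
]
results = [contains_whole_word(c1, c2) for c1, c2 in pairs]
print(results)  # [False, True, True, False, True]
```

The same function can be used in the list comprehension over zip(df["col_1"], df["col_2"]) in place of the inline re.search call.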

to get

>>> df

                                    col_1   col_2    new
0       filmeinlage federspeicher anlegen      lm  False
1                  filmeinlage lm a-kreis      lm   True
2           weco-pvb-primerspray ral 3012     pvb   True
3      tragrolle unten (metall) talent,t3  metall   True
4         metallschutzschlauch, spr-va 36  metall  False
5           gummi pflege liqui moly 500ml   gummi   True
6  gummikugel für 5-stellungskippschalter   gummi  False
7              megaphone er-520 6/10w abs     abs   True
8             weco primerspray -lar- 3012     lar   True
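
One caveat, which does not affect the data above: \b needs a word character on one side, so the pattern can miss col_2 values that themselves begin or end with a special character such as "(". If such values are possible (an assumption, since none appear in the question), a possible variant uses lookarounds instead of \b. The helper name contains_delimited is hypothetical.

```python
import re

def contains_delimited(text, value):
    """Return True if `value` occurs in `text` with no word character on either side."""
    # (?<!\w) and (?!\w): the neighbouring character, if any, must not be a word character
    pattern = rf"(?<!\w){re.escape(value)}(?!\w)"
    return re.search(pattern, text) is not None

text = "tragrolle unten (metall) talent,t3"

print(contains_delimited("metallschutzschlauch, spr-va 36", "metall"))  # False
print(contains_delimited(text, "metall"))                               # True
print(contains_delimited(text, "(metall)"))                             # True

# For comparison, the \b-based pattern does not match the last case,
# because there is no word boundary between the space and "("
boundary_match = re.search(rf"\b{re.escape('(metall)')}\b", text) is not None
print(boundary_match)                                                   # False
```

For values made up only of letters and digits, as in the question, both patterns give the same results.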