Remove multi-word substring from string if substring in list in data frame column

Time:09-20

This is a follow-up to my earlier question: Remove substring from string if substring in list in data frame column

I have the following data frame df1

       string             lists
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']
1      there is a cat     ['dog', 'house', 'car']
2      hello EVERYONE     ['hi', 'hello', 'everyone']
3      hi my name is Joe  ['name', 'was', 'is Joe']

I'm trying to return a data frame df2 that looks like this

       string             lists                         new_string
0      I HAVE A PET DOG   ['fox', 'pet dog', 'cat']     I HAVE A
1      there is a cat     ['dog', 'house', 'car']       there is a cat
2      hello EVERYONE     ['hi', 'hello', 'everyone']   
3      hi my name is Joe  ['name', 'was', 'is Joe']     hi my

The solution I was using splits the string into single words, so it does not work when a substring spans multiple words, such as 'pet dog' or 'is Joe':

df['new_string'] = df['string'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in df['lists'][df['string'] == x].values[0]]))

CodePudding user response:

The question is roughly similar, but still quite different.

In this case we use re.sub over the row axis (axis=1):

import re

df["new_string"] = df.apply(lambda row: re.sub("|".join(row["lists"]), "", row["string"], flags=re.I), axis=1)
              string                  lists      new_string
0   I HAVE A PET DOG    [fox, pet dog, cat]       I HAVE A 
1     there is a cat      [dog, house, car]  there is a cat
2     hello EVERYONE  [hi, hello, everyone]                
3  hi my name is Joe    [name, was, is Joe]         hi my 

To break it down:

  1. df.apply with axis=1 applies a function to each row.
  2. re.sub is the regex counterpart of str.replace.
  3. "|".join builds a "|"-separated string; "|" acts as the OR operator in a regex, so every occurrence of any substring in the list is removed.
  4. flags=re.I makes the match case-insensitive.

Note: since we use apply over the row axis, this is basically a Python-level loop in the background and thus not very optimized.
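
The steps above can also be sketched as a small standalone helper. This is only a sketch under a few assumptions that are not in the original answer: the name remove_substrings is hypothetical, the substrings are escaped with re.escape so characters such as "." are matched literally, longer substrings are tried first, and leftover whitespace is collapsed (which also drops the trailing space visible in the output above).

```python
import re

def remove_substrings(text, substrings):
    """Remove every case-insensitive occurrence of each substring from text.

    Hypothetical helper, not part of pandas or of the original answer.
    """
    if not substrings:
        return text
    # Try longer substrings first so "pet dog" wins over a shorter one like "pet".
    ordered = sorted(substrings, key=len, reverse=True)
    # re.escape keeps regex metacharacters in the substrings literal.
    pattern = "|".join(re.escape(s) for s in ordered)
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())

rows = [
    ("I HAVE A PET DOG", ["fox", "pet dog", "cat"]),
    ("there is a cat", ["dog", "house", "car"]),
    ("hello EVERYONE", ["hi", "hello", "everyone"]),
    ("hi my name is Joe", ["name", "was", "is Joe"]),
]
new_strings = [remove_substrings(s, subs) for s, subs in rows]

# With pandas, assuming the column names from the question, the same helper
# could be applied without DataFrame.apply:
# df["new_string"] = [remove_substrings(s, subs)
#                     for s, subs in zip(df["string"], df["lists"])]
```

Like the one-liner above, this matches substrings anywhere, including inside longer words (for example "hi" inside "this"); wrapping each alternative in \b word boundaries would be one way to restrict it to whole words, if that is what you need.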
