Asking a follow-up question to my earlier question: Remove substring from string if substring in list in data frame column
I have the following data frame df1:
   string             lists
0  I HAVE A PET DOG   ['fox', 'pet dog', 'cat']
1  there is a cat     ['dog', 'house', 'car']
2  hello EVERYONE     ['hi', 'hello', 'everyone']
3  hi my name is Joe  ['name', 'was', 'is Joe']
I'm trying to return a data frame df2 that looks like this:
   string             lists                        new_string
0  I HAVE A PET DOG   ['fox', 'pet dog', 'cat']    I HAVE A
1  there is a cat     ['dog', 'house', 'car']      there is a cat
2  hello EVERYONE     ['hi', 'hello', 'everyone']
3  hi my name is Joe  ['name', 'was', 'is Joe']    hi my
The solution I was using does not work when a substring spans multiple words, such as pet dog or is Joe:
df['new_string'] = df['string'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in df['lists'][df['string'] == x].values[0]]))
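To see why the word-by-word filter misses multi-word entries, here is a minimal sketch of the same logic applied to row 0 alone, without pandas. The variable names are only for illustration. Each token produced by split() is a single word, so it can never equal 'pet dog', and nothing gets removed.

```python
# Row 0 of the example frame: the phrase 'pet dog' spans two tokens.
text = "I HAVE A PET DOG"
substrings = ["fox", "pet dog", "cat"]

# Same filtering idea as the one-liner above, applied to a single row:
# keep every word whose lowercase form is not an entry of the list.
kept = [word for word in text.split() if word.lower() not in substrings]
word_level_result = " ".join(kept)

# 'pet' and 'dog' are checked separately, so neither matches 'pet dog'.
print(word_level_result)  # I HAVE A PET DOG
```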
CodePudding user response:
The question is similar to the earlier one, but different enough to need another approach.
In this case we use re.sub over the row axis (axis=1):
import re
df["new_string"] = df.apply(lambda row: re.sub("|".join(row["lists"]), "", row["string"], flags=re.I), axis=1)
   string             lists                  new_string
0  I HAVE A PET DOG   [fox, pet dog, cat]    I HAVE A
1  there is a cat     [dog, house, car]      there is a cat
2  hello EVERYONE     [hi, hello, everyone]
3  hi my name is Joe  [name, was, is Joe]    hi my
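For anyone who wants to reproduce this, the following is a self-contained sketch that rebuilds the example frame and applies the same re.sub idea. The helper name remove_listed and the final .str.strip() are my additions, not part of the one-liner: the substitution leaves the spaces around the removed words in place, so stripping is one way to get the tidy values shown in the table.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "string": ["I HAVE A PET DOG", "there is a cat",
               "hello EVERYONE", "hi my name is Joe"],
    "lists": [["fox", "pet dog", "cat"], ["dog", "house", "car"],
              ["hi", "hello", "everyone"], ["name", "was", "is Joe"]],
})

def remove_listed(row):
    # Hypothetical helper: build 'a|b|c' from the row's list and
    # delete every case-insensitive match from the row's string.
    pattern = "|".join(row["lists"])
    return re.sub(pattern, "", row["string"], flags=re.I)

# axis=1 passes one row at a time to remove_listed.
df["new_string"] = df.apply(remove_listed, axis=1)

# Assumption: leftover leading/trailing spaces are unwanted.
df["new_string"] = df["new_string"].str.strip()

print(df)
```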
To break it down:
- df.apply with axis=1 applies a function to each row.
- re.sub is the regex variant of str.replace.
- We use "|".join to make a "|"-separated string, which acts as the or operator in regex, so any one of these substrings is removed wherever it occurs.
- flags=re.I makes the match ignore letter case.
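One caveat with joining the raw list entries: they are treated as regex patterns, so an entry containing characters such as $ or ( would not be matched literally. The sketch below is a possible hardening, not something the answer above does. It escapes each entry with re.escape, tries longer entries first, and collapses leftover whitespace. The function names build_pattern and remove_substrings are hypothetical.

```python
import re

def build_pattern(substrings):
    # Longest first, so 'pet dog' is tried before 'pet' if both are listed.
    ordered = sorted(substrings, key=len, reverse=True)
    # re.escape makes every entry match literally, metacharacters included.
    return re.compile("|".join(re.escape(s) for s in ordered), flags=re.I)

def remove_substrings(text, substrings):
    if not substrings:
        return text  # nothing to remove
    cleaned = build_pattern(substrings).sub("", text)
    # Collapse the whitespace left behind by the removed pieces.
    return " ".join(cleaned.split())

print(remove_substrings("I HAVE A PET DOG", ["fox", "pet dog", "cat"]))
print(remove_substrings("price is $5 (approx.)", ["$5", "(approx.)"]))
```

It can be plugged into the same row-wise call, e.g. df.apply(lambda row: remove_substrings(row["string"], row["lists"]), axis=1).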
Note: since we use apply over the row axis, this is basically a Python-level loop in the background and thus not very optimized.
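If the row-wise apply turns out to be a bottleneck, one alternative is to skip building row objects and zip the two columns directly. This is only a sketch with a hypothetical helper name, remove_all, shown on plain lists so it runs without pandas. It is still a Python loop, so any speed-up would come from avoiding the per-row overhead, not from vectorization.

```python
import re

def remove_all(strings, lists):
    # Hypothetical helper: pair each string with its own list of
    # substrings and remove every case-insensitive match.
    return [
        re.sub("|".join(subs), "", text, flags=re.I) if subs else text
        for text, subs in zip(strings, lists)
    ]

strings = ["I HAVE A PET DOG", "there is a cat"]
lists = [["fox", "pet dog", "cat"], ["dog", "house", "car"]]
result = remove_all(strings, lists)
print(result)
```

With the frame from above it could be called as df["new_string"] = remove_all(df["string"], df["lists"]). Whether it is actually faster depends on the data, so it is worth timing both versions.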