Remove combination of string in dataset in Python

Time:01-31

I have a dataset in Python where I want to remove certain combinations of words from columnX and store the result in a new columnY.

Example of 2 rows of columnX:

what is good: the weather what needs improvement: the house
what is good: everything what needs improvement: nothing

I want to delete the following combinations of words: "what is good" and "what needs improvement".

In the end, the following text should remain in columnY:

the weather the house everything nothing

I have the following script:

stoplist={'what is good', 'what needs improvement'}
dataset['columnY']=dataset['columnX'].apply(lambda x: ''.join([item in x.split() if item nog in stoplist]))

But it doesn't work. What am I doing wrong here?

CodePudding user response:

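I think the list comprehension itself is malformed: it is missing the leading "item for", and "nog" should be "not". Joining with ' ' instead of '' also keeps the remaining words separated. Note that x.split() yields single words, so a multi-word phrase such as 'what is good' never equals any item; the version below fixes the syntax, but it will only remove stoplist entries that are single words: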

stoplist={'what is good', 'what needs improvement'}
dataset['columnY']=dataset['columnX'].apply(lambda x: ' '.join([item for item in x.split() if item not in stoplist]))
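
To see what the word-by-word filter does on one of the sample rows, here is a small check in plain Python (no pandas needed). The variable names text, tokens and result are only for illustration:

```python
# Stoplist and sample row taken from the question.
stoplist = {'what is good', 'what needs improvement'}
text = 'what is good: the weather what needs improvement: the house'

# split() breaks the row into single words (some with a trailing colon).
tokens = text.split()

# Keep every token that is not literally an entry of the stoplist.
kept = [item for item in tokens if item not in stoplist]
result = ' '.join(kept)

# No single token equals a multi-word phrase, so nothing is filtered out.
print(result == text)
```

So the filter runs without error, but the phrases stay in the text because each token is compared against whole phrases.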

CodePudding user response:

Maybe you can operate on the column itself, using the pandas string methods:

df["Y"] = df["X"]

df.Y = df.Y.str.replace("what is good", "")

You would have to repeat this for every item in your stoplist, which is fine if there are only a few of them.

For example, with a loop over a replacement map:

replacement_map = {"what needs improvement": "", "what is good": ""}

for old, new in replacement_map.items():
    df.Y = df.Y.str.replace(old, new)
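
If the leftover colons and double spaces are also unwanted, one possible variant is a single regular expression built from the stoplist. This is only a sketch: the helper name remove_phrases is made up here, and treating the ':' after each phrase as optional is an assumption based on the two sample rows.

```python
import re

# Phrases to strip, taken from the question's stoplist.
stoplist = ['what is good', 'what needs improvement']

# One alternation pattern; re.escape guards against regex metacharacters.
# The optional ':' after each phrase is an assumption from the sample rows.
pattern = re.compile('(?:' + '|'.join(re.escape(p) for p in stoplist) + r'):?')

def remove_phrases(text):
    """Delete every stoplist phrase, then collapse leftover whitespace."""
    stripped = pattern.sub(' ', text)
    return ' '.join(stripped.split())

rows = [
    'what is good: the weather what needs improvement: the house',
    'what is good: everything what needs improvement: nothing',
]
cleaned = [remove_phrases(r) for r in rows]
print(cleaned)
```

On a DataFrame the same function could be applied per row, for example dataset['columnY'] = dataset['columnX'].apply(remove_phrases).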