dataframe select a word on the text-CodePudding

i have a datafame in input and i want to extract in column "localisation" the word in this list ["SECTION 11","CÔNE","BELLY"] and i have to create new column "word" in the dataframe. If the word of the list exist in the the column "localisation" i fill the word in the column created "word".otherwise i put the full text in the column "word" This my dataframe

I create new column "word" I selected the line containing the words from the list I fill in the "word" column with the keywords found from the list

["SECTION 11","CÔNE","BELLY"]
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")
df["temp"]=df["localisation"].str.extract("Localisation[\s]*:.*\n([^_\n]{3,})\n[^\n]*\n")
df["word"]=df["temp"].str.extract("(SECTION 11|CÔNE|BELLY)")

my problem I can not put the full text if the word of the list is not found in the column "localization". I have null values in the lines or I have to put the full text

CodePudding user response：

You need to use .fillna with the df["localisation"] as argument:

df["word"]=df["localisation"].str.extract(r"\b(SECTION 11|CÔNE|BELLY)\b", expand=False).fillna(df["localisation"])

Note also that I suggested r"\b(SECTION 11|CÔNE|BELLY)\b", a regex with word boundaries to only match your alternatives as whole words. Note that word boundaries \b are Unicode-aware in Python re that is used behind the scenes in Pandas.

If you do not need the whole word search, you may keep using r"SECTION 11|CÔNE|BELLY".