Home > other >  Remove star and empty line using regex
Remove star and empty line using regex

Time:11-02

I have dataframe i want remove star and all the empty line in localisation. I have to create two columns "temp" and "word". "temp" contains all the lines after the first line break and the column "word" represents all the words of this list [SECTION 11, CONE, BELLY, FIXED PLAN] found in "temp". My input :

enter image description here

Ouput expected: I have to replace star with empty in the column word:

enter image description here

I try this but it doesn't work:

enter image description here

CodePudding user response:

You can use

import re
df['temp'] = df['localisation'].str.replace(r'^.*\n', '', regex=True)
words = ['SECTION 11', 'CONE', 'BELLY', 'FIXED PLAN']
df['word'] = df['temp'].str.findall(fr'(?<!\w)(?:{"|".join([re.escape(w) for w in words])})(?!\w)').str.join(', ')

Details:

  • .str.replace(r'^.*\n', '', regex=True) removes the first line with the line break
  • .str.findall(fr'(?<!\w)(?:{"|".join([re.escape(w) for w in words])})(?!\w)') extracts all occurrences of the words as whole words (due to (?<!\w) and (?!\w) unambiguous word boundaries) in the temp column while escaping all non-word chars in the words.
  • .str.join(', ') at the end of the last code line is used to join the found list of matches with a comma space. You may further adjust what to join the matches with if you edit the argument in the .str.join() part.
  • Related