remove or replace only the word after a specific word in column pandas using regex

Time:11-07

I'm trying to use regex to remove or replace only the word that follows specific word(s) in a column of strings in a dataframe. I don't want the spaces to be replaced, just the word that follows the specific word(s).

import pandas as pd

df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT     VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
 
df.STRING.str.replace(r'/(?<=NO|WITHOUT)(\s+)\w','', regex=True)  # this doesn't work

Here are my input and the desired output:

                                              String  \
0        THERE IS NO REASON WHY THIS SHOULDN'T WORK!   
1           I AM WITHOUT DOUBT     VERY BAD AT REGEX   
2        I CAN'T SOLVE A PROBLEM THAT HAS NO INT...   

                                      desired_output  
0              THERE IS NO  WHY THIS SHOULDN'T WORK!  
1                I AM WITHOUT      VERY BAD AT REGEX  
2         I CAN'T SOLVE A PROBLEM THAT HAS NO  VALUE  

Again, I don't want the spaces between the words to be removed. I only want the one word after NO or WITHOUT to be removed or replaced.

CodePudding user response:

Note that your regex, /(?<=NO|WITHOUT)(\s+)\w, contains several issues:

  • / - is a typo; it was probably a regex delimiter that got into the pattern
  • (?<=NO|WITHOUT) - is a lookbehind whose alternatives match strings of different lengths (2 and 7 characters), and Python lookbehind patterns must be fixed-width
  • \w - matches a single word char, not one or more. There must be a quantifier after \w: * (zero or more occurrences) or + (one or more occurrences).
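
The last two points can be checked directly. The snippet below is a minimal sketch that uses Python's built-in re module rather than pandas; as far as I know, str.replace(..., regex=True) passes the pattern to that same module for ordinary string columns. The workaround with one lookbehind per keyword is only an illustration of the fixed-width rule, not the fix suggested in this answer.

```python
import re

# A lookbehind with alternatives of different widths (2 vs. 7 chars) is rejected
try:
    re.compile(r'(?<=NO|WITHOUT)\s+\w+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Illustrative workaround: one fixed-width lookbehind per keyword.
# Group 1 now holds only the whitespace, which the replacement puts back.
fixed = re.compile(r'(?:(?<=\bNO)|(?<=\bWITHOUT))(\s+)\w+')
cleaned = fixed.sub(r'\1', "THERE IS NO REASON WHY THIS SHOULDN'T WORK!")

# \w consumes exactly one character; \w+ consumes the whole word
one_char = re.match(r'\w', 'REASON').group()
whole_word = re.match(r'\w+', 'REASON').group()

print(lookbehind_ok, repr(cleaned), one_char, whole_word)
```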

You can use

import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT     VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
pattern = r'\b((?:NO|WITHOUT)\s+)\w+'
df['STRING'] = df['STRING'].str.replace(pattern, r'\1', regex=True)

Output:

>>> print(df.to_string())
                                      STRING
0      THERE IS NO  WHY THIS SHOULDN'T WORK!
1        I AM WITHOUT      VERY BAD AT REGEX
2  I CAN'T SOLVE A PROBLEM HAT HAS NO  VALUE 
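
Since the question also asks about replacing the word rather than removing it, here is a small sketch of that variant. It uses re.sub so it runs without pandas; the same pattern and replacement string should be usable with Series.str.replace(..., regex=True). The "[REDACTED]" placeholder is made up for illustration, and \g<1> is used instead of \1 so the group reference stays unambiguous if the replacement text ever starts with a digit.

```python
import re

pattern = r'\b((?:NO|WITHOUT)\s+)\w+'
text = "I AM WITHOUT DOUBT     VERY BAD AT REGEX"

# \g<1> re-inserts the keyword and the whitespace after it;
# "[REDACTED]" is a hypothetical placeholder for the replaced word
replaced = re.sub(pattern, r'\g<1>[REDACTED]', text)

print(replaced)
```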

Details:

  • \b - a word boundary
  • ((?:NO|WITHOUT)\s+) - Group 1 (\1 in the replacement pattern refers to this group's value): NO or WITHOUT, followed by one or more whitespace chars
  • \w+ - one or more word chars (replace with \S+ if you plan to remove one or more non-whitespace chars, or even \S+\b to leave trailing punctuation in place).
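
To see how the three endings differ, the sketch below applies each one to two made-up sample strings that contain an apostrophe and a trailing comma. It uses re.sub directly; the helper name strip_next and the sample strings are assumptions for illustration only.

```python
import re

# Group 1: the keyword plus the whitespace after it (kept via \1)
PREFIX = r'\b((?:NO|WITHOUT)\s+)'

samples = [
    "WITHOUT SHOULDN'T WORK",
    "THERE IS NO REASON, REALLY",
]

def strip_next(tail, text):
    """Remove whatever `tail` matches right after NO/WITHOUT and its whitespace."""
    return re.sub(PREFIX + tail, r'\1', text)

results = {
    tail: [strip_next(tail, s) for s in samples]
    for tail in (r'\w+', r'\S+', r'\S+\b')
}

for tail, out in results.items():
    print(tail, out)
```

With \w+ the match stops at the apostrophe, so 'T is left behind; \S+ removes everything up to the next whitespace, including the comma; \S+\b removes SHOULDN'T in full but leaves the comma.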