I'm trying to use regex to remove or replace only the word that comes after specific word(s) in a column of strings in a dataframe. This means I don't want the spaces to be replaced, just the word that follows the specific word(s).
import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
df.STRING.str.replace(r'/(?<=NO|WITHOUT)(\s+)\w','', regex=True) #this doesn't work
here's my output:
String \
0 THERE IS NO REASON WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT DOUBT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM THAT HAS NO INT...
Desired output:
0 THERE IS NO WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM THAT HAS NO VALUE
Again, I don't want the spaces between them to be removed. I only want the one word after NO or WITHOUT to be removed/replaced.
CodePudding user response:
Note that your regex, /(?<=NO|WITHOUT)(\s+)\w, contains several issues:
/ - is a typo; it was probably a regex delimiter that got into the pattern
(?<=NO|WITHOUT) - is a lookbehind pattern where the alternatives match strings of different lengths, and Python lookbehind patterns must be fixed-width
\w - matches a single word char, not one or more. There must be some quantifier after \w: * (zero or more times) or + (one or more times)
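The last two issues can be checked with the standard-library re module, which pandas uses for regex=True. This is only a minimal sketch; the names lookbehind_error and single, and the input 'NO REASON', are made up for illustration:

```python
import re

# The lookbehind alternatives have different lengths (2 vs. 7 characters),
# so the `re` module refuses to compile the pattern.
try:
    re.compile(r'(?<=NO|WITHOUT)\s+\w+')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error)  # the error message, if compilation failed

# Without a quantifier, \w consumes exactly one word character,
# so only the first letter of the following word is removed.
single = re.sub(r'\bNO\s+\w', '', 'NO REASON')
print(repr(single))  # 'EASON'
```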
You can use
import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
pattern = r'\b((?:NO|WITHOUT)\s+)\w+'
df['STRING'] = df['STRING'].str.replace(pattern, r'\1', regex=True)
Output:
>>> print(df.to_string())
STRING
0 THERE IS NO WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM HAT HAS NO VALUE
See the regex demo. Details:
\b - a word boundary
((?:NO|WITHOUT)\s+) - Group 1 (\1 refers to this group value from the replacement pattern): NO or WITHOUT and then one or more whitespaces
\w+ - one or more word chars (replace with \S+ if you plan to remove one or more non-whitespace chars, or even \S+\b to cut off trailing punctuation).
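To see how the three variants of the last token differ, here is a small sketch using plain re.sub. The sample string and the helper drop_next are hypothetical and chosen only to make the difference visible:

```python
import re

# Hypothetical input: the word after WITHOUT contains an apostrophe
# and is followed by a comma.
sample = "WITHOUT CAN'T, OK"

def drop_next(token_pattern, text):
    """Remove the token after NO/WITHOUT, keeping the keyword and whitespace."""
    pattern = r'\b((?:NO|WITHOUT)\s+)' + token_pattern
    return re.sub(pattern, r'\1', text)

word_only = drop_next(r'\w+', sample)    # stops at the apostrophe
non_space = drop_next(r'\S+', sample)    # also removes the trailing comma
trimmed = drop_next(r'\S+\b', sample)    # removes CAN'T but keeps the comma

print(repr(word_only))  # "WITHOUT 'T, OK"
print(repr(non_space))  # 'WITHOUT  OK'
print(repr(trimmed))    # 'WITHOUT , OK'
```

The whitespace captured in Group 1 is kept as-is in every case, and so is the whitespace that followed the removed token.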