pandas python regex find all words that begin with, end with, or contain '

I would like to find all words and numbers that start with, end with, or contain '.

I tried the two regexes below. In the second one I added ?: to say that the text at the beginning or at the end of the word is optional, but I am not getting the required results. What did I do wrong? I would like to find I've, 'had, not', you're, 123'45 - basically everything that has '

import re
xyz="I've never 'had somebody [redacted-number] [redacted-number] [redacted-number] not. not' you're  123'45"


print (re.findall(r"\w+\'\w+", xyz))
print (re.findall(r"(?:\w+)\'(?:\w+)", xyz))

["I've", "you're", "123'45"]
["I've", "you're", "123'45"]

CodePudding user response：

You're almost there. Try this:

(?:\w+)?'(?:\w+)?

(?:\w+)? => ?: makes the group non-capturing, \w+ matches a word character one or more times, and the ? after the group makes the whole group optional (it matches zero or one time).

https://regex101.com/r/N8Y9cQ/1
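
To see what making the groups optional changes, here is a minimal sketch using Python's re module. The sample string is made up for illustration and is not the one from the question. Note that with both groups optional, a stand-alone ' is matched as well.

```python
import re

# Hypothetical sample text, chosen to include a stand-alone apostrophe.
text = "I've 'had not' ' you're"

strict = r"\w+'\w+"              # word chars required on both sides of '
optional = r"(?:\w+)?'(?:\w+)?"  # word chars optional on both sides of '

strict_hits = re.findall(strict, text)
optional_hits = re.findall(optional, text)

print(strict_hits)    # => ["I've", "you're"]
print(optional_hits)  # => ["I've", "'had", "not'", "'", "you're"]
```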

CodePudding user response：

You want to capture all words that contain a ' anywhere within them, no? Try this:

re.findall(r"\w*'\w*", xyz)

CodePudding user response：

You can use

\w*(?!\B'\B)'\w*
\w+'\w*|'\w+

See the regex demo #1 / regex demo #2.

Details

\w*(?!\B'\B)'\w* - zero or more word chars, a ' char (that is not both preceded and followed by non-word chars or start/end of string), zero or more word chars
\w+'\w*|'\w+ - one or more word chars, ', zero or more word chars, OR a ' char and then one or more word chars.

See the Python demo:

import re
xyz="I've never 'had somebody [redacted-number] [redacted-number] [redacted-number] not. not' you're  123'45"
print (re.findall(r"\w*(?!\B'\B)'\w*", xyz))
# => ["I've", "'had", "not'", "you're", "123'45"]
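
As a rough check of how the two patterns treat a ' that stands on its own, the sketch below runs both against a made-up sample string (not the one from the question) and compares them with the plain \w*'\w* pattern. On this sample both patterns skip the lone apostrophe; they are not guaranteed to agree on every input.

```python
import re

# Hypothetical sample text with a stand-alone apostrophe in the middle.
text = "I've 'had not' ' you're"

lookahead = r"\w*(?!\B'\B)'\w*"  # rejects a ' with no word char on either side
alternation = r"\w+'\w*|'\w+"    # requires a word char before or after the '
loose = r"\w*'\w*"               # accepts any ', including a lone one

lookahead_hits = re.findall(lookahead, text)
alternation_hits = re.findall(alternation, text)
loose_hits = re.findall(loose, text)

print(lookahead_hits)    # => ["I've", "'had", "not'", "you're"]
print(alternation_hits)  # => ["I've", "'had", "not'", "you're"]
print(loose_hits)        # => ["I've", "'had", "not'", "'", "you're"]
```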

In Pandas, you can use Series.str.findall:

df['result'] = df['source'].str.findall(r"\w*(?!\B'\B)'\w*")
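
A self-contained sketch of the pandas call, assuming a DataFrame with a text column named source (the column names and the rows are made up for illustration). Each cell of result holds a list of matches, which is empty when a row has no match.

```python
import pandas as pd

PATTERN = r"\w*(?!\B'\B)'\w*"

# Hypothetical data; only the column name "source" follows the line above.
df = pd.DataFrame({
    "source": [
        "I've never 'had somebody",
        "not. not' you're  123'45",
        "no quotes here",
    ]
})

# Series.str.findall applies re.findall to every string in the column.
df["result"] = df["source"].str.findall(PATTERN)

print(df["result"].tolist())
# => [["I've", "'had"], ["not'", "you're", "123'45"], []]
```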