I have a sample series:
import pandas as pd
s = pd.Series(['Complexity Level 1', 'RandomName', 'I-Invoice Submission test', 'I-test2', 'I-string with multiple words'])
I'm trying to capture only strings that begin with "I-", using extract:
extract1 = s.str.extract(r'I-(\w+)')
Current Output:
         0
0      NaN
1      NaN
2  Invoice
3    test2
4   string
It currently extracts only the first word, but I want all the words and whitespace after the identifier, which could be up to 5 words.
Is this a regex adjustment or is there a better method?
What I want is:
                            0
0                         NaN
1                         NaN
2     Invoice Submission test
3                       test2
4  string with multiple words
CodePudding user response:
The regex that will do the job is r'I-(.*)'. Meaning: capture any character (until a newline) after "I-".
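As a minimal sketch, here is that pattern applied to the sample series from the question. Two details are my additions rather than part of the answer: the leading `^`, which restricts matches to strings that actually begin with "I-", and `expand=False`, which makes `extract` return a Series instead of a one-column DataFrame.

```python
import pandas as pd

s = pd.Series([
    'Complexity Level 1',
    'RandomName',
    'I-Invoice Submission test',
    'I-test2',
    'I-string with multiple words',
])

# '^' anchors the match at the start of the string (an addition to the
# answer's pattern); '(.*)' captures everything after "I-" up to a newline.
# expand=False returns a Series because there is a single capture group.
extracted = s.str.extract(r'^I-(.*)', expand=False)
print(extracted)
```

Rows that do not start with "I-" come back as NaN, so `extracted.dropna()` should leave only the matching entries.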
EDIT (From comments):
To capture any character up to a comma, use I-([^,]*). Meaning: capture any character that is not a comma (,) after "I-".
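A small sketch of the comma variant using the standard `re` module, which is what pandas string methods rely on for matching. The sample strings are hypothetical (the series in the question contains no commas) and only illustrate where the capture stops.

```python
import re

# Hypothetical inputs, not taken from the question.
samples = [
    'I-Invoice Submission test, due Friday',
    'I-test2',
    'RandomName',
]

# '[^,]*' matches any run of characters that are not commas,
# so the capture ends just before the first comma (or at end of string).
pattern = re.compile(r'I-([^,]*)')

captured = []
for text in samples:
    match = pattern.search(text)
    captured.append(match.group(1) if match else None)

print(captured)
```

The same pattern string can be passed to `s.str.extract` unchanged.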