Each row of df['Description'] contains a user field with 8-digit numbers that I need to extract, but I do not want the ones that have <del'> in front of them. The numbers that should be retrieved are 11111113 and 11111114.
The data looks something like this (without the single quotation marks):
<del'>11111111 Random text here </del'>
<br'><del'>11111112 Random text here </del'></br'>
<p'>11111113 Random text here </p'>
<br'>11111114 Random text here </br'>
I have tried variations of this:
df['SN_Fixed_List']=[re.findall(r'\b(?!<del>)\s*[0-9]{8}\b',x) for x in df['Description']]
CodePudding user response:
You can use
df['SN_Fixed_List'] = df['Description'].str.extract(r'^(?!.*<del>).*\b(\d{8})\b', expand=False)
Details:
^ - start of string
(?!.*<del>) - no <del> allowed anywhere in the string
.* - any zero or more chars other than line break chars, as many as possible
\b(\d{8})\b - eight digits as a whole word (captured into Group 1, the value of which is output with Series.str.extract).
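To check the pattern against the sample rows from the question, here is a minimal sketch. It assumes each sample line is one row of df['Description'] and that the real tags have no single quotation mark (so <del> rather than <del'>). The variable names pattern and wanted are made up for illustration.

```python
import pandas as pd

# Sample rows modelled on the question's data, with the stray quotes removed
# (assumption: the real tags are <del>, <p>, <br>).
df = pd.DataFrame({
    "Description": [
        "<del>11111111 Random text here </del>",
        "<br><del>11111112 Random text here </del></br>",
        "<p>11111113 Random text here </p>",
        "<br>11111114 Random text here </br>",
    ]
})

# Reject any row that contains <del>, otherwise capture an 8-digit number.
pattern = r'^(?!.*<del>).*\b(\d{8})\b'
df["SN_Fixed_List"] = df["Description"].str.extract(pattern, expand=False)

# Rows that did not match hold NaN; drop them to get the wanted numbers.
wanted = df["SN_Fixed_List"].dropna().tolist()
print(wanted)
```

Two things to keep in mind with this approach: a row is rejected as a whole if <del> appears anywhere in it, so a row mixing deleted and non-deleted numbers yields NaN, and because .* is greedy, only the last 8-digit number in a matching row is captured.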