How to match a string that doesn't start with <del> but ends with ######## with regex

Each row of df['Description'] contains a user field with an 8-digit number that I need to grab, but I do not want the numbers that have <del'> in front of them. The numbers that should be retrieved are 11111113 and 11111114.
The data looks something like this (without the single quotation marks):

<del'>11111111 Random text here </del'><br>
<br'><del'>11111112 Random text here </del'></br'><br>
<p'>11111113 Random text here </p'><br>
<br'>11111114 Random text here </br'>

I have tried variations of this:

df['SN_Fixed_List']=[re.findall(r'\b(?!<del>)\s*[0-9]{8}\b',x) for x in df['Description']]

CodePudding user response:

You can use

df['SN_Fixed_List'] = df['Description'].str.extract(r'^(?!.*<del>).*\b(\d{8})\b', expand=False)

Details:

^ - start of string
(?!.*<del>) - a negative lookahead that fails the match if <del> appears anywhere in the string
.* - zero or more characters other than line break characters, as many as possible
\b(\d{8})\b - eight digits as a whole word, captured into Group 1, whose value Series.str.extract returns (rows with no match yield NaN).
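
The breakdown above can be checked with a minimal sketch that uses only the standard re module. It assumes the sample rows from the question with the stray single quotes removed. The names rows, extract_serial, extracted and kept are made up for illustration. It searches each row once and takes Group 1, which is roughly what Series.str.extract does per row.

```python
import re

# Sample rows modelled on the question, written without the stray single quotes.
rows = [
    "<del>11111111 Random text here </del><br>",
    "<br><del>11111112 Random text here </del></br><br>",
    "<p>11111113 Random text here </p><br>",
    "<br>11111114 Random text here </br>",
]

# ^(?!.*<del>)  reject the whole row if <del> occurs anywhere in it
# .*            skip ahead greedily
# \b(\d{8})\b   capture eight digits that stand as a whole word
pattern = re.compile(r"^(?!.*<del>).*\b(\d{8})\b")


def extract_serial(text):
    """Return the captured eight-digit number, or None if the row is rejected."""
    match = pattern.search(text)
    return match.group(1) if match else None


# One result per row; rejected rows become None (pandas would show NaN instead).
extracted = [extract_serial(row) for row in rows]

# Keep only the rows that produced a number.
kept = [value for value in extracted if value is not None]

print(extracted)
print(kept)
```

Because .* is greedy, a row holding several eight-digit numbers would return the last one. If a single cell can mix <del> and non-<del> numbers, this whole-row check is probably too coarse and the HTML may need to be parsed instead.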