Extracting fixed length digits anywhere in given string using regex

Time:10-31

I have a pandas column with the sample text given below and need to extract a fixed-length identifier from the text.

import pandas as pd

df1 = pd.DataFrame({'Incident_details': ['324657_Sample text1 about the incident',
' 316678_sample text2 with details of incident',
'*DEPARTMENT LIST 316878-Sample text3 with information, ph: 01314522345',
'327787_34587621 (sample text4 with incident details)',
'Sample text5 with details',
'327997_1000587621 (sample text6 with incident info',
' 314489_incident text7 details',
'DEPARTMENT_LIST_325489_Text8 details',
'DEPARTMENT3_316489 text9 details',
'DEPARTMENT_LIST_326499',
'324512_1000257218',
'314656_text10(01345782345)',
'324757_03456789',
'DEPARTMENT_CDES_324903_35678910 (details text11)',
'326512_34500257218 - text12 details',
'Incident 325621_ 316512_ sample text 13']})
  • The identifier that I need to extract always starts with 3 and has a fixed length of 6 digits.
  • It can appear at the start of the string, after whitespace (one, two, or three spaces), or after an underscore.
  • There can be more than one id in a given string, and I need the output below.

Output

Currently I am using

df1['Incident_id'] = df1['Incident_details'].str \
   .findall(r'(?:^|\s|[^_])(\d{6})').str.join(", ")

This expression doesn't give the correct output for my requirement.
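To make the failure concrete, here is a minimal check that uses only Python's built-in re module on two of the sample strings (the names ATTEMPT, samples, and attempt_results are made up for this illustration). The likely culprits are that [^_] accepts any character other than an underscore, digits included, and that \d{6} neither requires a leading 3 nor checks what follows the six digits.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = r'(?:^|\s|[^_])(\d{6})'

samples = [
    '327787_34587621 (sample text4 with incident details)',
    '*DEPARTMENT LIST 316878-Sample text3 with information, ph: 01314522345',
]

# re.findall returns the contents of the single capturing group for each match.
attempt_results = [re.findall(ATTEMPT, s) for s in samples]

for s, found in zip(samples, attempt_results):
    print(found, '<-', s)
# First sample:  ['327787', '458762']  ('458762' is cut out of 34587621)
# Second sample: ['316878', '013145']  ('013145' is cut out of the phone number)
```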

CodePudding user response:

Something like this would work:

 (?:^|(?<=\D))3\d{5}(?=\D|$)
  • (?:^|(?<=\D)) - behind me is the start of the string or a non-digit character
    • Python's re module does not support variable-width lookbehinds, so I could not use this variant: (?<=^|\D)
  • 3\d{5} - the digit 3 followed by five more digits
  • (?=\D|$) - ahead of me is a non-digit character or the end of the string
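As a rough sketch of how the pattern above behaves, the snippet below runs it with Python's built-in re module over a few of the sample strings from the question. The names ID_PATTERN, extract_ids, and rows are hypothetical and exist only for this illustration.

```python
import re

# Pattern from the answer above, compiled once and reused.
ID_PATTERN = re.compile(r'(?:^|(?<=\D))3\d{5}(?=\D|$)')

def extract_ids(text):
    """Join every 6-digit id starting with 3 into one ', '-separated string."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ", ".join(ID_PATTERN.findall(text))

rows = [
    '324657_Sample text1 about the incident',                                  # '324657'
    '*DEPARTMENT LIST 316878-Sample text3 with information, ph: 01314522345',  # '316878'
    '327787_34587621 (sample text4 with incident details)',                    # '327787'
    'Sample text5 with details',                                               # ''
    'DEPARTMENT3_316489 text9 details',                                        # '316489'
    'Incident 325621_ 316512_ sample text 13',                                 # '325621, 316512'
]

for row in rows:
    print(repr(extract_ids(row)))
```

The same pattern string should drop into the question's own call, df1['Incident_details'].str.findall(...).str.join(", "), since Series.str.findall applies re.findall to each element. Note that the pattern accepts any non-digit before the id, not only whitespace or an underscore, which is slightly looser than the stated requirement. In Python the shorter (?<!\d)3\d{5}(?!\d) expresses the same digit-boundary idea with negative lookarounds.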

https://regex101.com/r/8AoWeK/1
