How can I extract known number of digits from string with unknown length?

Time:10-04

Let's say I have a couple of strings that look like:

data_20220110_073030.gz
ndsfhsfihso_20100330-100210.gz
l0dnd74n-19981001.180800.gz

I only want to extract the parts of each string that are exactly 8 or exactly 6 digits long (digits 0-9 only). Ideally, the result for each string would be a single array / list such as:

[20220110,073030]
[20100330,100210]
[19981001,180800]

I know one can use regex, but I can't seem to get it into an array.

CodePudding user response:

You may use the following pattern:

(?<!\d)\d{6}(?:\d\d)?(?!\d)


Details:

  • (?<!\d) - Not immediately preceded by a digit.
  • \d{6} - Match exactly 6 digits.
  • (?:\d\d)? - And (optionally) two more digits.
  • (?!\d) - Not immediately followed by a digit.
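
To see what the two lookarounds buy you, here is a small sketch that runs the pattern against digit runs of length 5 to 9. The sample strings are made up for illustration; only runs of exactly 6 or exactly 8 digits should come back.

```python
import re

pattern = re.compile(r"(?<!\d)\d{6}(?:\d\d)?(?!\d)")

# Hypothetical inputs (not from the question): digit runs of length 5 to 9.
samples = ["a_12345_b", "a_123456_b", "a_1234567_b", "a_12345678_b", "a_123456789_b"]

# Map each sample to the list of matches the pattern finds in it.
results = {s: pattern.findall(s) for s in samples}
for s, found in results.items():
    print(s, found)

# For comparison: without the lookarounds, a 9-digit run yields a partial match.
unguarded = re.findall(r"\d{6}", "a_123456789_b")
print(unguarded)
```

The runs of 5, 7 and 9 digits give an empty list, while the unguarded `\d{6}` picks `123456` out of the 9-digit run.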

Python example:

import re

regex = r"(?<!\d)\d{6}(?:\d\d)?(?!\d)"
test_str = """data_20220110_073030.gz
ndsfhsfihso_20100330-100210.gz
l0dnd74n-19981001.180800.gz"""

arr = re.findall(regex, test_str)
print(arr)

Output:

['20220110', '073030', '20100330', '100210', '19981001', '180800']

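
If you want one list per filename, as in the question, rather than one flat list, a possible variation is to run findall on each string separately. This is only a sketch: the names DIGIT_RUN, filenames and per_file are mine, and it assumes the filenames are already in a Python list.

```python
import re

# Same pattern as above: a run of exactly 6 or exactly 8 digits.
DIGIT_RUN = re.compile(r"(?<!\d)\d{6}(?:\d\d)?(?!\d)")

filenames = [
    "data_20220110_073030.gz",
    "ndsfhsfihso_20100330-100210.gz",
    "l0dnd74n-19981001.180800.gz",
]

# One list of matches per filename, instead of one flat list.
per_file = [DIGIT_RUN.findall(name) for name in filenames]

for pair in per_file:
    print(pair)
```

The matches stay as strings on purpose: converting '073030' to an int would drop the leading zero.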

CodePudding user response:

You can use Python's regular expression module, re, to find the sequences of characters that match the search pattern you are looking for.

Example

import re

text = 'data_20220110_073030.gz ndsfhsfihso_20100330-100210.gz l0dnd74n 19981001.180800.gz'

x = re.findall(r'\d\d\d\d\d\d', text)  # 6-digit sequences; also matches the first 6 digits of each 8-digit run
y = re.findall(r'\d\d\d\d\d\d\d\d', text)  # 8-digit sequences

print(y)
print(x)

You can improve on that by having a function create the pattern based on the number of digits you want:

import re

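def digitSequence(length: int, text: str):
    # Build the pattern by appending r'\d' once per requested digit.
    pattern = ''
    for i in range(length):
        pattern += r'\d'

    return re.findall(pattern, text)  # returns a list of the matches found

# text is the string defined in the example above
print(digitSequence(8, text))
print(digitSequence(6, text))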
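
One caveat: a pattern made of `length` repetitions of \d also matches inside longer digit runs, so asking for 6 digits picks up the first six digits of each 8-digit date as well. A possible variant, sketched below, borrows the lookarounds from the first answer so that only runs of exactly `length` digits match. The name exact_digit_runs is hypothetical, not part of either answer.

```python
import re

def exact_digit_runs(length: int, text: str) -> list:
    # (?<!\d) and (?!\d) reject matches that touch another digit on either
    # side, so only runs of exactly `length` digits are returned.
    pattern = rf"(?<!\d)\d{{{length}}}(?!\d)"
    return re.findall(pattern, text)

text = 'data_20220110_073030.gz ndsfhsfihso_20100330-100210.gz l0dnd74n 19981001.180800.gz'

eight = exact_digit_runs(8, text)
six = exact_digit_runs(6, text)
print(eight)
print(six)
```

With this version the 6-digit call returns only the three time stamps, and the 8-digit call returns only the three dates.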
