How can I extract a known number of digits from a string of unknown length?

Let's say I have a couple of strings that look like:

data_20220110_073030.gz
ndsfhsfihso_20100330-100210.gz
l0dnd74n-19981001.180800.gz

I only want to extract the parts of each string that are runs of exactly 8 or 6 digits (characters 0-9 only). Ideally, the output for each string would be a single array / list, such as:

[20220110,073030]
[20100330,100210]
[19981001,180800]

I know one can use regex, but I can't seem to get it into an array.

CodePudding user response：

You may use the following pattern:

(?<!\d)\d{6}(?:\d\d)?(?!\d)


Details:

(?<!\d) - Not immediately preceded by a digit.
\d{6} - Match exactly 6 digits.
(?:\d\d)? - And (optionally) two more digits.
(?!\d) - Not immediately followed by a digit.
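
To illustrate why the two lookarounds matter, here is a minimal Python sketch comparing a bare \d{6} with the guarded pattern. The variable names (sample, unguarded, guarded) are only for illustration.

```python
import re

sample = "data_20220110_073030.gz"

# Without lookarounds, \d{6} also matches the first six digits of the 8-digit run.
unguarded = re.findall(r"\d{6}", sample)

# With lookarounds, only complete runs of 6 or 8 digits are returned.
guarded = re.findall(r"(?<!\d)\d{6}(?:\d\d)?(?!\d)", sample)

print(unguarded)  # ['202201', '073030']
print(guarded)    # ['20220110', '073030']
```

A run of 7 or 9 digits produces no match at all with the guarded pattern, because the lookarounds forbid a digit on either side of the match.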

Python example:

import re

regex = r"(?<!\d)\d{6}(?:\d\d)?(?!\d)"
test_str = """data_20220110_073030.gz
ndsfhsfihso_20100330-100210.gz
l0dnd74n-19981001.180800.gz"""

arr = re.findall(regex, test_str)
print(arr)

Output:

['20220110', '073030', '20100330', '100210', '19981001', '180800']

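
The question asks for one list per string rather than a single flat list. A minimal sketch of that, assuming the file names are already held in a Python list (the names list and per_name variable are illustrative, not from the original post):

```python
import re

# Same pattern as above, compiled once for reuse.
pattern = re.compile(r"(?<!\d)\d{6}(?:\d\d)?(?!\d)")

names = [
    "data_20220110_073030.gz",
    "ndsfhsfihso_20100330-100210.gz",
    "l0dnd74n-19981001.180800.gz",
]

# Run findall on each name separately so the matches stay grouped per string.
per_name = [pattern.findall(name) for name in names]

for row in per_name:
    print(row)
# ['20220110', '073030']
# ['20100330', '100210']
# ['19981001', '180800']
```

The matches are kept as strings on purpose: converting them to int would drop the leading zero in values such as 073030.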

CodePudding user response：

You can use Python's regular expression module, re, to find the character sequences that match the search pattern you are looking for.

Example

import re

text = 'data_20220110_073030.gz ndsfhsfihso_20100330-100210.gz l0dnd74n 19981001.180800.gz'

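# Raw strings keep \d intact; the lookarounds stop the 6-digit pattern
# from matching inside a longer run such as 20220110.
x = re.findall(r'(?<!\d)\d{6}(?!\d)', text)  # runs of exactly 6 digits
y = re.findall(r'(?<!\d)\d{8}(?!\d)', text)  # runs of exactly 8 digits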

print(y)
print(x)

You can improve on that with a function that builds the pattern based on the number of digits you want:

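import re

# Same sample text as in the previous example
text = 'data_20220110_073030.gz ndsfhsfihso_20100330-100210.gz l0dnd74n 19981001.180800.gz'

def digitSequence(length: int, text: str):
    # Match exactly `length` digits; the lookarounds reject longer runs
    pattern = r'(?<!\d)\d{%d}(?!\d)' % length

    return re.findall(pattern, text)  # returns a list of the matches found

print(digitSequence(8, text))
print(digitSequence(6, text))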