split string until a 5-7 digit number is found in python

Time:04-30

I have strings like the following:

1338516 -...pair - 5pk 1409093 -...re Wax 3Pk
1409085 -...dtnr - 5pk 1415090 -...accessories
490663 - 3 pack 1490739 -...2 - 3 pack

What I'm trying to do is split these strings so that the first string is 1338516 -...pair - 5pk and the second is 1409093 -...re Wax 3Pk.

Currently, I'm able to extract the numbers using the following code:

lst = list(filter(lambda k: '...' in k, reqText))
lst1 = ''.join(lst)
numbers = re.findall(r'\d+', lst1)
numbers1 = [x for x in numbers if len(x) > 3]

Any suggestions?

CodePudding user response:

You could use split with a pattern:

[^\S\n]+(?=\d{5,7}\b)

Explanation

  • [^\S\n]+ Match 1 or more whitespace characters other than a newline
  • (?=\d{5,7}\b) Positive lookahead, assert 5-7 digits to the right followed by a word boundary

Regex demo

import re

pattern = r"[^\S\n]+(?=\d{5,7}\b)"

lst = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack"
]

for s in lst:
    print(re.split(pattern, s))

Output

['1338516 -...pair - 5pk', '1409093 -...re Wax 3Pk']
['1409085 -...dtnr - 5pk', '1415090 -...accessories']
['490663 - 3 pack', '1490739 -...2 - 3 pack']
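
Because the lookahead only matches before a standalone run of 5-7 digits, the same split should also cope with a line that holds more than two records. A minimal sketch, assuming such lines can occur in the real data: the combined line below is simply the three sample strings joined with spaces, and `split_records` is a hypothetical helper name, not part of the original answer.

```python
import re

# Split on horizontal whitespace that is directly followed by a
# standalone 5-7 digit number (the split pattern from this answer).
PATTERN = re.compile(r"[^\S\n]+(?=\d{5,7}\b)")


def split_records(line):
    """Return the records in `line`, each starting with a 5-7 digit number."""
    return PATTERN.split(line)


samples = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack",
]

# Hypothetical input: all sample strings placed on one line.
combined = " ".join(samples)

for record in split_records(combined):
    print(record)
```

Short tokens such as 5pk, 3Pk, or a lone 3 are not split points, since none of them is a 5-7 digit number ending at a word boundary.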

Another option could be a matching approach:

\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)

Regex demo
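
The matching approach can be tried with re.findall, which returns the full matches when the pattern has no capture groups. This is a small sketch using the sample strings from the question:

```python
import re

# Each match starts at a standalone 5-7 digit number and lazily extends
# until just before the next such number, or to the end of the string.
pattern = re.compile(r"\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)")

lst = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack",
]

for s in lst:
    print(pattern.findall(s))
```

Unlike re.split, this variant drops any text that appears before the first 5-7 digit number, which may or may not be what you want.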

CodePudding user response:

You can use

^(.+?)\s*\b(\d{5,7}\b.*)

See the regex demo.

In Python, use a raw string literal to declare this regex:

pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'

Details:

  • ^ - start of string
  • (.+?) - Group 1: one or more (but as few as possible) occurrences of any char other than line break chars
  • \s* - zero or more whitespaces
  • \b - a word boundary
  • (\d{5,7}\b.*) - Group 2: a five- to seven-digit number, a word boundary, and the rest of the line.

See a Python demo:

import re
text = "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk"
pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'
m = re.search(pattern, text)
if m:
    print(m.group(1)) # => 1338516 -...pair - 5pk
    print(m.group(2)) # => 1409093 -...re Wax 3Pk
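
If some strings contain only one record, re.search returns None for them. One possible way to handle that is sketched below; `split_once` is a hypothetical helper name, and returning None for the missing second part is an assumption about the desired behaviour.

```python
import re

PATTERN = re.compile(r"^(.+?)\s*\b(\d{5,7}\b.*)")


def split_once(text):
    """Split `text` before the next standalone 5-7 digit number, if any."""
    m = PATTERN.search(text)
    if m:
        return m.group(1), m.group(2)
    # Assumption: keep the whole string and use None when nothing matches.
    return text, None


print(split_once("1338516 -...pair - 5pk 1409093 -...re Wax 3Pk"))
print(split_once("490663 - 3 pack"))
```

Note that this pattern splits only once: if a line holds three or more records, everything from the second number onward ends up in the second element.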

If you need to use it in a Pandas dataframe, you can use

df[['result_col_1', 'result_col_2']] = df['source'].str.extract(pattern, expand=True)