I have strings like the following:
1338516 -...pair - 5pk 1409093 -...re Wax 3Pk
1409085 -...dtnr - 5pk 1415090 -...accessories
490663 - 3 pack 1490739 -...2 - 3 pack
What I'm trying to do is split these strings so that the first string is 1338516 -...pair - 5pk and the second one is 1409093 -...re Wax 3Pk.
Currently, I'm able to extract the numbers using the following code:
lst = list(filter(lambda k: '...' in k, reqText))
lst1 = ''.join(lst)
numbers = re.findall(r'\d+', lst1)
numbers1 = [x for x in numbers if len(x) > 3]
Any suggestions?
CodePudding user response:
You could use split with a pattern:
[^\S\n]+(?=\d{5,7}\b)
Explanation
[^\S\n]+
- Match 1 or more whitespace characters other than a newline
(?=\d{5,7}\b)
- Positive lookahead, assert 5-7 digits to the right followed by a word boundary
import re
pattern = r"[^\S\n]+(?=\d{5,7}\b)"
lst = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack"
]
for s in lst:
    print(re.split(pattern, s))
Output
['1338516 -...pair - 5pk', '1409093 -...re Wax 3Pk']
['1409085 -...dtnr - 5pk', '1415090 -...accessories']
['490663 - 3 pack', '1490739 -...2 - 3 pack']
Another option could be a matching approach:
\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)
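As a minimal sketch of how the matching approach could be used, the snippet below applies that pattern with re.findall to the sample strings from the question. The variable name results is only illustrative.

```python
import re

# A 5-7 digit ID, then as few characters as possible up to either
# whitespace followed by the next 5-7 digit ID, or the end of the string.
pattern = r"\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)"

lst = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack",
]

# findall returns every non-overlapping match, one per ID-prefixed item
results = [re.findall(pattern, s) for s in lst]
for parts in results:
    print(parts)
```

Unlike re.split, this only returns pieces that start with a 5-7 digit number, so any text before the first ID in a string would be dropped.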
CodePudding user response:
You can use
^(.+?)\s*\b(\d{5,7}\b.*)
See the regex demo.
In Python, use a raw string literal to declare this regex:
pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'
Details:
^
- start of string
(.+?)
- Group 1: one or more (but as few as possible) occurrences of any char other than line break chars
\s*
- zero or more whitespaces
\b
- a word boundary
(\d{5,7}\b.*)
- Group 2: a five- to seven-digit number, a word boundary and the rest of the line.
See a Python demo:
import re
text = "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk"
pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'
m = re.search(pattern, text)
if m:
    print(m.group(1)) # => 1338516 -...pair - 5pk
    print(m.group(2)) # => 1409093 -...re Wax 3Pk
If you need to use it in a Pandas dataframe, you can use
df[['result_col_1', 'result_col_2']] = df['source'].str.extract(pattern, expand=True)