I want my regex to skip any number that is immediately followed by the letter k.
For example, I have the following text:
string1 = ['28k ring to be worn',
'90w20h','96k watch', 'final price']
string2 = ['28k ring to be worn',
'90.8w20.6h','96k watch', 'final price']
string3 = ['28k ring to be worn',
'90.8 w 20.6h','96k watch', 'final price']
string4 = ['28k ring to be worn',
'90.8 20.6h','96k watch', 'final price']
I want to extract only the numerical values from the second string in each list. However, my regex also captures the numbers that are followed by the letter k. In my dataset there is always at least one number followed by the letter k; the second string contains different numbers and follows one of the 4 patterns above.
I have tried the following:
for s in string1:
    print(re.findall(r'[*0-9]+[.?\d]+', s))
This captures what I need, but it also grabs the numbers that are followed by the letter k.
Essentially, the expected output is:
['90','20']
['90.8','20.6']
['90.8','20.6']
['90.8','20.6']
CodePudding user response:
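You can match the numbers followed by k without capturing them, and then match and capture any other numbers. re.findall returns an empty string for the matches where the capturing group did not participate, so those are filtered out afterwards: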
import re
strings = [
['28k ring to be worn', '90w20h','96k watch', 'final price'],
['28k ring to be worn', '90.8w20.6h','96k watch', 'final price'],
['28k ring to be worn','90.8 w 20.6h','96k watch', 'final price'],
['28k ring to be worn', '90.8 20.6h','96k watch', 'final price']
]
for text in strings:
    # First alternative consumes numbers followed by k; only the second alternative captures
    matches = re.findall(r'\d+(?:\.\d+)?k|(\d+(?:\.\d+)?)', ' '.join(text), re.I)
    print([m for m in matches if m != ''])
Output:
['90', '20']
['90.8', '20.6']
['90.8', '20.6']
['90.8', '20.6']
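As an alternative to the "match and discard" alternation, lookarounds can reject the numbers followed by k directly, so no filtering step is needed. This is only a sketch of a different technique, not the pattern from the answer above; the names NUMBER_NOT_K and extract_numbers are made up for illustration, and it assumes the k always touches the number with no space in between.

```python
import re

# Hypothetical pattern: a number that does not start in the middle of another
# number, and is not followed by more digits, a decimal part, or the letter k.
NUMBER_NOT_K = re.compile(r'(?<![\d.])\d+(?:\.\d+)?(?!\.?\d|k)', re.I)

def extract_numbers(items):
    """Return all numbers in a list of strings that are not followed by k."""
    return NUMBER_NOT_K.findall(' '.join(items))

strings = [
    ['28k ring to be worn', '90w20h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8w20.6h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8 w 20.6h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8 20.6h', '96k watch', 'final price'],
]

for text in strings:
    print(extract_numbers(text))
```

The lookbehind and the digit check in the lookahead are there so that the engine cannot backtrack and return the 2 of 28k as a match.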
CodePudding user response:
I used the pattern (\d+\.?\d*?)\s*?w?\s*?(\d+\.?\d*?)\s*?h, which covers all the cases in your example: two numbers separated by an optional w and optional whitespace, followed by h.
import re
string1 = ['28k ring to be worn',
'90w20h','96k watch', 'final price']
string2 = ['28k ring to be worn',
'90.8w20.6h','96k watch', 'final price']
string3 = ['28k ring to be worn',
'90.8 w 20.6h','96k watch', 'final price']
string4 = ['28k ring to be worn',
'90.8 20.6h','96k watch', 'final price']
pattern = r"(\d+\.?\d*?)\s*?w?\s*?(\d+\.?\d*?)\s*?h"
output = []
for lst_strings in [string1, string2, string3, string4]:
    for string in lst_strings:
        search = re.findall(pattern, string)
        if search:
            # Accumulate the matches instead of overwriting earlier ones
            output += search
print(output)
Output:
[('90', '20'), ('90.8', '20.6'), ('90.8', '20.6'), ('90.8', '20.6')]
I know you wanted lists instead of tuples, but that is quite easy to fix.
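One way to get lists instead of tuples is to convert each match as it is collected. The following is a minimal sketch that reuses the pattern from this answer; the function name width_height_pairs and the variable samples are made up for illustration.

```python
import re

pattern = re.compile(r"(\d+\.?\d*?)\s*?w?\s*?(\d+\.?\d*?)\s*?h")

def width_height_pairs(list_of_lists):
    """Collect every (number, number) match as a list rather than a tuple."""
    pairs = []
    for items in list_of_lists:
        for item in items:
            # findall yields tuples because the pattern has two groups
            pairs.extend(list(match) for match in pattern.findall(item))
    return pairs

samples = [
    ['28k ring to be worn', '90w20h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8w20.6h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8 w 20.6h', '96k watch', 'final price'],
    ['28k ring to be worn', '90.8 20.6h', '96k watch', 'final price'],
]

for pair in width_height_pairs(samples):
    print(pair)
```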