I am running OCR, saving the result in a text file, and extracting floating point numbers from that result. The numbers sometimes come with a $ sign and a comma as the thousands separator, and can range from 0.01 to 99,999.00. They can be bigger, but that range covers almost all of them. My code does a fantastic job for the most part but always misses the single digit numbers. So if the number falls between 0.01 and 9.99, this pattern does not pick it up. It picks up everything else.
Obviously I personally don't see anything wrong with this code, so I am looking for suggestions to improve it so that it picks up single digit numbers as well.
The code is below. \d{1,2}? means there can be 0 or 1 occurrence of 1 or 2 digits. \,? means there can be 0 or 1 occurrence of a comma. \d{1,3} means there will be at most 3 and at least 1 digit. \. means there will be a dot. \d{2} means there will be 2 digits after that dot.
That is how I interpret my pattern. I know the pattern is the issue here because I tried doing other things with the same pattern, with the same intention of picking up all the numbers, but it religiously misses all the single digit numbers as if they don't exist. I need to change that. Any and all suggestions are welcome. Thank you.
# extract the numbers and print them
import re
textfile = open('result.txt', 'r')
pattern = re.compile(r'\d{1,2}?\,?\d{1,3}\.\d{2}')
for line in textfile:
    matches = pattern.findall(line)
    print(matches)
with open("result.txt", "w") as f:
    f.write(str(matches))
CodePudding user response:
\d{1,2}? does not match 0 or 1 occurrences of 1 or 2 digits; the ? simply makes the match lazy (see note below). So when you have an input of (for example) 0.01, the \d{1,2}? matches the 0 and then there is nothing for the \d{1,3} to match.
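To see the failure described above in isolation, here is a minimal sketch that runs the question's original pattern against a few values. The sample strings are hypothetical and chosen only for illustration.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r'\d{1,2}?\,?\d{1,3}\.\d{2}')

# Hypothetical sample values, not taken from the asker's OCR output.
samples = ['0.01', '9.99', '12.34', '4,234.01']

# Map each sample to the list of matches found in it.
results = {s: original.findall(s) for s in samples}
for s, found in results.items():
    print(s, found)
```

The two inputs with a single digit before the dot should come back as empty lists, because \d{1,2}? has to consume that digit and leaves none for \d{1,3}.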
Since I presume you don't want to match something like ,123.45, you should just make the entire leading digits plus comma part optional, i.e.
(?:\d{1,2},)?\d{1,3}\.\d{2}
Note: If you put a capture group around \d{1,2}?, you'll see that with an input of 123.45 it matches 1 (lazy), whereas if you take the ? away it will match 12 (greedy). In both cases it matches at least 1 digit.
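The lazy versus greedy behaviour in the note can be checked with a short sketch. The capture group is added here purely for inspection and is not part of either pattern in this thread.

```python
import re

text = '123.45'

# Lazy quantifier: the group takes as few digits as it can (but at least one).
lazy = re.search(r'(\d{1,2}?),?\d{1,3}\.\d{2}', text)

# Greedy quantifier: the group takes as many digits as it can (up to two).
greedy = re.search(r'(\d{1,2}),?\d{1,3}\.\d{2}', text)

print(lazy.group(1))    # 1
print(greedy.group(1))  # 12
```

Both versions match the whole of 123.45; only the split between the group and the following \d{1,3} differs.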
Sample Python code:
import re
pattern = re.compile(r'(?:\d{1,2},)?\d{1,3}\.\d{2}')
text = '''0.01
99.34
4,234.01
1,234
23.0
99,999.99'''
for t in text.split('\n'):
    print(pattern.findall(t))
Output:
['0.01']
['99.34']
['4,234.01']
[]
[]
['99,999.99']
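If the goal is to end up with actual floating point values, one possible follow-up is to strip the thousands comma from each match and pass it to float. This is only a sketch: the helper name extract_amounts and the sample line are assumptions made for illustration, not part of the original code.

```python
import re

# The corrected pattern from this answer.
pattern = re.compile(r'(?:\d{1,2},)?\d{1,3}\.\d{2}')

def extract_amounts(line):
    """Return the numbers found in line as floats (hypothetical helper)."""
    # Remove the thousands separator before converting each match.
    return [float(m.replace(',', '')) for m in pattern.findall(line)]

# Hypothetical OCR line; the $ sign is simply skipped by the pattern.
amounts = extract_amounts('Total $4,234.01 tax $0.01')
print(amounts)
```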