In the text corpus I'm working on, there are various types of numbers to be captured, including:
- 10 000
- 10,000
- -3.33 × 10-3
- 8.× 104 (this is actually 8.×10**4, no whitespace between 8. and ×)
- 5×104 (this is actually 5×10**4)
- 12.123
- 12
How can I construct a regular expression that safely captures all seven types of numbers? I started from a regular expression that captures floating-point numbers, [+-]?(?:[0-9]*\.?)[0-9]+, and came up with [+-]?(?:[0-9]*[\.\,\s\×])?\s?[0-9]+\-?[0-9]?. However, this does not cover all seven possibilities, and the result will get tedious if I simply keep modifying it.
Is there an elegant solution?
UPDATE
Based on EatenbyaGuru's advice, I came up with three regular expressions to cover the seven possibilities:
digit_part0 = r"[+-]?[0-9]*\.[0-9]+\s?(?:×\s?10-?[0-9])?"
digit_part1 = r"[+-]?[1-9][0-9]*\s?(?:×\s?10-?[0-9])?"
digit_part2 = r"[+-]?[1-9][0-9]*[\s,][0-9]+(?:\.[0-9]*)?"
Here, digit_part0 is meant to cover cases #3, #4 and #6; digit_part1 covers cases #5 and #7; and digit_part2 covers cases #1 and #2. At this point, digit_part0 and digit_part1 can still match overlapping cases.
CodePudding user response:
There is no safe generic way to do this. You can only work from your input data and cover the cases that are in there, which always means that you need to adapt and make compromises.
The following regex targets the numbers defined in your sample:
-?\d[\d .,]*\b
and matches like this:
- 10 000 (→ 10 000)
- 10,000 (→ 10,000)
- -3.33 × 10-3 (→ -3.33, 10, -3)
- 8.× 104 (→ 8, 104)
- 5×104 (→ 5, 104)
- 12.123 (→ 12.123)
- 12 (→ 12)
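To check the matches listed above yourself, here is a minimal Python sketch. It assumes Python's re module and uses the seven sample strings from the question; the names NUMBER and SAMPLES are only illustrative.

```python
import re

# The "number" pattern from this answer, copied verbatim.
NUMBER = re.compile(r"-?\d[\d .,]*\b")

# The seven sample strings from the question (illustrative test data).
SAMPLES = [
    "10 000",
    "10,000",
    "-3.33 × 10-3",
    "8.× 104",
    "5×104",
    "12.123",
    "12",
]

for text in SAMPLES:
    # findall returns every non-overlapping match, scanning left to right.
    print(f"{text!r} -> {NUMBER.findall(text)}")
```

Note that the scientific-notation samples come back as several separate numbers, because × and the exponent sign are not part of this pattern.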
If you want to match expressions, you could say X(?:YX)*, where X is the regex for numbers and Y is the regex for allowed operators, including surrounding fluff (e.g. whitespace).
So if we say the allowed operators should be \.? *(?:×|-) * for now (the \. is only in there to cover your 8.× 104 case), you would end up with:
-?\d[\d .,]*\b(?:\.? *(?:×|-) *-?\d[\d .,]*\b)*
which matches like this:
- 10 000 (→ 10 000)
- 10,000 (→ 10,000)
- -3.33 × 10-3 (→ -3.33 × 10-3)
- 8.× 104 (→ 8.× 104)
- 5×104 (→ 5×104)
- 12.123 (→ 12.123)
- 12 (→ 12)
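One way to keep the X(?:YX)* structure easy to maintain is to build the full pattern from its two components. The sketch below assumes Python; the names NUMBER, OPERATOR and EXPRESSION are hypothetical, and the sample strings are the ones from the question.

```python
import re

# X: the "number" component from this answer.
NUMBER = r"-?\d[\d .,]*\b"
# Y: the "operator" component, including optional surrounding spaces.
OPERATOR = r"\.? *(?:×|-) *"

# X(?:YX)* assembled from the two parts.
EXPRESSION = re.compile(rf"{NUMBER}(?:{OPERATOR}{NUMBER})*")

SAMPLES = [
    "10 000",
    "10,000",
    "-3.33 × 10-3",
    "8.× 104",
    "5×104",
    "12.123",
    "12",
]

for text in SAMPLES:
    # Each sample should now come back as one single match.
    print(f"{text!r} -> {EXPRESSION.findall(text)}")
```

Keeping the two parts as separate strings means you can tighten the number pattern or allow more operators without rewriting the whole expression.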
I'm sure you will find cases where this is not specific enough, or not generic enough. Update the "number" and "operator" components as needed.
For example, -?\d[\d .,]*\b might be too simplistic. There is nothing that stops it from matching things like 1,,,,,0 or 10000.0,0.0,0. If things like this won't occur in your input data, it's probably fine as is. If you need to make it smarter so it recognizes legal digit grouping or discards nonsensical delimiter combinations, make it smarter. It all depends.
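As a rough illustration of what "smarter" could look like, here is a sketch of a stricter number check in Python. It rests on assumptions that may not hold for your corpus: digit groups are exactly three digits long, the group separator is a single space or a single comma used consistently within one number, and the decimal point is a dot. The names STRICT_NUMBER and is_plausible_number are hypothetical.

```python
import re

# Assumed rules (adjust to your data):
#   - optional leading minus sign
#   - either 1-3 digits followed by groups of exactly three digits,
#     all joined by the SAME separator (space or comma),
#     or a plain run of digits
#   - optional fractional part introduced by a dot
STRICT_NUMBER = re.compile(
    r"-?"
    r"(?:\d{1,3}(?P<sep>[ ,])\d{3}(?:(?P=sep)\d{3})*|\d+)"
    r"(?:\.\d+)?"
)


def is_plausible_number(text: str) -> bool:
    """Return True if the whole string looks like one well-formed number."""
    return STRICT_NUMBER.fullmatch(text) is not None


for candidate in ["10 000", "10,000", "12.123", "12", "-3.33",
                  "1,,,,,0", "10000.0,0.0,0", "10 000,000"]:
    print(f"{candidate!r}: {is_plausible_number(candidate)}")
```

This version validates an already isolated candidate with fullmatch. Because it contains a named group, it cannot be pasted into X(?:YX)* twice as is; you would have to drop or rename the group first.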