Regex to extract a number (format xxxxx.yy - no more than two decimal places) from complex string --CodePudding

The requirement for this arose in answering this question here which has only been partially completed, at least from my POV.

For all tables and data SQL test harness, see the fiddle here.

I have a table and data as follows:

CREATE TABLE payment (id INT, amount TEXT NOT NULL);

and

INSERT INTO payment VALUES                          -- Desired result

(1, 'KES 0.80__asdfa   .80..98..00sadf   '),        --  0.80 or .8 or .80
(2, '00 a 0..0...00asafd 0000013..0.85...0000'),    --  0.85   or .85
(3, '0sf00..0...0 00.000 X013....12.851...0000'),   --  12.85
(4, '000..0...007600.00 0013..12.....0000'),        --  7600
(5, '0afs0 sdff 00...00000.56.....000343..343.0'),  --  0.56 or .56
(6, 'fs0 sdff 00...0.000710x00.56..asfd0003..3.0'); --  0.56 or .56, NOT 0.00071 or 710

OK, so I want to pull out valid numbers of the form xxxx.yy where the number of x's is arbitrary and the number of y's can be at most 2 (but being able to specify #x would be the icing on the cake).

Explanation of requirement:

1 - 0.80 - should be self-evident - it's the first valid number with 2 digits after the decimal.
2 - 0.85 and not 13 because 13.. isn't a valid number - if it was 13.00 (or two other digits), then OK, but not 13.0. for example.
3 - 12.85 and not 12.851 because of the 2 y's rule.
4 - 7600 fairly obvious. However 7,600 would not be valid - no separators.
5 - 0.56 - fairly obvious.
6 - Now, this one will separate the wheat from the chaff... it has to be 0.56 and not 0.00071 (2 y's), but, it also can't be 710 because 710x is not valid. A valid number is a run of digits, a decimal point, and then at least two (or more) valid digits which should be truncated.

The truncation rule is not absolute - I can use PostgreSQL to round!

Just to help you on your way :-), I've included the fiddle from my answer - my regex is:

SELECT
  REGEXP_REPLACE(amount, '[^0-9\.] |\.  |\.0|\.{2,}', '', 'g')
FROM payment;

The sample data for this question is at the bottom and also on a 2nd fiddle here. The problem with my regex, I feel, is that it's too "tailored" - with too many pipes (|) for specifics. I'd like something more generic - hence this question.

A few points to note:

I've left a "test harness" in the fiddle so that solutions can be checked. I've done this because in a previous regex question, I sent poor Wiktor Stribiżew (very helpful guy!) all round the houses before we realised that there are "quirks" with PostgreSQL's implementation of regexes that rendered an answer impossible, notably that: Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within them are considered non-capturing.

I don't think that applies in this case, but what I don't want is to be going from one regex site to the other with working code and then finding that it doesn't work on PostgreSQL - so to save wasting anybody's time, I set this up so that you could just put your code into the pattern area specified and run a PostgreSQL fiddle?
a note about PostgreSQL's REGEXP_REPLACE() function.

REGEXP_REPLACE(source, pattern, replacement [, flags ]).

The replacement I've used is '' - i.e. empty string and I've used the 'g' flag (global). If that flag isn't present, then it only does the first occurrence. This might be of use to anybody considering doing two passes or some other solution - can't hurt to know it.
I'm actually interested in learning how to use regular expressions. Check out the last snippet in this fiddle to see what contortions I got up to before I started to go down the regex route - 30 lines down to 3 with regexes. So, please, if you could comment your regex and explain step by step what it does, that would be great, thanks.
And finally, any (small-improvments/comments on my own original fiddle's regex appreciated - if you have time! Anything else you think helpful would be great!

I've endeavoured to be thorough with this question and I hope that the issue is crystal! However, it's tricky enough stuff, so should you require any further input on my part (esp. on the PostgreSQL side), I'd only be delighted to provide any further necessary clarification!

===========================================

Sample (simpler) data for working regex (fiddle here):

INSERT INTO payment VALUES
(1,  'KES 0.80'),             
(2,  'KES  0.80'),            
(3,  'KES .80'),           
(4,  'xyzKES 0.80'),          
(5,  'KES . 0.80'),           
(6,  'KES 0.80afasf'),          
(7,  'KES 0.80__asdfa   ..'),
(8,  'KES 0.80..asdfasdf'),
(9,  'KES 0.8.0..asdfasdf'),
(10, 'KES 0.8.0    ..asdfasdf...');

CodePudding user response：

This isn't a purely regex based solution, but it does produce the desired output:

Query #1

select distinct on (id) -- 5
  id
, amount
, mat
, cast(mat as numeric (10, 2)) mat_num -- 6
from payment
join lateral (
  select 
    array_to_string(_mat, '') mat -- 2
  , rn
  from regexp_matches(amount, '\d \.\d{1,2}', 'g') -- 1
       with ordinality as x(_mat, rn) -- 3 
) x on true
where mat !~ '^0 \.0 $' -- 4
order by id, rn; -- 5

id	amount	mat	mat_num
1	KES 0.80__asdfa .80..98..00sadf	0.80	0.80
2	00 a 0..0...00asafd 0000013..0.85...0000	0.85	0.85
3	0sf00..0...0 00.000 X013....12.851...0000	12.85	12.85
4	000..0...007600.00 0013..12.....0000	007600.00	7600.00
5	0afs0 sdff 00...00000.56.....000343..343.0	00000.56	0.56
6	fs0 sdff 00...0.000710x00.56..asfd0003..3.0	00.56	0.56

View on DB Fiddle

What is happening here:

Use the function regexp_matches with the flag 'g' to extract all sub strings that match <digit>.<digit>{1, 2}. Note that this will not match a substring if it does not have decimal digits. I'm not sure if that is important for you, but all the matches in your example seem to have decimal points.
regexp_matches returns multiple rows for each match, where each row is an array of the matched groups. since we only have 1 group, this array will always have 1 element. It is necessary to convert the array back to a string for further processing.
Also important to assign a sequence id to the match, as we want to keep the first valid match, as we don't want the ordering to be undefined. 1, 2 & 3 are all specified in the lateral join-ed query
since our regex pattern will also match 00.00 and the like, it is important to get rid of these via the where clause
to pick the first valid match, distinct on is used with the rows ordered by the match sequence (rn)
casting the matched string to numeric gets rid of any leading or trailing 0s.