Home > database >  Extract digits from string using snowflake
Extract digits from string using snowflake

Time:01-06

I have a dataset that contains some of the following strings:

  1. some text here 123456 some text here
  2. some text here #123457 some text here
  3. some text here 123458. Some text here
  4. Some invoice 123245. Some text with 543903 and 34550
  5. Two invoices 124356 and 235478 and some products 6783 and 45639
  6. invoice 230943 and invoice 320399. Some text here
  7. inv #430203 and #404039. some text here
  8. Some invoice 134045 and some text 30 3

I am trying to extract only the invoice number from the string, which is always 6 digits and it's always the first numbers on my string.

On cases 1, 2 and 3, I am using regexp_replace(memo, '[^0-9]', '') to get the numbers out of the string and it works perfectly. The problem are the other cases. I don't care about the ones that have less than 6 digits, so they could just be left out. Then, I started using right(regexp_replace(memo, '[^0-9]', ''),6) instead. My problem is the examples 5, 6 and 7, instead of showing only the first 6 numbers, I would like to show something like "Multiple Invoices", or anything that could be flagged.

My ideal output would be:

  1. 123456
  2. 123457
  3. 123458
  4. 123245
  5. Multiple Invoices
  6. Multiple Invoices
  7. Multiple Invoices
  8. 134045

Or instead of showing multiple invoices, to show the second invoice on another column would actually be perfect:

Inv 1 Inv 2
123456
123457
123458
123245
124356 235478
230943 320399
430203 404039
134045

Some considerations are:

  • the invoices on the memo are always 1 or 2, never more than 2 invoices.
  • The invoices always have 6 digits.
  • There's always an "AND" in between the invoice numbers, but it may contain some other stuff like the examples 6 and 7.
  • It's rare, but it can occur to have another 6 digits in sequence that are not invoice, but they are going to be like the example 4, with some text after the invoice number, and never have an "AND" in between the invoice number and the crap 6 digits I don't care.

Is there a way to achieve that without using UDF? Or is there a way to achieve that at all? I am using snowflake SQL.

Thank you so much for your help.

CodePudding user response:

We should be able to use a regex substring approach here:

SELECT
    memo,
    CASE WHEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 2) IS NULL  -- no 2nd invoice
         THEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 1)
         ELSE 'Multiple Invoices' END AS label
FROM yourTable;

The above logic checks for a second invoice number (whose regex pattern is \b\d{6}\b). If not found, it displays just the single invoice number, otherwise it shows the text "Multiple Invoices."

CodePudding user response:

The key is handling example number 4. In addition to checking whether or not there is a second match for six digits, it's necessary to make sure that it's an invoice and not a random 6 digits. One way to do that by nesting regexp_substr. The inside regexp_substr looks for the keywords "invoice", "inv", or "and" immediately before the number. It then looks for six digits with or without the # character prefixing it. The outer regexp_substr strips out the lead-in and returns only the six digits.

select   MEMO
        ,regexp_substr(MEMO, '\\b\\d{6}\\b', 1, 1) INVOICE_1
        ,case when INVOICE_1 is not null then
            regexp_substr(regexp_substr(MEMO, '(and|and\\s invoice|and\\s inv)[\\s#]\\d{6}\\b', 1, 1), '\\d{6}')
            else null
         end as INVOICE_2
from T1
;
MEMO INVOICE_1 INVOICE_2
some text here 123456 some text here 123456 null
some text here #123457 some text here 123457 null
some text here 123458. Some text here 123458 null
Some invoice 123245. Some text with 543903 and 34550 123245 null
Two invoices 124356 and 235478 and some products 6783 and 45639 124356 235478
invoice 230943 and invoice 320399. Some text here 230943 320399
inv #430203 and #404039. some text here 430203 null
Some invoice 134045 and some text 30 3 134045 null

This may not cover every contingency, but it should be close. You can alter the regular expression to filter out noise and capture signal.

  • Related