Home > Enterprise >  Regular Expression to Search Quantity in Human Written Descriptions
Regular Expression to Search Quantity in Human Written Descriptions

Time:07-01

Hello and thank you in advance,

I buy items that have a variety of human written listings on auction sites and forums. Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().

I consider myself to be a intermediate regex user, but this has me stumped, so I need an expert.

Here's a few examples, my desired return, and what I'm getting.

Listing Desired Return Actual Return
Red 1996 Corvette 2x - Matchbox 2 2
3 x SmartCar, broken 2nd door 3 3
2nd edition Kindle (x3) 3 3
**1x** 2008 financial crash notice 1 1
Collectors Edition Beannie Baby, item 204/343 1 4
(6) Nissan window motors (1995-1998 ONLY) 6 N/A
White chevy F150, 1996 1 6
Green bowl, cracked (stored in room 2A5) 1 5

As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.

  • The quantities will only be a single number 1-9. (perhaps reject all numbers > 9?)
  • They'll possibly be precede by or followed by an X or x, with or without a space
  • The quantity may be next to a special character like * , () or -
  • It should ignore all 1st, 2nd, 3rd, - 9th style notation
  • If a number is mixed in a word, like 2A3, it should ignore all

Obviously most description don't have any quantity, so if there's no return or zero, that's fine.

I have something that feels close, and does a reasonable job:

[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)

It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!

Thanks in advance!

CodePudding user response:

You can use

=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d] \d\w*|\d [^\W\d]\w*", "$1$2"), "(\d)")), 1)

Here,

  • REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d] \d\w*|\d [^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
  • REGEXEXTRACT(..., "(\d)")) extracts the first digit left after the replacement
  • =IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.

See the enter image description here

  • Related