Home > Enterprise >  How can I extract specific data from an Google OCR dump text csv file of till receipts
How can I extract specific data from an Google OCR dump text csv file of till receipts

Time:10-19

I am just a beginner learning Python for ML. I am self learning NLP and I have a problem. I have a csv file with OCR read till receipts (100 observations)...one receipt per cell. An example of one receipt is as follows (info in csv reads in this manner):

text = '''ㅎㅎ
Fresh Food
The fresh food people
5619 Lorem PH: 00 0000 0000
204 Some Road
TAX INVOICE - ABN 88 000 014 675
Gift Card Visa Varbl 20to500Dollars
Potato White Washed 2Kg
Nescafe Coffee Blend 43 50g
^HToblerone Milk Chocolate Bar 50g
^HOreo Cookie Original 133g
#M&Ms Crispy 145g
*HRed Rock Deli Portgse Chicken 150g
^HWrigleys Extra Pepprmint 14pc 27g
Qty
57.95
3.50
5.00
0.90
1.50
4.50
3.50
2 @ $1.80
each
3.60
$80.45
9 SUBTOTAL
TOTAL
$80.45
REWARDS SAVINGS
$10.00
Fresh Food
5619
Lorem'''

While, the receipts in general are unstructured (ie. some have more lines and subsidiary info than the other) there are 3 things that are quite structured in its layout and I am attempting to extract that info: SUBTOTAL amount (always appears before TOTAL), TOTAL amount(always appears after SUBTOTAL), and list of items bought (always appears after ABN number and ends before 'Qty')

I started like this:

re.split(r'\s ', text) --->splits the string into a list of words

['ㅎㅎ', 'Freah', 'Food', 'The', 'fresh', 'food', 'people', '5619', 'Unley', 'PH:', '00', '0000', '0000', '204', 'Some', 'Road', 'TAX', 'INVOICE', '-', 'ABN', '88', '000', '014', '675', 'Gift', 'Card', 'Visa', 'Varbl', '20to500Dollars', 'Potato', 'White', 'Washed', '2Kg', 'Nescafe', 'Coffee', 'Blend', '43', '50g', '^HToblerone', 'Milk', 'Chocolate', 'Bar', '50g', '^HOreo', 'Cookie', 'Original', '133g', '#M&Ms', 'Crispy', '145g', '*HRed', 'Rock', 'Deli', 'Portgse', 'Chicken', '150g', '^HWrigleys', 'Extra', 'Pepprmint', '14pc', '27g', 'Qty', '57.95', '3.50', '5.00', '0.90', '1.50', '4.50', '3.50', '2', '@', '$1.80', 'each', '3.60', '$80.45', '9', 'SUBTOTAL', 'TOTAL', '$80.45', 'REWARDS', 'SAVINGS', '$10.00', 'Fresh', 'Food', '5619', 'UNLEY', 'SA', 'TCDM', 'ID.', 'Thank', 'you', 'tor', 'shopping', 'with', 'us', 'STORE', '5619', 'POS', '065', 'TRANS', '8660', '13:39', '04/08/2021']

x = re.search('SUBTOTAL', text)
print(x.group())
y = re.search('TOTAL', text)
print(y.group())

SUBTOTAL

TOTAL

Now I am attempting to get the amounts...so the number before SUBTOTAL (9) and the number after TOTAL ($80.45)...I tried looping through the text list and tried regular expressions, but I am not able to get what I want....Also, I am stumpped as to how I can extract the items bought (string of text between 'ABN 88 000 014 675' and 'Qty' although the split has split the ABN and the number as different words as well...so its again a problem.

What am I doing wrong? and how can I solve this? I thought of looping through text:

for sentence in text:

(but again, then what...how to get to the lines that I want)

Any help will be much appreciated.

CodePudding user response:

You can use the regular expression

(?s)^(?=.* ABN \d{2}(?: \d{3}){3}\n(?P<products>.*\n)Qty\n).*\n\$(?P<subtotal>\d \.\d{2})\n\d  SUBTOTAL\.*\nTOTAL\n\$(?P<total>\d \.\d{2})\n

Capture group products contains the substring of products:

"Gift Card Visa Varbl 20to500Dollars\nPotato White Washed 2Kg\nNescafe Coffee Blend 43 50\n^HToblerone Milk Chocolate Bar 50g\n^HOreo Cookie Original 133\n#M&Ms Crispy 145g\n*HRed Rock Deli Portgse Chicken 150g\n^HWrigleys Extra Pepprmint 14pc 27g(?.*\nz"

which, if desired, can be splint on newlines.

Capture group subtotal contains the subtotal:

9

Capture group total contains the total:

80.45

Demo

The question does not specify the location of the products. Presumably it could precede or follow the subtotal and total. It is for that reason that I've captured them in a positive lookahead that follows the start-of-string anchor ^. That lookahead does not move the engine's internal string pointer beyond the start of the string, ensuring that the subtotal and then total, if present, are found.

The regular expression performs the following operations.

(?s)             # cause periods to match all chars, including newlines
^                # match beginning of string
(?=              # begin positive lookahead
  .*             # match zero or more chars
  \ ABN\         # match ' ABN '
  \d{2}          # match 2 digits
  (?:            # begin a non-capture group
    \ \d{3}      # match a space followed by 3 digits
  ){3}           # end non-capture group, execute thrice 
  \n             # match newline
  (?P<products>  # begin a capture group named 'products'
    .*           # match zero or more chars including newlines           
    \n           # match newline
  )              # end capture group
  Qty\n          # match 'Qty', newline
)                # end positive lookahead
.*\n             # match zero or more chars, newline
\$\d \.\d{2}\n   # match '$', one or more digits, period, 2 digits, newline
(?P<subtotal>    # begin capture group named 'subtotal'
  \d             # match one or more digits
)                # end capture group
\  SUBTOTAL\n    # match one or more spaces, 'SUBTOTAL', newline
TOTAL\n          # match 'TOTAL', newline
\$               # match '$'
(?P<total>       # begin capture group named 'total'       
  \d \.\d{2}     # match one or more digits, period, 2 digits
)                # end capture group
\n               # match newline

I've escaped spaces above to make them more visible.

  • Related