NEED INSIGHT: Using Python I am using a regular expression to capture sample restaurant sales data t

The regular expression I am using is ^\s*(\d+)\s*(([A-Za-z]+\s*)+)?(\d+)\s+(.+?)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)$

I parse and categorize the following sample data string: " 1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

The output is incorrect. Looking at the categorized data before it is converted to JSON, I see 'item_category': 'NA BEVERAGE ', 'item_number': 'BEVERAGE '. It should be 'item_category': 'NA BEVERAGE ', 'item_number': '1100', and so on for the remaining fields.

This is what I actually get:

{'item_rank': '1', 'item_category': 'NA BEVERAGE ', 'item_number': 'BEVERAGE ', 'item_name': '1100', 'number_sold': 'ICED TEA', 'price_sold': '14.00', 'amount': '3.00', 'tax': '42.00', 'cost': '3.50', 'profit': '0.00', 'food_cost': '42.00', 'precent_sales': '0.00', 'cat_sales': '0.52'}

I tried fixing the regular expression multiple times to no avail. An explanation of what's wrong is appreciated.

Here is the logic of the Python script, which you can copy and run on your own machine:

import re
import json

page_text_str = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

sale_line_re = re.compile(r'^\s*(\d+)\s*(([A-Za-z]+\s*)+)?(\d+)\s+(.+?)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)$')
grouped_data = []

for line in page_text_str.split('\n'):
    print(line)   
    match = sale_line_re.match(line)
    if match:
        groups = match.groups()
        item = {
            "item_rank": groups[0],
            "item_category": groups[1],
            "item_number": groups[2],
            "item_name": groups[3],
            "number_sold": groups[4],
            "price_sold": groups[5],
            "amount": groups[6],
            "tax": groups[7],
            "cost": groups[8],
            "profit": groups[9],
            "food_cost": groups[10],
            "precent_sales": groups[11],
            "cat_sales": groups[12]
        }
        grouped_data.append(item)


for sale in grouped_data:
    print(sale)

CodePudding user response：

Instead of building a regex that describes every number, it would be easier to use re.split: split on whitespace that is adjacent to a digit, which leaves the spaces between words untouched. The function returns a list, which you can then iterate over to build the JSON.

(?<=\d)\s|\s(?=\d)

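(?<=\d) is a lookbehind: it asserts that the character just before the current position is a digit
(?=\d) is a lookahead: it asserts that the character just after the current position is a digit
(?<=\d)\s|\s(?=\d) therefore matches a single whitespace character that comes directly after or directly before a digit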

regex101.com
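As a rough sketch of this split-based approach, the snippet below applies the pattern to the sample line from the question and pairs the resulting fields with the question's key names in order. It assumes the fields are separated by single spaces (two consecutive spaces between numbers would produce empty strings), and it strips the line first so the leading spaces do not create empty fields; both are assumptions about the data, not something the question confirms.

```python
import json
import re

line = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

# Split on one whitespace character that directly follows or precedes a digit.
# Spaces between two words (e.g. "ICED TEA") are left alone.
split_re = re.compile(r"(?<=\d)\s|\s(?=\d)")
fields = split_re.split(line.strip())

# Key names copied from the question (including its spelling of "precent_sales").
keys = [
    "item_rank", "item_category", "item_number", "item_name",
    "number_sold", "price_sold", "amount", "tax", "cost",
    "profit", "food_cost", "precent_sales", "cat_sales",
]

# Only build the record when the line has exactly the expected number of fields.
item = dict(zip(keys, fields)) if len(fields) == len(keys) else None
print(json.dumps(item, indent=2))
```

The length check is there because re.split never fails: a malformed line would otherwise be silently mapped to the wrong keys.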

CodePudding user response：

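The issue is that you are repeating a capture group, and a repeated capture group only holds the value of its last iteration. The inner group is group 3, so groups[2] contains 'BEVERAGE ' and every field after it is shifted by one position.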
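A small illustration of that behaviour, using a shortened input rather than the full sales line (the variable names are only for this example):

```python
import re

text = "NA BEVERAGE 1100"

# Repeated capturing group: the outer group holds the whole run of words,
# the inner group holds only the last word it matched.
nested = re.match(r"(([A-Za-z]+\s*)+)", text)
outer_value = nested.group(1)
inner_value = nested.group(2)

# Non-capturing inner group: no extra group is created, so the number
# that follows the words lands in group 2.
flat = re.match(r"((?:[A-Za-z]+\s+)+)(\d+)", text)
flat_groups = flat.groups()

print(repr(outer_value), repr(inner_value))
print(flat_groups)
```

The first print shows 'NA BEVERAGE ' and 'BEVERAGE ', which are exactly the two values that ended up in item_category and item_number in the question.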

You can change (([A-Za-z]+\s*)+)? --> ((?:[A-Za-z]+\s+)+)

This change will:

repeat a non-capturing group, so the whole value is in group 2 and no extra inner group is captured
make group 2 no longer optional
require one or more whitespace characters in each repetition of the non-capturing group

^\s*(\d+)\s*((?:[A-Za-z]+\s+)+)(\d+)\s+(.+?)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+(\d+.\d+)$

See the updated pattern in this regex demo and the Python code
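For reference, here is a sketch of the question's script with the non-capturing group applied. Two things in it go beyond the answer and are my own assumptions: the dot in each decimal is escaped as \. so it only matches a literal period, and each captured value is stripped so the category does not keep its trailing space. The nine identical decimal groups are built by string repetition purely to keep the pattern readable.

```python
import json
import re

page_text_str = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

# rank, category (one or more words), item number, item name, then 9 decimals.
sale_line_re = re.compile(
    r"^\s*(\d+)\s*((?:[A-Za-z]+\s+)+)(\d+)\s+(.+?)"
    + r"\s+(\d+\.\d+)" * 9
    + r"$"
)

# Key names copied from the question; there are 13, one per capture group.
keys = [
    "item_rank", "item_category", "item_number", "item_name",
    "number_sold", "price_sold", "amount", "tax", "cost",
    "profit", "food_cost", "precent_sales", "cat_sales",
]

grouped_data = []
for line in page_text_str.split("\n"):
    match = sale_line_re.match(line)
    if match:
        # Strip each value (assumption: surrounding whitespace is not meaningful).
        values = [value.strip() for value in match.groups()]
        grouped_data.append(dict(zip(keys, values)))

print(json.dumps(grouped_data, indent=2))
```

With the inner capture group gone, the pattern has exactly 13 groups, so they line up one-to-one with the 13 keys.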