How to improve my pattern for regex parsing?

Time:04-28

I want to change the pattern so that it matches not only strings with both an amount and a unit, but also a unit alone. For instance, it should match "cubes" even though no amount is given. Similarly, if the string has only an amount and no unit, it should match the amount alone. Currently, the output is:

['1.0', '0.07', '32.0', '0.12', '1.01', 'cubes', '2']

I want the output to be as follows:

['1.0', '0.07', '32.0', '0.12', '1.01', '1.0', '2.0']

Here is the code:

import re
import numpy as np

list_of_texts = ["1oz", "2ml", "4cup", "1 wedge", "2 slices", "cubes", "2"]

pattern = r"(^[\d -/]+)(oz|ml|cl|tsp|teaspoon|teaspoons|tea spoon|tbsp|tablespoon|tablespoons|table spoon|cup|cups|qt|quart|quarts|drop|drop|shot|shots|cube|cubes|dash|dashes|l|L|liters|Liters|wedge|wedges|pint|pints|slice|slices|twist of|top up|small bottle)"


new_list = []

for text in list_of_texts:
    re_result = re.search(pattern, text)

    if re_result:
        amount = re_result.group(1).strip()
        unit = re_result.group(2).strip()
        print(amount)
        print(unit)

        if "-" in amount:
            ranged = True
        else:
            ranged = False

        amount = re.sub(r"(\d) (/\d)",r"\1\2",amount)
        amount = amount.replace("-","+").replace(" ","+").strip()
        amount = re.sub(r"[+]+","+",amount)
        amount_in_dec = frac_to_dec_converter(amount.split("+"))
        amount = np.sum(amount_in_dec)

        if ranged:
            to_oz = (amount*liquid_units[unit])/2
        else:
            to_oz = amount*liquid_units[unit]

        new_list.append(str(round(to_oz,2)))

    else:
        new_list.append(text)

Note: I have a dictionary, liquid_units, that holds the conversion factor for each unit.

CodePudding user response:

Make the number optional by using * instead of + in the first capture group. Then, if the first capture group is empty, treat the amount as 1.0.

pattern = r"(^[\d -/]*)(oz|ml|cl|tsp|teaspoon|teaspoons|tea spoon|tbsp|tablespoon|tablespoons|table spoon|cup|cups|qt|quart|quarts|drop|drop|shot|shots|cube|cubes|dash|dashes|l|L|liters|Liters|wedge|wedges|pint|pints|slice|slices|twist of|top up|small bottle)"

for text in list_of_texts:
    re_result = re.search(pattern, text)

    if re_result:
        amount = re_result.group(1).strip()
        if amount == '':
            amount = '1.0'
        unit = re_result.group(2).strip()
        print(amount)
        print(unit)

        # rest of your code
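
Note that * alone still leaves "2" unmatched, because the unit group is required, so that text falls through to the else branch unchanged. Also, inside a character class, " -/" is a range running from space to slash, so it accepts characters such as "+" and "." as well. Below is a minimal sketch of one way to make both groups optional. It rests on several assumptions: the LIQUID_UNITS values are hypothetical, picked only to reproduce the numbers in the question; parse_amount is a stand-in for the frac_to_dec_converter helper, which is not shown; and a bare number is assumed to be in ounces already.

```python
import re
from fractions import Fraction

# Hypothetical unit -> ounce factors, chosen only to reproduce the question's numbers.
LIQUID_UNITS = {"oz": 1.0, "ml": 0.034, "cup": 8.0, "wedge": 0.12,
                "slice": 0.505, "slices": 0.505, "cube": 1.0, "cubes": 1.0}

# Longest names first so "cubes" is tried before "cube".
_units = sorted(LIQUID_UNITS, key=len, reverse=True)
# Both groups are optional: an amount (digits, spaces, ".", "/", "-") and a unit.
PATTERN = re.compile(r"([\d\s./-]*)(" + "|".join(map(re.escape, _units)) + r")?")


def parse_amount(raw):
    """Sum space- or dash-separated numbers and fractions; halve a range."""
    raw = re.sub(r"\s*/\s*", "/", raw.strip())    # "1 /2" -> "1/2"
    if not raw:
        return 1.0                                 # unit alone counts as 1
    parts = [p for p in re.split(r"[\s-]+", raw) if p]
    total = float(sum(Fraction(p) for p in parts))
    return total / 2 if "-" in raw else total      # same averaging as the question


def to_ounces(text):
    match = PATTERN.match(text)                    # always matches; groups may be empty
    amount, unit = match.group(1), match.group(2)
    if not amount.strip() and unit is None:
        return text                                # nothing recognised, keep the text
    factor = LIQUID_UNITS[unit] if unit else 1.0   # assumption: bare number is in oz
    return str(round(parse_amount(amount) * factor, 2))


texts = ["1oz", "2ml", "4cup", "1 wedge", "2 slices", "cubes", "2"]
print([to_ounces(t) for t in texts])
```

With these hypothetical factors the list printed is the one requested in the question. Malformed amounts such as "1.2.3" would raise a ValueError in Fraction, so real input may need a try/except around parse_amount.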