Perform a replacement on a string with several calls to the re.sub() method in a specific order and

import re

#Example 1
input_str = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"


#Example 2
input_str = "sumaria 6 cuatrillones 789 billones 320 mil a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"

mil = 1000
2 mil = 2000
322 mil = 322000

1 millon = 1000000
2 millones = 2000000
1 billon = 1000000000000
25 billones = 25000000000000
1 trillon = 1000000000000000000
3 trillones = 3000000000000000000
1 cuatrillon = 1000000000000000000000000

mil = 1 followed by 3 digits

millon = 1 followed by 6 digits

billon = 1 followed by 6 + 6 digits

trillon = 1 followed by 6 + 6 + 6 digits

cuatrillon = 1 followed by 6 + 6 + 6 + 6 digits

From millon upwards, each step adds exactly 6 digits. Positions that are not filled are written as 0, since the decimal system is positional (what matters is the position of each significant digit).

When the word is singular, for example millon, there is always a 1 in front of it: "1 millon", not "1 millones" (the plural adds "es"). If the count is greater than 1, the plural is used, for example "2 trillones" = 2000000000000000000 or "320 billones" = 320000000000000.

"mil" is an exception because it has no plural: 2 thousand is written "2 mil", not "2 miles".

The other exception is that 1 thousand is not written "1 mil"; I write only "mil", and it is understood to mean 1000.
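To make these rules concrete, here is a minimal sketch, assuming the multipliers from the table above. The names MULTIPLIERS and quantity_to_int are only illustrative and are not part of any existing code.

```python
# Assumption: multipliers follow the table above (steps of 10**6 after "mil").
MULTIPLIERS = {
    "mil": 10**3,
    "millon": 10**6,
    "billon": 10**12,
    "trillon": 10**18,
    "cuatrillon": 10**24,
}


def quantity_to_int(text):
    """Convert '<count> <word>' (or a bare 'mil') to an integer."""
    parts = text.split()
    word = parts[-1]
    # Plural forms add "es"; "mil" has no plural.
    if word.endswith("es"):
        word = word[:-2]
    # A missing count means 1, as in "mil" = 1000.
    count = int(parts[0]) if len(parts) == 2 else 1
    return count * MULTIPLIERS[word]


print(quantity_to_int("mil"))          # 1000
print(quantity_to_int("322 mil"))      # 322000
print(quantity_to_int("25 billones"))  # 25000000000000
```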

Proto regex for xxx mil xxx

r"\d{3}[\s|]*(?:mil)[\s|]*\d{3}"

Proto regex for millon, billon, trillon and cuatrillon

r"\d{6}[\s|]*(?:cuatrillones|cuatrillon)[\s|]*\d{6}[\s|]*(?:trillones|trillon)[\s|]*\d{6}[\s|]*(?:billones|billon)[\s|:]*\d{6}[\s|:]*(?:millones|millon)[\s|:]*\d{6}"
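A quick check of the first proto regex, as a sketch: the fixed-width \d{3} on both sides means that amounts with fewer than three digits, or with nothing after "mil", are not matched. Note also that inside a character class `|` is a literal pipe, not alternation, so `[\s|]*` accepts pipe characters as well as whitespace. The `relaxed` pattern is a hypothetical variant for comparison only.

```python
import re

# First proto regex from above, unchanged.
proto = r"\d{3}[\s|]*(?:mil)[\s|]*\d{3}"

print(bool(re.search(proto, "320 mil 459")))  # True
print(bool(re.search(proto, "2 mil")))        # False

# Hypothetical relaxed variant: 1-3 digits before "mil",
# optionally followed by a 1-3 digit units group.
relaxed = r"\b\d{1,3}\s+mil\b(?:\s+\d{1,3}\b)?"

print(re.search(relaxed, "2 mil").group(0))        # 2 mil
print(re.search(relaxed, "320 mil 459").group(0))  # 320 mil 459
```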

This is the output I need to obtain with some replacement method such as re.sub(). The regex is there to make the replacement conditional: a word is replaced only when it is part of one of these numeric amounts; otherwise it must be left alone (as in the output of example 2, where "varios millones o trillones" stays unchanged).

"3000000000000320459 47475822"   #example 1

"sumaria 6000000000789000000320000 a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"   #example 2

How could I improve my regex to be able to perform these replacements correctly? Or maybe it is better to use another method?

CodePudding user response：

(Edit: from your comment, I note that you actually want subsequent matches decreasing in size to be combined into a single number - the below doesn't answer that, I'll leave the code all the same)

Going both ways:

import re

# (multiplier, singular name, plural suffix), largest first.
# Note: these multipliers step by 10**3, not by 10**6 as in the question's table.
NUMBERS = [
    (10**15, 'cuatrillon', 'es'),
    (10**12, 'trillon', 'es'),
    (10**9, 'billon', 'es'),
    (10**6, 'millon', 'es'),
    (10**3, 'mil', '')
]


def num_to_name(n):
    n = int(n) if isinstance(n, str) else n

    for size, name, multi in NUMBERS:
        # >= so that exact multiples such as 1000 are named too
        if n >= size:
            n = n // size
            return f'{n} {name}{multi if n > 1 else ""}'
    return str(n)


def name_to_num(s):
    # drop the plural suffix, then strip " <name>" and multiply what is left
    s = s[:-2] if s.endswith('es') else s
    for size, name, _ in NUMBERS:
        if s.lower().endswith(name):
            return int(s[:-(len(name) + 1)]) * size
    return int(s)


input_str = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"
num_str = re.sub(r'\d+ (?:(?:cuatr|tr|b|m)illon(?:es)?|mil)',
                 lambda match: str(name_to_num(match.group(0))), input_str)
print(num_str)

name_str = re.sub(r'\d+',
                  lambda match: num_to_name(match.group(0)), num_str)
print(name_str)

Output:

creo que hay 330000000000000 2000000000 18000000 320000 459 47475822
creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47 millones

Note that the final result is not exactly the input string, since the input contained a plain number that could also be written with a name (47475822 becomes '47 millones').

The function num_to_name(n) takes an integer (or a string, which it converts to an integer) and writes it using the largest name in NUMBERS that fits, truncating with integer division (so 47475822 becomes '47 millones'). If the number is smaller than every size, it just returns the number as a string.

The function name_to_num(s) takes a string and checks whether it ends in any of the names (with or without plural) defined in NUMBERS. If it does, it tries to convert the rest of the string into an integer and returns that value multiplied by the matching factor. Otherwise, it tries to just return the integer value of the string.

At the bottom, there are two regexes matching the relevant parts of the input string, each using a lambda to replace the matched fragments via one of the two functions.
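For the behaviour mentioned in the edit note at the top of this answer (merging a run of decreasing magnitudes into a single number), here is a rough sketch. It assumes the multipliers from the question's table (steps of 10**6 after "mil") rather than the ones in NUMBERS, it assumes the magnitudes already appear in decreasing order, and it does not handle a bare "mil" without a count. The names SCALE, RUN, combine and words_to_digits are illustrative only.

```python
import re

# Assumption: long-scale multipliers taken from the question's table.
SCALE = {"mil": 10**3, "millon": 10**6, "billon": 10**12,
         "trillon": 10**18, "cuatrillon": 10**24}

WORD = r"(?:(?:cuatr|tr|b|m)illon(?:es)?|mil)\b"
GROUP = rf"\d+\s+{WORD}"
# A run of "<digits> <word>" groups, optionally ending in a bare
# 1-3 digit units group (the "459" in example 1).
RUN = re.compile(rf"\b{GROUP}(?:\s+{GROUP})*(?:\s+\d{{1,3}}\b)?")


def combine(match):
    """Sum count * multiplier over every group in one matched run."""
    total = 0
    for digits, word in re.findall(r"(\d+)(?:\s+([a-z]+))?", match.group(0)):
        singular = word[:-2] if word.endswith("es") else word
        # A group without a word (the trailing units) has multiplier 1.
        total += int(digits) * SCALE.get(singular, 1)
    return str(total)


def words_to_digits(text):
    return RUN.sub(combine, text)


print(words_to_digits("creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"))
# creo que hay 330000002000018320459 47475822
print(words_to_digits("sumaria 6 cuatrillones 789 billones 320 mil a esta otra cantidad"))
# sumaria 6000000000789000000320000 a esta otra cantidad
```

Words that are not preceded by digits, such as "varios millones o trillones", never match RUN and are left untouched.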

CodePudding user response：

I think you shouldn't use pure regex for that, but rather mix in some arithmetic parsing. This is an example of how to solve it (note that it actually evaluates the numbers arithmetically rather than just concatenating them, so the results are somewhat different from what you defined as desired).

import re

input_str1 = "creo que hay 330 trillones 2 billones 18 millones 320 mil 459 47475822"
input_str2 = "sumaria 6 cuatrillones 789 billones 320 mil a esta otra cantidad de elementos  47475822 y eso daría por resultado varios millones o trillones de unidades"


def wrap_word(word: str) -> str:
    # "<digits> <word>", capturing the digits
    return fr"(\d+)\s+\b{word}\b"


def wrap_num(num: int) -> str:
    # replacement text "<digits>*<multiplier>"
    return f"\\1*{str(num)}"


def eval_mult_exp(text: str) -> str:
    # replace every "a*b" with the product
    for op1, op2 in re.findall(r"(\d+)\*(\d+)", text):
        text = re.sub(pattern=op1 + r"\*" + op2, repl=str(int(op1)*int(op2)), string=text)
    return text


def eval_addition_exp(text: str) -> str:
    if not re.search(r"(\d+) (\d+)", text):  # recursion halting condition
        return text

    # add adjacent numbers separated by a single space, then repeat
    for op1, op2 in re.findall(r"(\d+) (\d+)", text):
        text = re.sub(pattern=op1 + " " + op2, repl=str(int(op1) + int(op2)), string=text)
    return eval_addition_exp(text)


def word_to_num(word: str) -> str:
    for pattern, numeric_replacement in [
        (wrap_word("mil"), wrap_num(10**3)),
        (wrap_word("millon(es)?"), wrap_num(10**6)),
        (wrap_word("billon(es)?"), wrap_num(10**9)),
        (wrap_word("trillon(es)?"), wrap_num(10**12)),
        (wrap_word("cuatrillon(es)?"), wrap_num(10**15)),
    ]:
        word = re.sub(pattern, numeric_replacement, word)
    return word


print(eval_addition_exp(eval_mult_exp(word_to_num(input_str2))))

Out[1]:

sumaria 6000789000320000 a esta otra cantidad de elementos 47475822 y eso daría por resultado varios millones o trillones de unidades
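Why this differs from the output requested in the question comes down to the multipliers. As a small arithmetic sketch (the dictionary names are illustrative, and LONG_STEPS is an assumption taken from the question's table), summing the same three groups with each set of multipliers gives:

```python
# Groups found in example 2: (count, singular word).
groups = [(6, "cuatrillon"), (789, "billon"), (320, "mil")]

# Multipliers used by the code above (steps of 10**3).
SHORT_STEPS = {"mil": 10**3, "millon": 10**6, "billon": 10**9,
               "trillon": 10**12, "cuatrillon": 10**15}
# Assumption: multipliers from the question's table (steps of 10**6 after "mil").
LONG_STEPS = {"mil": 10**3, "millon": 10**6, "billon": 10**12,
              "trillon": 10**18, "cuatrillon": 10**24}


def total(groups, scale):
    """Sum count * multiplier over all groups."""
    return sum(count * scale[word] for count, word in groups)


print(total(groups, SHORT_STEPS))  # 6000789000320000
print(total(groups, LONG_STEPS))   # 6000000000789000000320000
```

With the 10**6 steps, plain addition gives the same digits as the positional layout described in the question, so swapping the multipliers passed to wrap_num should be enough to get the requested number for example 2.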

Excuse my Spanish :)