Building a number compression algorithm-CodePudding

I am trying to solve this problem:

Consider a simple number compression algorithm that works as follows:

111155522500 -> [('1', 4), ('5', 3), ('2', 2), ('5', 1), ('O', 2)]

The algorithm goes from left to right through each digit and returns a list of two-element tuples. Each tuple consists of a digit and the number of repetitions of a given digit until the next, different digit in the number is encountered.

Implement a function called compress () that compresses number as described above.

Examples:

[IN]: compress(lll)
[OUT]: [('l', 3)]
[IN]: conpress(1000000)
[OUT]: [('1', 1), ('O', 6)]
[IN]: conpress(10005000)
[OUT]: [('1', 1), ('O', 3), ('5', 1), ('O', 3)]

My code:

#para funcionar, é e necessário acrescentar um caracter que não  existe ao final da string!
def compress(elemento):
    elemento = str(elemento)
    elemento = elemento  "§"# the code only works if I add some char not in the string
    saida = []
    #cont = 0
    contagem = []
    
    for index,ele in enumerate(elemento):
        ele = str(ele)
        #print(f"contagem: {contagem}")
        if len(contagem) == 0 or ele in contagem:
            contagem.append(ele)
       
        if ele not in contagem :
            saida.append((elemento[index-1],len(contagem)) )
            contagem = []
            contagem.append(ele)                      
        
    return saida



compress('1000500011')

OUT:

[('1', 1), ('0', 3), ('5', 1), ('0', 3), ('1', 2)]

The problem: The code only works if I add some char not in the string as "§" Could someone help to solve this problem using the itertools built-in module and the groupby and without this modules?

CodePudding user response：

The earlier post is a perfect example of groupby. But if you're open to more_itertools, here is another version: run_length. It basically doing something similar with groupby - compresses an iterable with run-length encoding. It yields groups of repeated items with the count of how many times they were repeated:

It even provides corresponding decode to decompress the compressed one. Please check it out. So there is no need to reinvent the wheels...

from more_itertools import run_length

num = 10005000

compressed = list(run_length.encode(str(num)))

Output:

[('1', 1), ('0', 3), ('5', 1), ('0', 3)]

So running it in IDE like this:
>>> num = 10005000
>>> compressed = list(run_length.encode(str(num)))
>>> compressed
[('1', 1), ('0', 3), ('5', 1), ('0', 3)]
>>> run_length.decode(compressed)
<itertools.chain object at 0x039931A8>
>>> list(run_length.decode(compressed))
['1', '0', '0', '0', '5', '0', '0', '0']
>>> num = [int(x) for x in run_length.decode(compressed)]
>>> num
[1, 0, 0, 0, 5, 0, 0, 0]

CodePudding user response：

Pretty straight forward with itertools.

def compress(number):
    return [
        (c, sum(1 for _ in g))
        for c, g in itertools.groupby(str(number))
    ]

demo:

>>> compress(10005000)
[('1', 1), ('0', 3), ('5', 1), ('0', 3)]

CodePudding user response：

Corrections to your code with '> Edit---' comments to highlight changes.

Code

def compress(elemento):
    elemento = str(elemento)  # > Edit-----not necessary, since already a string
    
    # > Edit-----Remove adding last character
    #elemento = elemento  "§"# the code only works if I add some char not in the string
    
    saida = []
    #cont = 0
    contagem = []
    
    for index,ele in enumerate(elemento):

        #ele = str(ele)   # > Edit-----not necessary, since already a string
        #print(f"contagem: {contagem}")
        #f len(contagem) == 0 or ele in contagem:     # > Edit-----Inefficient to check entire list
        if len(contagem) == 0 or ele == contagem[-1]: #            More efficient just to compare last element in list
            contagem.append(ele)
       
        #if ele not in contagem :   # > Edit-----Inefficient to check entire list
        if ele != contagem[-1]:     #             More efficient just to compare last element in list
            saida.append((elemento[index-1],len(contagem)) )
            contagem = []
            contagem.append(ele)   
          
    # > Edit-----Not having next if block caused your error
    #             Add last contagem  if it exists
    if contagem:
        saida.append((elemento[index-1], len(contagem)) )
        
    return saida

compress('1000500011')

Output

[('1', 1), ('0', 3), ('5', 1), ('0', 3), ('1', 2)]

Code Refactoring

Code can be simplified to the following:

def compress(elements):
    stack = []
    for ele in elements:
        if not stack or stack[-1][0] != ele:
            stack.append((ele, 1)) 
        else:
            stack[-1] = (stack[-1][0], stack[-1][1]   1)
            
    return stack