Split string by indentation blocks in Python

I'm trying to split a string by indentation blocks for a save editor I'm making. Example: I would input:

"""
def foo():
    bar()
    oof()
def lol():
    foo()
"""

and it would output [["def foo():"], [" bar()"], [" oof()"], ["def lol():", "foo()"]].

Here's my code:

def splitByIndentation(data: str):
    indented = False
    prevLine = ''
    currentBlock = []
    output = []

    for line in data.split('\n'):
        if line.startswith('\t') or line.startswith('    '):
            if not indented:
                currentBlock = [prevLine]
                indented = True

            currentBlock.append(line)
        else:
            if indented:
                output.append(currentBlock)
            indented = False
        
        prevLine = line

    return output

if __name__ == '__main__':
    print(splitByIndentation("""
def foo():
    bar()
    oof()
def lol():
    foo()"""))

When I run it, it only outputs [['def foo():', '    bar()', '    oof()']].

CodePudding user response：

I'm not sure I understood your question correctly, but I guess you want output like this:

[['def foo():', '    bar()', '    oof()'], ['def lol():', '    foo()']]

I have shortened and corrected your code a bit.

def splitByIndentation(data : str):
    indented = False
    output = []
    prevLine = ''
    for line in data.split('\n'):
        if line.startswith('    ') or line.startswith('\t'):
            if not indented:
                output.append([prevLine])
                indented = True
            output[-1].append(line)
        else:
            prevLine = line
            indented = False
    return output

s = """
def foo():
    bar()
    oof()
def lol():
    foo()
"""
print(splitByIndentation(s))

The output looks like this:

[['def foo():', '    bar()', '    oof()'], ['def lol():', '    foo()']]

One problem with this solution shows up when the input contains an indented line with only whitespace, like this:

"""
def foo():
    bar()
    oof()
    
def lol():
    foo()
"""

The output would then look like this:

[['def foo():', '    bar()', '    oof()', '    '], ['def lol():', '    foo()']]

To fix this, remove all blank lines before iterating:

...
lines = [line for line in data.split('\n') if line.strip() != '']
for line in lines:
    ...

Of course, you can also remove the surrounding whitespace from each line by calling strip() before appending it to the block:

output[-1].append(line.strip())

The full function would look like this:

def splitByIndentation(data : str):
    indented = False
    output = []
    prevLine = ''
    # remove empty lines
    lines = [line for line in data.split('\n') if line.strip() != '']

    for line in lines:
        if line.startswith('    ') or line.startswith('\t'):
            if not indented:
                output.append([prevLine])
                indented = True
            output[-1].append(line.strip()) # append to last block and remove whitespace
        else:
            prevLine = line
            indented = False
    return output

s = """
def foo():
    bar()
    oof()
    
def lol():
    foo()
"""
print(splitByIndentation(s))

This gives you the output:

[['def foo():', 'bar()', 'oof()'], ['def lol():', 'foo()']]

CodePudding user response：

This may not be exactly what your use case needs, but I'm posting it in case it helps.

If I needed to access lines based on their indentation, I'd use a dictionary, mainly for easier lookup.

The key is the indentation level and the value is the list of lines at that level.

The code would be something like:

from collections import defaultdict

def splitByIndentation(data: str):
    output = defaultdict(list)
    for line in data.split('\n'):
        # Count leading whitespace only; assumes the indentation is always 4 spaces
        indentationCount = (len(line) - len(line.lstrip())) // 4
        output[indentationCount].append(line.strip())

    return dict(output)

if __name__ == '__main__':
    print(splitByIndentation(
"""def foo():
    bar()
    oof()
def lol():
    foo()
        #fofo()"""))

The output will look like:

{0: ['def foo():', 'def lol():'], 1: ['bar()', 'oof()', 'foo()'], 2: ['#fofo()']}
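
If the input might contain blank lines or tab indentation, a variation along these lines could be more forgiving. This is only a sketch: the name group_by_indent_level and the indent_width parameter are made up for illustration, and it assumes that one leading tab counts as one indentation level.

```python
from collections import defaultdict

def group_by_indent_level(data: str, indent_width: int = 4) -> dict:
    """Group stripped lines by indentation level (hypothetical helper)."""
    output = defaultdict(list)
    for line in data.split('\n'):
        if not line.strip():
            continue  # skip blank or whitespace-only lines
        # Assumption: a leading tab counts as one indentation level
        expanded = line.expandtabs(indent_width)
        leading = len(expanded) - len(expanded.lstrip())
        output[leading // indent_width].append(expanded.strip())
    return dict(output)

sample = "def foo():\n    bar()\n\toof()\n\ndef lol():\n    foo()\n        #fofo()"
result = group_by_indent_level(sample)
print(result)
```

For this sample the grouping matches the output shown above, even though one line is indented with a tab and one line is empty.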

CodePudding user response：

The problem with your code is that it appends currentBlock to output only when a non-indented line follows an indented block. For the last block to be appended, the input needs one final line for which line.startswith('\t') or line.startswith('    ') is False.

Your original string is actually correct:

"""
def foo():
    bar()
    oof()
def lol():
    foo()
"""

This is because the string ends with \n ('\ndef foo():\n    bar()\n    oof()\ndef lol():\n    foo()\n'), so data.split('\n')[-1] is ''. That would have given you the answer you are looking for. Unfortunately, this is not the string you actually pass to the function at the end.

The actual string passed is:

"""
def foo():
    bar()
    oof()
def lol():
    foo()"""

Now, this string ends with '    foo()'. Check:

"""
def foo():
    bar()
    oof()
def lol():
    foo()""".split('\n')[-1]
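
To see the difference side by side, here is a small comparison using two shortened strings (the variable names are only for illustration):

```python
with_newline = "def lol():\n    foo()\n"
without_newline = "def lol():\n    foo()"

# A trailing '\n' leaves an empty string as the last element of the split
last_with = with_newline.split('\n')[-1]
last_without = without_newline.split('\n')[-1]

print(repr(last_with))     # ''
print(repr(last_without))  # '    foo()'
```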

With this in mind, we can alter your function slightly so that it works either way:

def splitByIndentation(data: str):
    indented = False
    prevLine = ''
    currentBlock = []
    output = []

    lines = data.split('\n')
    # Make sure the input ends with an empty line so the last block gets appended
    if lines[-1].strip() != '':
        lines.append('')
    else:
        lines[-1] = lines[-1].strip()

    for line in lines:
        
        if line.startswith('\t') or line.startswith('    '):
            if not indented:
                currentBlock = [prevLine]
                indented = True

            currentBlock.append(line)
        else:
            if indented:
                output.append(currentBlock)
            indented = False
        
        prevLine = line

    return output

Let's give her a spin:

data1 = """
def foo():
    bar()
    oof()
def lol():
    foo()"""

data2 = """
def foo():
    bar()
    oof()
def lol():
    foo()
"""
    
data3 = """
def foo():
    bar()
    oof()
def lol():
    foo()
    """

print(splitByIndentation(data1))
# [['def foo():', '    bar()', '    oof()'], ['def lol():', '    foo()']]

print(splitByIndentation(data1) == splitByIndentation(data2) == splitByIndentation(data3))
# True

CodePudding user response：

Your code looks mostly correct, and you are populating currentBlock correctly. However, with your current conditions the last currentBlock is not appended to output when the input ends inside an indented block.

You can test this yourself by changing your input to:

"""
def foo():
    bar()
    oof()
def lol():
    foo()
def foo():
    bar()
    oof()"""

I have added one more block, yet the output still contains only the first two:

[['def foo():', '    bar()', '    oof()'], ['def lol():', '    foo()']]

To solve it, you can append currentBlock to output after the loop ends:

def splitByIndentation(data: str):
    indented = False
    prevLine = ''
    currentBlock = []
    output = []

    for line in data.split('\n'):
        if line.startswith('\t') or line.startswith('    '):
            if not indented:
                currentBlock = [prevLine]
                indented = True

            currentBlock.append(line)
        else:
            if indented:
                output.append(currentBlock)
            indented = False
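        prevLine = line
    # Append the last block only if the input ended while still indented;
    # otherwise it was already appended inside the loop
    if indented:
        output.append(currentBlock)
    return output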

This will append that remaining currentBlock.
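
One thing to watch with an append after the loop: if it runs unconditionally, an input that ends with a newline gets its last block appended twice, and an input with no indented lines yields an empty list entry. Below is a minimal sketch of a guarded version, checked against both kinds of input. The name split_blocks and the sample strings are made up for illustration.

```python
def split_blocks(data: str) -> list:
    """Hypothetical variant: append the last block only if one is still open."""
    output = []
    current_block = []
    prev_line = ''
    indented = False

    for line in data.split('\n'):
        if line.startswith(('\t', '    ')):
            if not indented:
                # The previous non-indented line is the block header
                current_block = [prev_line]
                indented = True
            current_block.append(line)
        else:
            if indented:
                output.append(current_block)
            indented = False
        prev_line = line

    # Append only when the input ended inside an indented block
    if indented:
        output.append(current_block)
    return output

no_newline = "def foo():\n    bar()\ndef lol():\n    foo()"
with_newline = no_newline + "\n"
print(split_blocks(no_newline) == split_blocks(with_newline))  # True
```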