How to parse this kind of long string efficiently in Python?

I have a long string like this:

[left-ctrl]bhbhbhblbhbhblbhblblbhbl[left-ctrl][left-ctrl]blbhblbhblbbjbjbjblblbhblbhbhblbk[left-ctrl][left-ctrl]bhblblbjbjbkbjbkbjbkbkbh[left-ctrl]kkkkkk[left-cmd][tab][left-cmd][del]su[del][del]cut [del][left-shift];[left-shift][del]s[left-shift];

The actual string is much longer (160,000 characters). I want to treat each [...] as a single character, like b, h, and so on. How can I do that?

Edited:

The problem is that single [ and ] characters can also appear on their own, e.g. [[[[[ or ]]]]]]. My current idea is to use some library to find the positions of control sequences such as [left-ctrl], [cmd], ... in advance, then loop over the string with a cursor and take special care whenever the cursor reaches one of these positions. But this might take multiple passes to find all the special positions, so I'm wondering whether there is a simpler way to do this efficiently. As for using a library: I'm just too lazy to implement the KMP algorithm myself.

Note that a single ] without a matching [ can also appear, e.g. [left-ctrl]]hbh[left-ctrl].

CodePudding user response：

I'm not sure I completely understand, but you could try something like this:

import re

re_content = re.compile(r'\[+(.*?)\]+|([^\[\]]+)')

string = ...  # The string
for match in re_content.finditer(string):
    in_brackets, not_in_brackets = match.groups()
    if in_brackets:  # Do the special stuff
        print(f'In brackets: {in_brackets}')
    if not_in_brackets:  # Do the normal stuff
        print(f'Not in brackets: {not_in_brackets}')

Output for

string = 'abc[left-ctrl]lbhbl[left-ctrl]]bhblbk[left-ctrl]bkbh[left-ctrl]kkkkkk[left-cmd][tab][del]su[del][del]cut [del][left-shift];[left-shift][del]s[left-shift];[[[[[del]]]]]]'

Not in brackets: abc
In brackets: left-ctrl
Not in brackets: lbhbl
In brackets: left-ctrl
Not in brackets: bhblbk
In brackets: left-ctrl
Not in brackets: bkbh
In brackets: left-ctrl
Not in brackets: kkkkkk
In brackets: left-cmd
In brackets: tab
In brackets: del
Not in brackets: su
In brackets: del
In brackets: del
Not in brackets: cut 
In brackets: del
In brackets: left-shift
Not in brackets: ;
In brackets: left-shift
In brackets: del
Not in brackets: s
In brackets: left-shift
Not in brackets: ;
In brackets: del

CodePudding user response：

Option 1: Manual parsing

You could define an iterator function that accumulates the bracketed characters and yields the special keys as keywords when the matching brackets are found:

def keyCodes(iKeys):
    special = ""
    for c in iKeys:
        if c == "[":                        # start of special char
            if special: yield from special  # flush prev. individual chars
            special = c
        elif c == "]" and special:          # closing bracket
            special += c
            if len(special)>2: yield special       # special code
            else:              yield from special  # empty [], not a code
            special = ""                           # reset
        elif special:
            special += c                           # accumulate special code
        else:
            yield c                                # return simple chars
    yield from special                             # flush trailing chars

Output:

keys = "[left-ctrl]bhbhbhblbhbhblbhblblbhbl[left-ctrl][left-ctrl]blbhblbhblbbjbjbjblblbhblbhbhblbk[left-ctrl][left-ctrl]bhblblbjbjbkbjbkbjbkbkbh[left-ctrl]kkkkkk[left-cmd][tab][left-cmd][del]su[del][del]cut [del][left-shift];[left-shift][del]s[left-shift];"
for code in keyCodes(keys):
    print(code)

[left-ctrl]
b
h
b
h
b
h
b
l
b
h
b
h
b
l
b
h
b
l
b
l
b
h
b
l
[left-ctrl]
[left-ctrl]
b
l
b
h
b
l
b
h
b
l
b
b
j
b
j
b
j
b
l
b
l
b
h
b
l
b
h
b
h
b
l
b
k
[left-ctrl]
[left-ctrl]
b
h
b
l
b
l
b
j
b
j
b
k
b
j
b
k
b
j
b
k
b
k
b
h
[left-ctrl]
k
k
k
k
k
k
[left-cmd]
[tab]
[left-cmd]
[del]
s
u
[del]
[del]
c
u
t
 
[del]
[left-shift]
;
[left-shift]
[del]
s
[left-shift]
;

Note that the condition (if len(special)>2), which decides whether a bracketed sequence is output as one special code or as individual characters, should probably be replaced by a check against a list of valid special key codes (e.g. if special in specialCodes). Otherwise some key patterns, such as [xxx] or [@], will be returned as special codes when they are not.
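As a rough sketch of that check, the parser could look like the following. The names SPECIAL_CODES and key_codes_checked are made up for this illustration, and the set of codes is only an assumption based on the sample string; replace it with the codes that really occur in your data.

```python
from typing import Iterable, Iterator

# Assumed set of valid codes (taken from the sample string, not a complete list).
SPECIAL_CODES = {"[tab]", "[left-ctrl]", "[left-shift]", "[del]", "[left-cmd]"}

def key_codes_checked(keys: Iterable[str], valid=SPECIAL_CODES) -> Iterator[str]:
    """Yield known special codes as one token and everything else char by char."""
    pending = ""                      # text collected since the last "["
    for c in keys:
        if c == "[":
            yield from pending        # the previous "[" was never closed
            pending = c
        elif c == "]" and pending:
            pending += c
            if pending in valid:
                yield pending         # a known special code
            else:
                yield from pending    # e.g. [xxx] or []: plain characters
            pending = ""
        elif pending:
            pending += c              # still inside a possible code
        else:
            yield c                   # ordinary keystroke (or a stray "]")
    yield from pending                # flush an unclosed "[" at the end

print(list(key_codes_checked("[del]a[xxx]]b[")))
# ['[del]', 'a', '[', 'x', 'x', 'x', ']', ']', 'b', '[']
```

The set lookup keeps the check cheap, so this should not change the overall single-pass behaviour of the parser.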

Option 2: General regular expression pattern

If you don't mind using a library, the same result can be obtained using a regular expression:

import re

for code in re.findall(r'\[[^\]\[]+\]|.',keys):
    print(code)

The expression has two alternatives separated by the pipe (|) operator. At each position they are tried from left to right:

\[[^\]\[]+\] : at least one character between brackets (excluding other brackets)
. : any single character

Like the previous solution, this may return invalid special codes for key sequences such as [abc].
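To see how the general pattern deals with the stray brackets mentioned in the question, here is a small sketch. The names TOKEN and tokenize are only for illustration. The re.DOTALL flag is my addition: without it, the dot does not match newlines, so findall would silently skip them if the real string contains any.

```python
import re

# Same pattern as above; re.DOTALL (an addition) lets "." match "\n" as well.
TOKEN = re.compile(r'\[[^\]\[]+\]|.', re.DOTALL)

def tokenize(keys):
    """Return bracketed sequences as one token and all other characters singly."""
    return TOKEN.findall(keys)

print(tokenize("[left-ctrl]]hbh[left-ctrl]"))
# ['[left-ctrl]', ']', 'h', 'b', 'h', '[left-ctrl]']
print(tokenize("[[del]]"))
# ['[', '[del]', ']']
print(tokenize("a\nb"))
# ['a', '\n', 'b']
```

A bracket that does not belong to a complete [...] group simply falls through to the single-character alternative.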

Option 3: Specific regular expression pattern

If you do have a list of the valid special codes, you can build a regular expression to extract them specifically:

import re

specialCodes = ['[tab]', '[left-ctrl]', '[left-shift]', 
                '[del]', '[left-cmd]']
# named codePattern so it does not shadow the keyCodes() function from Option 1
codePattern = re.compile("|".join(re.escape(c) for c in specialCodes) + "|.")

for code in codePattern.findall(keys):
    print(code)

The pattern joins the escaped special codes with the pipe (|) operator so that they are tried first, and ends with a catch-all single character (.) for normal keystrokes.
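If you would rather not build a full list of tokens for a very long string, the same idea can be wrapped in a generator. This is only a sketch: compile_tokenizer and iter_tokens are hypothetical helper names, the list of codes is assumed from the sample, and both the longest-first ordering and re.DOTALL are defensive choices of mine rather than requirements.

```python
import re

def compile_tokenizer(special_codes):
    """Build a pattern that matches any known code, else any single character."""
    # Longest first is a defensive choice; these bracketed codes never
    # prefix one another, so the order does not matter for them.
    ordered = sorted(special_codes, key=len, reverse=True)
    parts = [re.escape(code) for code in ordered]
    parts.append(".")                      # catch-all for normal keystrokes
    return re.compile("|".join(parts), re.DOTALL)

def iter_tokens(pattern, keys):
    """Yield tokens one at a time instead of materialising a list."""
    for match in pattern.finditer(keys):
        yield match.group(0)

codes = ['[tab]', '[left-ctrl]', '[left-shift]', '[del]', '[left-cmd]']
pattern = compile_tokenizer(codes)
print(list(iter_tokens(pattern, "[tab]a[xxx][del]")))
# ['[tab]', 'a', '[', 'x', 'x', 'x', ']', '[del]']
```

Unlike the general pattern, an unknown sequence such as [xxx] comes out as individual characters here.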

Regular expressions can be slow on some inputs, so I would suggest timing these options on your own data if processing is time-sensitive.
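A minimal way to run such a comparison is the standard timeit module. The sketch below assumes the same list of codes as in Option 3 and uses a made-up sample of about 160,000 characters; the helper names (manual, bench) are illustrative only. The timings depend on your machine and data, so I am not quoting any numbers.

```python
import re
import timeit

# Assumed list of valid codes (same as in Option 3).
SPECIAL_CODES = ['[tab]', '[left-ctrl]', '[left-shift]', '[del]', '[left-cmd]']

GENERAL = re.compile(r'\[[^\]\[]+\]|.')
SPECIFIC = re.compile("|".join(re.escape(c) for c in SPECIAL_CODES) + "|.")

def manual(keys):
    """Option 1: character-by-character generator."""
    special = ""
    for c in keys:
        if c == "[":
            yield from special
            special = c
        elif c == "]" and special:
            special += c
            if len(special) > 2:
                yield special
            else:
                yield from special
            special = ""
        elif special:
            special += c
        else:
            yield c
    yield from special

def bench(keys, repeat=3):
    """Return the best wall-clock time in seconds for each option."""
    candidates = {
        "manual": lambda: list(manual(keys)),
        "general regex": lambda: GENERAL.findall(keys),
        "specific regex": lambda: SPECIFIC.findall(keys),
    }
    return {name: min(timeit.repeat(fn, number=1, repeat=repeat))
            for name, fn in candidates.items()}

if __name__ == "__main__":
    # Made-up sample, roughly the size mentioned in the question.
    sample = "[left-ctrl]bhbl[left-cmd][tab]su[del]cut [left-shift];" * 3000
    for name, seconds in bench(sample).items():
        print(f"{name}: {seconds:.4f} s")
```

On input that only contains valid codes and no newlines, all three options should produce the same tokens, so the comparison is like for like.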