Finding a random string in each line using regex in Python

Input file, one entry per line:

some text here 45 [check {xyz}]
more text 12 [check {pqr[25]}]
some more text 56 [check {efg[4]}]
still more text 78 [check { jkl[12]}]
more text 123 [check abc]
more text 34 ghi

Trying to extract just the following:

xyz
pqr[25]
efg[4]
jkl[12]
abc
ghi

Some lines have curly braces, some don't, some have extra spaces, and so on. I've been piling up if/else statements and would like a cleaner way of doing this with a regex. The value to extract is always at the end of the line and, in every case except the last one, is preceded by [check.

CodePudding user response：

You can try this:

import re
s = """some text here 45 [check {xyz}]
more text 12 [check {pqr[25]}]
some more text 56 [check {efg[4]}]
still more text 78 [check { jkl[12]}]
more text 123 [check abc]
more text 34 ghi
"""

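# After "check ", take either a braced value or a run of non-bracket characters;
# otherwise fall back to the word at the very end of the string.
regex = r"(?<=check )(?:{[^{}]+}|[^\[\]]+)|\w+$"

print(re.findall(regex, s))
# ['{xyz}', '{pqr[25]}', '{efg[4]}', '{ jkl[12]}', 'abc', 'ghi']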

All that remains is to strip the braces and spaces from each match.

Test regex here: https://regex101.com/r/gczjQo/1

Test Python here: https://ideone.com/j8Qfby

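EDIT: If the input is multiline, use the re.M flag so that $ matches at the end of every line:

regex = re.compile(r"(?<=check )(?:{[^{}]+}|[^\[\]]+)|\w+$", re.M)
matches = regex.findall(s)
# strip leading/trailing braces and spaces from each match
print([x.strip('{} ') for x in matches])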
# ['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc', 'ghi']
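
To see how the pattern is put together, here is a sketch of the same expression spelled out with re.VERBOSE. The names SAMPLE, PATTERN and extract_values are made up for illustration, and the + quantifiers are assumed to be what the pattern intends. Note that in verbose mode the space after "check" has to be written as [ ], because bare whitespace in the pattern is ignored.

```python
import re

SAMPLE = """some text here 45 [check {xyz}]
more text 12 [check {pqr[25]}]
some more text 56 [check {efg[4]}]
still more text 78 [check { jkl[12]}]
more text 123 [check abc]
more text 34 ghi
"""

PATTERN = re.compile(
    r"""
    (?<=check[ ])     # only match right after the literal text "check "
    (?:
        \{[^{}]+\}    # a braced value, braces included
      |
        [^\[\]]+      # or a bare value: everything up to the next bracket
    )
  |
    \w+$              # fallback for lines without "check": the last word
    """,
    re.MULTILINE | re.VERBOSE,
)


def extract_values(text):
    """Return every matched value with surrounding braces and spaces removed."""
    return [match.strip("{} ") for match in PATTERN.findall(text)]


print(extract_values(SAMPLE))
```

Because the pattern contains no capturing groups, findall returns the full text of each match, which is why the strip step is still needed.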

CodePudding user response：

Modifying @anotherGatsby's regular expression pattern very slightly, so that abc could also be, say, abc[32], we can do the following ("blankpaper.txt" holds the contents of your file).

from re import search, sub

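# slightly modified @anotherGatsby's pattern:
# a bare value may now be followed by an optional [digits] suffix
pattern = r'(?<=check )(?:{[^{}]+}|[^\[\]]+(\[\d+\])?)|\w+$'

some_list = []
with open("blankpaper.txt") as f:
    # read file line-by-line
    for line in f:
        # search() returns None when a line has no match, so this assumes every line has one;
        # [0] is the whole match, from which braces and spaces are removed
        some_list.append(sub(r'[{} ]', '', search(pattern, line)[0]))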

print(some_list)

Output

['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc', 'ghi']

If abc were replaced by abc[32] above, the output would be

['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc[32]', 'ghi']

The above extracts the desired elements line-by-line.
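
One caveat of the line-by-line version: search returns None for a line with no match (a blank line, for example), and indexing None raises a TypeError. Below is a minimal sketch of a guarded variant, assuming such lines should simply be skipped. The helper name extract_from_lines is hypothetical; it accepts any iterable of lines, such as an open file.

```python
import re

# Same pattern as above, compiled once; braces escaped for readability.
PATTERN = re.compile(r"(?<=check )(?:\{[^{}]+\}|[^\[\]]+(\[\d+\])?)|\w+$")


def extract_from_lines(lines):
    """Return the cleaned value of each line, skipping lines with no match."""
    values = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            # nothing to extract on this line (assumption: skip it silently)
            continue
        # match[0] is the whole match; drop braces and spaces
        values.append(re.sub(r"[{} ]", "", match[0]))
    return values


demo_lines = [
    "more text 123 [check abc[32]]\n",
    "\n",
    "more text 34 ghi\n",
    "still more text 78 [check { jkl[12]}]\n",
]
print(extract_from_lines(demo_lines))
```

With a file, the call would be extract_from_lines(f) inside the with block.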

Instead of going line by line, we can process the whole file at once (with identical output):

import re

# slightly modified @anotherGatsby's pattern
pattern = r'(?<=check )(?:{[^{}]+}|[^\[\]]+(\[\d+\])?)|\w+$'

with open("blankpaper.txt") as f:
    # read entire file
    text = f.read()

# re.M lets $ match at the end of every line, not only at the end of the file
matches = re.finditer(pattern, text, re.M)
# finditer always returns an iterator (never None), so no emptiness check is needed;
# if nothing matches, the list is simply empty
some_list = [re.sub(r'[{} ]', '', m.group()) for m in matches]

print(some_list)

Hope this helps. If you have any questions or concerns, please let me know.