Input file, one entry per line:
some text here 45 [check {xyz}]
more text 12 [check {pqr[25]}]
some more text 56 [check {efg[4]}]
still more text 78 [check { jkl[12]}]
more text 123 [check abc]
more text 34 ghi
I'm trying to extract just the following:
xyz
pqr[25]
efg[4]
jkl[12]
abc
ghi
The lines vary: some have curly braces, some don't, some have extra spaces, and so on. I've been piling up if/else statements and would like a cleaner way to do this with a regex. The information to extract is always at the end of the line and is preceded by [check in every case except the last one.
CodePudding user response:
You can try this:
import re

s = """some text here 45 [check {xyz}]
more text 12 [check {pqr[25]}]
some more text 56 [check {efg[4]}]
still more text 78 [check { jkl[12]}]
more text 123 [check abc]
more text 34 ghi
"""
# After "check ", match either {...} or a run of non-bracket characters;
# otherwise fall back to the last word before the end of the string.
regex = r"(?<=check )(?:{[^{}]+}|[^\[\]]+)|\w+$"
print(re.findall(regex, s))
# ['{xyz}', '{pqr[25]}', '{efg[4]}', '{ jkl[12]}', 'abc', 'ghi']
All you have to do now is clean the output.
Test regex here: https://regex101.com/r/gczjQo/1
Test python here: https://ideone.com/j8Qfby
EDIT:
If the input is multiline, you can use the re.M flag:
regex = re.compile(r"(?<=check )(?:{[^{}]+}|[^\[\]]+)|\w+$", re.M)
l = regex.findall(s)
# strip the surrounding braces and spaces from each match
print([x.strip('{} ') for x in l])
# ['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc', 'ghi']
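To see what re.M changes for the \w+$ branch, here is a minimal sketch on a made-up two-line string (the sample text is hypothetical, not taken from the question). Without the flag, $ only matches at the very end of the string (or just before a final newline); with it, $ also matches at the end of every line.

```python
import re

# Hypothetical two-line sample, just to illustrate the flag.
sample = "first line foo\nsecond line bar\n"

# Without re.M, $ anchors only at the end of the whole string.
without_flag = re.findall(r"\w+$", sample)

# With re.M, $ anchors at the end of each line.
with_flag = re.findall(r"\w+$", sample, re.M)

print(without_flag)  # ['bar']
print(with_flag)     # ['foo', 'bar']
```

So if a line without [check could appear anywhere other than last, the flag is what lets the fallback branch still pick up its final word.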
CodePudding user response:
Modifying @anotherGatsby's regular expression pattern very slightly (to allow abc to potentially be, say, abc[32]), we can do the following ("blankpaper.txt" contains the contents of your file).
from re import search, sub

# slightly modified @anotherGatsby's pattern
pattern = r'(?<=check )(?:{[^{}]+}|[^\[\]]+(\[\d+\])?)|\w+$'
some_list = []
with open("blankpaper.txt") as f:
    # read file line-by-line
    for line in f:
        m = search(pattern, line)
        # skip lines with no match (e.g. blank lines)
        if m:
            # drop braces and spaces from the matched text
            some_list.append(sub(r'[{} ]', '', m[0]))
print(some_list)
Output
['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc', 'ghi']
If abc were replaced by abc[32] above, the output would be
['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc[32]', 'ghi']
The above extracts the desired elements line-by-line.
Instead of going line by line, we can read the whole file at once (with identical output):
import re

# slightly modified @anotherGatsby's pattern
pattern = r'(?<=check )(?:{[^{}]+}|[^\[\]]+(\[\d+\])?)|\w+$'
with open("blankpaper.txt") as f:
    # read entire file
    text = f.read()
# re.M lets $ match at the end of every line, not just the end of the text.
# finditer returns an iterator (always truthy), so no emptiness check is needed.
some_list = [re.sub(r'[{} ]', '', m.group()) for m in re.finditer(pattern, text, re.M)]
print(some_list)
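As an alternative sketch (not part of either pattern above): since the question says the wanted token always sits at the end of the line, one could anchor on the end of the line and capture the token directly, so no cleanup step is needed. The helper name extract_last_token and the pattern itself are assumptions for illustration; it expects tokens made of word characters with an optional [digits] suffix.

```python
import re

# Assumed token shape: word characters, optionally followed by [digits],
# then any mix of whitespace, '}' and ']' up to the end of the line.
TOKEN_AT_END = re.compile(r"(\w+(?:\[\d+\])?)[\s}\]]*$")


def extract_last_token(line):
    """Return the trailing token of a line, or None if there is none."""
    m = TOKEN_AT_END.search(line)
    return m.group(1) if m else None


lines = [
    "some text here 45 [check {xyz}]",
    "more text 12 [check {pqr[25]}]",
    "some more text 56 [check {efg[4]}]",
    "still more text 78 [check { jkl[12]}]",
    "more text 123 [check abc]",
    "more text 34 ghi",
]
tokens = [extract_last_token(line) for line in lines]
print(tokens)
# ['xyz', 'pqr[25]', 'efg[4]', 'jkl[12]', 'abc', 'ghi']
```

Note that this version does not look for "check" at all, so it returns the last token of any line. If that is too loose for your data, the lookbehind-based patterns above are the safer choice.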
Hope this helps. Any questions/concerns please let me know.