I have a string, produced by a machine learning algorithm, that generally spans multiple lines. At the beginning and at the end there can be lines containing nothing but whitespace, and in between there should be 2 lines, each containing a word followed by some numbers and (sometimes) other characters.
Something like this:
first_word 3 5 7 @ 4
second_word 4 5 67| 5 [
I need to extract the 2 words and the numeric characters.
I can eliminate the empty lines by doing something like:
lines_list = initial_string.split("\n")
for line in lines_list:
    if len(line) > 0 and not line.isspace():
        print(line)
but now I was wondering:
- if there is a more robust, general way
- how to parse each of the remaining 2 central lines, by extracting the words and digits (and discard the other characters mixed in between the digits, if there are any)
I imagine regular expressions could be useful, but I have never really used them, so I'm struggling a bit at the moment.
CodePudding user response:
I would use re.findall here:
import re
inp = '''first_word 3 5 7 @ 4
second_word 4 5 67| 5 ['''
matches = re.findall(r'\w+', inp)  # \w+ matches runs of letters, digits and underscores
print(matches) # ['first_word', '3', '5', '7', '4', 'second_word', '4', '5', '67', '5']
If you want to process each line separately, split the input on newlines and apply the same approach to each line:
inp = '''first_word 3 5 7 @ 4
second_word 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    matches = re.findall(r'\w+', line)
    print(matches)
This prints:
['first_word', '3', '5', '7', '4']
['second_word', '4', '5', '67', '5']
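To also skip the blank lines and keep the word apart from the numbers, something along these lines might work. This is only a sketch: `parse_lines` and `sample` are names made up for illustration, and it assumes the word is always the first whitespace-separated token on a line and that only runs of digits after it matter.

```python
import re

def parse_lines(text):
    """Return a (word, numbers) tuple for every non-blank line of text."""
    parsed = []
    for line in text.splitlines():
        # split() with no separator ignores leading/trailing whitespace,
        # so empty and whitespace-only lines yield an empty list
        tokens = line.split(maxsplit=1)
        if not tokens:
            continue
        word = tokens[0]  # assumption: the word is the first token
        rest = tokens[1] if len(tokens) > 1 else ""
        # \d+ keeps only runs of digits and drops characters such as @, | or [
        numbers = [int(n) for n in re.findall(r"\d+", rest)]
        parsed.append((word, numbers))
    return parsed

# Hypothetical input with blank lines before and after the two data lines
sample = "\n   \nfirst_word 3 5 7 @ 4\nsecond_word 4 5 67| 5 [\n\n"
print(parse_lines(sample))
# [('first_word', [3, 5, 7, 4]), ('second_word', [4, 5, 67, 5])]
```

Using `\d+` for the part after the word, rather than `\w+`, means that stray letters mixed in between the digits are discarded as well.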