This is a follow-up to a previous question of mine. I have since identified the problem more clearly and would appreciate some further suggestions :)
I have a string, resulting from some machine learning algorithm, which generally has the following structure:
- at the beginning and at the end, there can be some lines that contain no characters other than whitespace;
- in between, there should be 2 lines, each containing a name (either only the surname, or name and surname, or the initial letter from the name plus the surname...), followed by some numbers and (sometimes) other characters mixed in between the numbers;
- one of the names is generally preceded by a special, non-alphanumeric character (>, >>, @, ...).
Something like this:
Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 [
I need to extract the 2 names and the numbers, and check whether one of the lines starts with the special character, so my output should be:
name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2 (anything indicating that it's the second line)
In the linked original question, it was suggested that I use:
import re

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # \w+ grabs every run of letters, digits and underscores
    matches = re.findall(r'\w+', line)
    print(matches)
which produces a result pretty close to what I want:
['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']
But I need the first two strings in the second line ('R', 'Moore') to be grouped together (basically, group together all the characters before the digits begin). It also does not detect the special character. Should I somehow fix this output, or can I tackle the problem in a different way altogether?
CodePudding user response:
I am not sure which characters you expect, want to keep or remove, but something like the following should work for the example:
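import re

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # first alternative: a name made of letters, dots and inner spaces;
    # second alternative: any other run of word characters (the numbers)
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+', line)
    print(matches)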
output:
['Connery', '3', '5', '7', '4']
['R. Moore', '4', '5', '67', '5']
NB. I included a-z (lower and upper) and dot, with optional spaces in the middle: [a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.], but you should adapt this to your real needs.
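As an illustration of adapting the character class, here is a minimal sketch that compares the pattern above with a variant that also accepts apostrophes and hyphens inside names. The EXTENDED pattern and the sample line are assumptions made up for this example; they are not part of the original answer or data.

```python
import re

# Name pattern from the answer above
BASIC = r"(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+"
# Hypothetical variant: apostrophe and hyphen are also allowed in names
EXTENDED = r"(?:[a-zA-Z.'-][a-zA-Z.'\-\s]+[a-zA-Z.'-])|\w+"

# Made-up sample line, only for illustration
line = "> O'Brien-Smith 12 7"

basic_tokens = re.findall(BASIC, line)
extended_tokens = re.findall(EXTENDED, line)

print(basic_tokens)     # ['O', 'Brien', 'Smith', '12', '7']
print(extended_tokens)  # ["O'Brien-Smith", '12', '7']
```

With the basic class the name is split at every character it does not know about, so any character that can legitimately appear in a name has to be listed explicitly.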
CodePudding user response:
This would also include the special characters (keep in mind that they are hardcoded, so you have to add any missing ones to the [>@] part of the regex):
# `lines` is the list of lines built in the question
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+', line)
    print(matches)
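Since the special characters now show up as ordinary tokens, they still have to be separated from the name and the numbers. A possible way to do that is sketched below; `split_tokens` is a hypothetical helper, and it assumes that a marker only counts when it is the first token on the line and that the name is the first token after it.

```python
import re

TOKEN = r"(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+"

def split_tokens(line):
    """Hypothetical helper: return (selected, name, digits) for one line."""
    tokens = re.findall(TOKEN, line)
    # Assumption: the line is selected only if it *starts* with a marker
    selected = bool(tokens) and tokens[0][0] in ">@"
    if selected:
        tokens = tokens[1:]
    name = tokens[0] if tokens else None
    # Markers in the middle of the line (like the stray '@') are dropped here
    digits = [int(t) for t in tokens[1:] if t.isdecimal()]
    return selected, name, digits

first = split_tokens("Connery 3 5 7 @ 4")
second = split_tokens(">> R. Moore 4 5 67| 5 [")
print(first)   # (False, 'Connery', [3, 5, 7, 4])
print(second)  # (True, 'R. Moore', [4, 5, 67, 5])
```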
CodePudding user response:
This is better done in several steps.
import re

# `inp` is the string from the question
# get the whitespace at start and end out
lines = inp.strip().split('\n')
for line in lines:
    # for each line, identify the selection mark, the name, and the mess at the end
    # assuming names can't have numbers in them
    match = re.match(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$', line.strip())
    if match:
        selected_raw, name, numbers_raw = match.groups()
        # now parse the unprocessed bits
        selected = selected_raw is not None
        numbers = re.findall(r'\d+', numbers_raw)
        print(selected, name, numbers)
# output
False Connery ['3', '5', '7', '4']
True R. Moore ['4', '5', '67', '5']
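To get closer to the output format asked for in the question (names, lists of integers, and the number of the selected line), the same steps could be wrapped in a function. This is only a sketch: `parse_block` is a hypothetical name, the surrounding blank lines in `inp` are added to mimic the empty lines mentioned in the question, and it assumes that at most one line carries a marker.

```python
import re

# Same pattern as above: optional marker, name without digits, trailing mess
LINE_RE = re.compile(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$')

def parse_block(text):
    """Hypothetical helper: return (names, digits, selected_line)."""
    names, digits, selected_line = [], [], None
    for row in text.strip().split('\n'):
        match = LINE_RE.match(row.strip())
        if not match:
            continue  # skip blank or unparseable lines
        marker, name, numbers_raw = match.groups()
        names.append(name)
        digits.append([int(n) for n in re.findall(r'\d+', numbers_raw)])
        # 1-based position of the first parsed line that starts with a marker
        if marker is not None and selected_line is None:
            selected_line = len(names)
    return names, digits, selected_line

inp = '''

Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 [

'''
names, digits, selected_line = parse_block(inp)
print(names)          # ['Connery', 'R. Moore']
print(digits)         # [[3, 5, 7, 4], [4, 5, 67, 5]]
print(selected_line)  # 2
```

Returning lists rather than separate name_01 / name_02 variables keeps the function usable if the number of lines ever changes; names[0] and digits[0] correspond to name_01 and digits_01.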