Parsing a string 'Ba1 Cl2 O8' to get two separate tuples of ints and strings

I have a string containing a chemical formula: 'Ba1 Cl2 O8'. I need to parse the string into two separate tuples like the following:

comp = ('Ba','Cl','O')
count = (1,2,8)

The code needs to work for a formula of arbitrary length, and for counts of more than one digit, e.g. 'H2 O1', 'Ba11 Sn7 O16'.

CodePudding user response：

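I would use regex for that:

import re

text = 'Ba2 Sn1 O16'
# Element symbols: one uppercase letter followed by optional lowercase letters
comp = re.findall(r"[A-Z][a-z]*", text)
# Counts: runs of one or more digits
nums = re.findall(r"[0-9]+", text)
count = [int(x) for x in nums]
print(tuple(comp))
print(tuple(count))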

output:

('Ba', 'Sn', 'O')
(2, 1, 16)
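
A possible variant, offered only as a sketch: capturing the symbol and the count in a single pattern keeps each pair together, so a token with a missing count cannot shift the two tuples out of step. The function name parse_formula is hypothetical, and the sketch assumes every symbol is immediately followed by its digits.

```python
import re

def parse_formula(text):
    # Hypothetical helper: each match is a (symbol, digits) pair, e.g. ('Ba', '11')
    pairs = re.findall(r"([A-Z][a-z]*)(\d+)", text)
    if not pairs:
        return (), ()
    # zip(*pairs) transposes the list of pairs into two tuples
    comp, nums = zip(*pairs)
    return comp, tuple(int(n) for n in nums)

comp, count = parse_formula('Ba11 Sn7 O16')
print(comp)
print(count)
```

A symbol without a count (for example a bare 'O') is silently skipped by this pattern, which may or may not be what you want.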

CodePudding user response：

Given this example, 'H2 O1', 'Ba11 Sn7 O16', you have several layers of the onion to peel. I am assuming it is read as a single line of text. Watch out for trailing newlines; use rstrip() to remove them.

The items are separated by commas. Use line.split(',') and strip() to get rid of blanks. This will give you a list of quoted items. Iterating over this list, you have to remove the quotes; a slice of each item, item[1:-1], does this. This step is optional, but leaving it out would make later steps more complex.

Now you have to split on the spaces between elements.

Finally, you have a list of elements, each following the pattern of 1 or 2 letters followed by a number. Separating the letters from the number can be done with the regular expression functions in the re module. Regular expressions are a topic of study on their own.

The alternative at this step is to process the element character by character. Simple concept but could be messy code.
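
A minimal sketch of that character-by-character approach, assuming each token is letters followed by ASCII digits and nothing else. The function name split_element is hypothetical.

```python
def split_element(token):
    # Hypothetical helper: walk the token once, collecting letters then digits
    symbol = ''
    digits = ''
    for ch in token:
        if ch in '0123456789':
            digits += ch
        elif ch.isalpha() and not digits:
            # Letters are only accepted before the first digit
            symbol += ch
        else:
            raise ValueError(f"unexpected character {ch!r} in {token!r}")
    if not symbol or not digits:
        raise ValueError(f"expected letters followed by digits, got {token!r}")
    return symbol, int(digits)

print(split_element('Ba11'))
print(split_element('O8'))
```

Most of the code is error handling, which is where the messiness tends to come from.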

Text manipulation and parsing can be messy because you have to account for imperfect inputs. You have to pay attention to details such as how many spaces are entered into the string and whether a symbol is used more than once.
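
As a rough illustration of those two details: str.split() with no argument collapses runs of whitespace, and collections.Counter can flag a symbol that appears more than once. The function name find_repeated is hypothetical, and what to do with a repeated symbol (reject it, or add the counts) is left open here.

```python
import re
from collections import Counter

def find_repeated(text):
    # Hypothetical helper: return the symbols that occur more than once
    symbols = re.findall(r"[A-Z][a-z]*", text)
    return [s for s, n in Counter(symbols).items() if n > 1]

# split() without arguments tolerates any number of spaces between tokens
print('Ba1   Cl2  O8'.split())
print(find_repeated('C2 H5 O1 H1'))
```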

If this is not obvious, there are 3 levels of nesting in these loops: (1) split on commas, (2) remove quotes and split on spaces, (3) split on the pattern. It would be good to perfect each level before you go on to the inner ones.
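
Put together, the three levels might look like the sketch below. It assumes the input is one line of comma-separated, quoted formulas, and the function name parse_line is hypothetical. It uses strip() on the quote characters instead of a slice, so unquoted items also pass through.

```python
import re

def parse_line(line):
    # Hypothetical helper: returns one (comp, count) pair per formula on the line
    results = []
    # Level 1: drop the trailing newline, then split on commas
    for item in line.rstrip().split(','):
        # Level 2: trim blanks, remove surrounding quotes, split on whitespace
        formula = item.strip().strip("'\"")
        comp, count = [], []
        for token in formula.split():
            # Level 3: split each token into its symbol and its count
            m = re.fullmatch(r"([A-Z][a-z]*)(\d+)", token)
            if m is None:
                raise ValueError(f"bad token: {token!r}")
            comp.append(m.group(1))
            count.append(int(m.group(2)))
        results.append((tuple(comp), tuple(count)))
    return results

for comp, count in parse_line("'H2 O1', 'Ba11 Sn7 O16'\n"):
    print(comp, count)
```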

Ohad Sharet's answer shows a regular expression that would handle the 2 inner loops.