I have a string containing a chemical formula: 'Ba1 Cl2 O8'. I need to parse the string into two separate tuples like the following:
comp = ('Ba','Cl','O')
count = (1,2,8)
The code needs to work for any formula of arbitrary length, including counts of more than one digit, e.g. 'H2 O1', 'Ba11 Sn7 O16'.
CodePudding user response:
I would use regex for that
import re
text = 'Ba2 Sn1 O16'
comp = re.findall(r"[A-Z][a-z]*", text)
nums = re.findall(r"[0-9]+", text)
count = [int(x) for x in nums]
print(tuple(comp))
print(tuple(count))
output:
('Ba', 'Sn', 'O')
(2, 1, 16)
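A possible variation, offered as a sketch rather than a replacement: capturing the symbol and its count in a single pattern keeps the two aligned, because each match is one (symbol, digits) pair. The function name parse_formula is made up for illustration, and the pattern assumes every symbol is followed directly by its count, as in the question's examples.

```python
import re

def parse_formula(text):
    # Each match is a (symbol, digits) pair, so symbols and counts stay aligned.
    # Assumption: every symbol is immediately followed by at least one digit.
    pairs = re.findall(r"([A-Z][a-z]*)(\d+)", text)
    comp = tuple(symbol for symbol, _ in pairs)
    count = tuple(int(digits) for _, digits in pairs)
    return comp, count

comp, count = parse_formula('Ba11 Sn7 O16')
print(comp)   # ('Ba', 'Sn', 'O')
print(count)  # (11, 7, 16)
```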
CodePudding user response:
Given this example, 'H2 O1', 'Ba11 Sn7 O16', you have several layers of the onion to peel. Assuming this is read as a single line of text, watch out for a trailing newline; rstrip() removes it.
The items are separated by commas, so use line.split(',') and strip() to get rid of blanks. This gives you a list of quoted items. Iterating over this list, you have to remove the quotes; a slice such as item[1:-1] does that. This step is optional, but leaving it out would make later steps more complex.
Now you have to split on the spaces between elements.
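The outer layers described so far could look roughly like this. It is only a sketch: the function name split_formulas is hypothetical, and it assumes every item is wrapped in exactly one pair of quotes.

```python
def split_formulas(line):
    formulas = []
    # Layer 1: drop the trailing newline, then split on commas.
    for item in line.rstrip().split(','):
        item = item.strip()        # remove blanks around the item
        formula = item[1:-1]       # assumption: item is wrapped in one pair of quotes
        # Layer 2: split on whitespace between elements.
        formulas.append(formula.split())
    return formulas

print(split_formulas("'H2 O1', 'Ba11 Sn7 O16'\n"))
# [['H2', 'O1'], ['Ba11', 'Sn7', 'O16']]
```

Calling split() with no argument also absorbs runs of several spaces between elements.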
Finally, you have a list of elements, each following the pattern of one or two letters followed by a number. Splitting on that pattern can be done with the regular expression functions in the re module, which are a topic of study on their own.
The alternative at this step is to process each element character by character. The concept is simple, but the code could get messy.
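For comparison, a character-by-character version for a single element might look like the sketch below. The name split_token is hypothetical, and treating a missing count as 1 is an assumption that goes beyond the question, where every symbol carries a count.

```python
def split_token(token):
    # Walk forward until the first digit; everything before it is the symbol.
    i = 0
    while i < len(token) and not token[i].isdigit():
        i += 1
    symbol = token[:i]
    digits = token[i:]
    # Assumption: a symbol with no digits after it counts as 1.
    count = int(digits) if digits else 1
    return symbol, count

print(split_token('Ba11'))  # ('Ba', 11)
print(split_token('O'))     # ('O', 1)
```

If anything other than digits follows the symbol, int() raises ValueError, which is one way to surface a malformed element early.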
Text manipulation and parsing can be messy because you have to account for imperfect input. A lot of attention goes to details such as how many spaces were entered into the string and whether a symbol is used more than once.
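As one illustration of handling such input, the sketch below tolerates extra spaces and merges a symbol that appears more than once. Summing the counts of repeated symbols is an assumption about what the caller wants, and the name merge_repeats is made up for the example.

```python
import re

def merge_repeats(text):
    totals = {}
    # findall skips over any amount of whitespace between elements.
    for symbol, digits in re.findall(r"([A-Z][a-z]*)(\d+)", text):
        # Assumption: counts of a repeated symbol are added together.
        totals[symbol] = totals.get(symbol, 0) + int(digits)
    # Dicts keep insertion order, so symbols stay in first-seen order.
    return tuple(totals), tuple(totals.values())

print(merge_repeats('C2  H5 O1   H1'))
# (('C', 'H', 'O'), (2, 6, 1))
```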
In case it is not obvious, there are three levels of nesting in these loops: (1) split on commas, (2) remove the quotes and split on spaces, (3) split on the pattern. It would be good to perfect each level before you move on to the inner ones.
Ohad Sharet's answer shows a regular expression that would handle the two inner loops.