How to get each element from a given molecular composition-CodePudding

Given a molecule of the form "FA0.85MA0.15Pb(I0.85Br0.15)3". I need a dictionary with the result ("FA":"0.85", "MA":"0.15", "Pb":"1.00", "I":"2.55", "Br":"0.45").

This is obtained by multiplying the numbers (integers and floats) inside a parantheses with an integer just outside it and then separating each of the pairs. It also includes giving the value to the elements that have no compositional value mentioned (such as Pb here) should be given the value "1.00".

How can I do it with Python?

I have a bunch of such molecules and I want to create a dataframe containing all these elements and their compositions for elemental analysis.

Note: There will not be any nested brackets. All the compositions will be similar in structure to the example that I have shown above, e.g. "Cs0.05(MA0.17FA0.83)0.95Pb(I0.83Br0.17)3", "CH3NH3PbI3", "(CH3NH3)3Bi2I9".

CodePudding user response：

with regex, we can find and split the items. here are two functions to do that, the first function finds the brackets with their coefficient and the second function finds the molecules with their coefficient from the string.

so remove the brackets from the string and find and add the molecules (without brackets). then do the same work for strings that are in brackets, then multiple the molecule's coefficient in bracket coefficient, and finally add these new molecules to total molecules.

import re

def find_molecules(mol):
    # find moleculars
    mols = re.findall(r'([a-zA-Z] )(\d \.?\d |\d )?', mol, re.I)
    # fix the non-number molecules
    for i in range(len(mols)) :
        mol = mols[i]
        if not mol[1] :
            mols[i] = (mol[0], '1')
    # convert to numeric
    mols = [(mol[0], float(mol[1])) for mol in mols]
    
    return mols

def find_brackets(mol):
    brks = []
    # find and split brackets
    for br in re.finditer(r'[\(](. )[\)](\d \.?\d |\d )', mol, re.I):
        # remove bracker from string
        mol = mol.replace(br[0], '')
        brks.append( (br[1], float(br[2])) )
        
    return brks, mol

def main():
    molecular = "FA0.85MA0.15Pb(I0.85Br0.15)3"
    data = []
    # find and split the the brackets
    brks, mol = find_brackets(molecular)
    # add the main molecules
    data.extend(find_molecules(mol))
    # add the molecules that are in brackets
    for br in brks:
        # get molecules in bracket
        mols = find_molecules(br[0])
        # multiple the coefficient
        mols = [(mol[0], br[1] * mol[1]) for mol in mols]
        # add the mols
        data.extend(mols)
    
    print(data)

main()

CodePudding user response：

The implementation first parses the string into tokens. A small-state machine deals with missing value token and brackets&multiplier.

import re

s = "FA0.85MA0.15Pb(I0.85Br0.15)3"
# t0, t1, token type states
t0 = None; token = ''; token_list = []
for c in s ' ':
    if re.match('^[a-zA-Z_] $',c): t1 = "N"
    elif re.match('^[0-9.] $',c):  t1 = "V"
    elif re.match('^[\(] $',c):    t1 = "O"
    elif re.match('^[\)] $',c):    t1 = "C"
    else: t1 = 'E'
        
    if not t1 == t0:
        # a token is complete
        if t0 == 'V': token = float(token)
            
        token_list.append([t0,token])
        
        if t0 == 'N' and not t1 == 'V':  # insert value token of 1.0
            token_list.append(['V',1.0])
        
        if t0 == 'V' and token_list[-2][0] == 'C':
            mult = token_list.pop()
            l = len(token_list)
            for i in range(0, l):
                if token_list[l-i-1][0] == 'O': break # muliplier scope terminates
                elif token_list[l-i-1][0] == 'V':
                    token_list[l-i-1][1] *= mult[1]
        t0 = t1; token = ''
      
    token = token   c
    
# clean list
token_list_final = []
for item in token_list[1:]:
    if item[0] == 'N' or item[0] == 'V':
        token_list_final.append(item)
token_list_final

The above gives cleaned up token list:

This can be further processed into a data frame using:

V = [ item[1] for item in token_list_final[1::2]]
N = [ item[1] for item in token_list_final[ ::2]]
df = pd.DataFrame([N, V]).T

Giving: