Retrieve split values in an XML file

I have an XML file like this:

data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''

I want to retrieve all length values. My code for doing this:

from bs4 import BeautifulSoup

xml_file = BeautifulSoup(data, 'lxml')
pdbs_xml = xml_file.find_all('dbreference', {'type': 'PDB'})
if len(pdbs_xml) != 0:
    for item in pdbs_xml:
        if item.find('property'):
            id_ = item['id']
            chains = item.find('property', {'type': 'chains'})
            chains2 = chains['value']
            if chains2.find(",")!= -1:
                count = chains2.count(',')
                if count >= 2:
                    chains = chains['value'].split('=')[count]
                    chains = chains.split(',')[0]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
                else:
                    chains = chains['value'].split('=')[2]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
            else:
                chains = chains['value'].split('=')[1]
                first_aa = chains.split('-')[0]
                last_aa = chains.split('-')[1]
                size_pdb = int(last_aa) - int(first_aa)

As you can see, some values are split into several ranges. In theory, I could write a branch for every possible case (I know my code does not do exactly that right now), but there must be a better way to achieve this. Any suggestions are welcome.

CodePudding user response：

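Find all @id and @value attributes that meet the conditions with a single XPath, then process the @value entries, which sit at every second position in the result list (the odd zero-based indexes, each one right after its @id). The math is done directly on @value with eval: a range such as 1168-1203 evaluates to -35, so the result is multiplied by -1. No need to split and swap.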

XPath part that selects @id:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id

XPath part that selects @value:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value

from lxml import etree
tree = etree.parse('test.xml')

steps = tree.xpath('//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id | //dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value')

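for i in range(len(steps)):
    # @value entries sit at odd zero-based indexes; the matching @id is at i-1
    if (i%2) != 0:
        items = steps[i].split(',')
        s=0
        for item in items:
            values = item.split('=')
            # "910-988" evaluates to -78, so flip the sign and accumulate
            # (eval is only acceptable here because the input file is trusted)
            s += eval(values[1])*(-1)

        print(steps[i-1],s)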

Result:

6LVN 35
6LXT 122
6LXV 616
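
The arithmetic in the loop above can also be done without eval, so that no text from the file is ever executed. This is only a minimal sketch, assuming every segment looks like chains=start-end as in the sample data; the helper name total_span is hypothetical and not part of the answer.

```python
def total_span(value):
    """Sum (end - start) over every 'chains=start-end' segment in value."""
    total = 0
    for segment in value.split(','):
        # Keep only the 'start-end' text that follows the '=' sign.
        _, _, span = segment.partition('=')
        start, end = span.strip().split('-')
        total += int(end) - int(start)
    return total


print(total_span("A/B/C/D=1168-1203"))                           # 35
print(total_span("A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"))  # 122
```

Inside the loop, s = total_span(steps[i]) would then stand in for the split-and-eval lines.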

CodePudding user response：

Here you go [Note that the code below does not require any external library]

import xml.etree.ElementTree as ET
from collections import defaultdict

data = '''<r><dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference></r>'''

sizes = defaultdict(int)
root = ET.fromstring(data)
for ref in root.findall('.//dbReference'):
    pdb = ref.attrib['id']
    chains = ref.find('property[@type="chains"]')
    value = chains.attrib['value']
    parts = value.split(',')
    for part in parts:
        left,right = part.split('=')
        _left,_right = right.split('-')
        sizes[pdb] += int(_right) - int(_left)

print(sizes)

Output

defaultdict(<class 'int'>, {'6LVN': 35, '6LXT': 122, '6LXV': 616})
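
The loop above assumes every dbReference carries a chains property; when one is missing, ref.find returns None and the attrib lookup fails. The following is a hedged sketch of the same standard-library approach with a guard for that case. The function name pdb_sizes and the 0XXX entry in the sample are hypothetical, added only to show the guard.

```python
import xml.etree.ElementTree as ET

# Sample input; the 0XXX entry is hypothetical and has no chains property.
sample = '''<r>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="0XXX">
<property type="method" value="X-ray"/>
</dbReference>
</r>'''


def pdb_sizes(xml_text):
    """Map each PDB id to the summed (end - start) of its chain ranges."""
    sizes = {}
    root = ET.fromstring(xml_text)
    for ref in root.findall('.//dbReference[@type="PDB"]'):
        chains = ref.find('property[@type="chains"]')
        if chains is None:
            # No chains property: skip this entry instead of raising.
            continue
        total = 0
        for part in chains.get('value', '').split(','):
            _, _, span = part.partition('=')
            if not span:
                # Segment without '=start-end' (for example an empty value).
                continue
            start, end = span.split('-')
            total += int(end) - int(start)
        sizes[ref.get('id')] = total
    return sizes


print(pdb_sizes(sample))  # {'6LXT': 122}
```

Entries without a chains property are simply left out of the result; whether to skip them or record a size of 0 depends on what the caller expects.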

CodePudding user response：

This uses bs4 to find the values, a regex to extract the intervals, and built-in functions to sum the differences of each interval.

data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''

from bs4 import BeautifulSoup
import re

xml_file = BeautifulSoup(data, 'lxml')

output = {}
for tag in xml_file.find_all(type="chains", value=True):
    interval = re.findall(r'([0-9]+-[0-9]+)', tag['value'])
    output[tag.parent['id']] = sum(map(lambda p: abs(int(p[1])-int(p[0])), (map(lambda p: p.split('-'), interval))))

print(output)

Output

{'6LVN': 35, '6LXT': 122, '6LXV': 616}