I have an XML file like this:
data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''
I want to retrieve the length of every range listed in the chains values. My code so far:
from bs4 import BeautifulSoup
xml_file = BeautifulSoup(data, 'lxml')
pdbs_xml = xml_file.find_all('dbreference', {'type': 'PDB'})
if len(pdbs_xml) != 0:
    for item in pdbs_xml:
        if item.find('property'):
            id_ = item['id']
            chains = item.find('property', {'type': 'chains'})
            chains2 = chains['value']
            if chains2.find(",") != -1:
                count = chains2.count(',')
                if count >= 2:
                    chains = chains['value'].split('=')[count]
                    chains = chains.split(',')[0]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
                else:
                    chains = chains['value'].split('=')[2]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
            else:
                chains = chains['value'].split('=')[1]
                first_aa = chains.split('-')[0]
                last_aa = chains.split('-')[1]
                size_pdb = int(last_aa) - int(first_aa)
As you can see, some values contain several comma-separated ranges. In theory, I could write a branch for every possible case (I know my code is not doing exactly that right now), but there must be a better way to achieve this. Any suggestion is welcome.
CodePudding user response:
Find all @id and @value attributes under the defined conditions in a single XPath, and process the @value entries, which sit at the even positions of the resulting list.
The math is done directly on @value with eval, multiplying by -1 since the result will be negative. No need to split and swap.
XPath part that gets @id:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id
XPath part that gets @value:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value
from lxml import etree
tree = etree.parse('test.xml')
steps = tree.xpath('//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id | //dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value')
for i in range(len(steps)):
    # @value entries appear at even positions (odd 0-based indexes)
    if (i % 2) != 0:
        items = steps[i].split(',')
        s = 0
        for item in items:
            values = item.split('=')
            # e.g. "910-988" evaluates to -78, so flip the sign and accumulate
            s += eval(values[1]) * (-1)
        print(steps[i-1], s)
Result:
6LVN 35
6LXT 122
6LXV 616
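If the values might come from a file you do not fully trust, the same arithmetic can be done without eval. The following is only a minimal sketch of that variation, not the method used above: `interval_lengths` is a hypothetical helper name, and it assumes every segment has the form `chains=first-last`.

```python
def interval_lengths(value):
    """Return last - first for every comma-separated 'chains=first-last' segment."""
    lengths = []
    for segment in value.split(','):
        # Keep only the part after '=', e.g. '910-988'
        _, _, residues = segment.partition('=')
        first, last = residues.strip().split('-')
        lengths.append(int(last) - int(first))
    return lengths


value = "A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"
print(interval_lengths(value))       # [78, 44]
print(sum(interval_lengths(value)))  # 122
```

Parsing with int() also raises a clear ValueError on a malformed segment instead of evaluating arbitrary text.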
CodePudding user response:
Here you go (note that the code below does not require any external library):
import xml.etree.ElementTree as ET
from collections import defaultdict
data = '''<r><dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference></r>'''
sizes = defaultdict(int)
root = ET.fromstring(data)
for ref in root.findall('.//dbReference'):
    pdb = ref.attrib['id']
    chains = ref.find('property[@type="chains"]')
    value = chains.attrib['value']
    parts = value.split(',')
    for part in parts:
        left, right = part.split('=')
        _left, _right = right.split('-')
        # accumulate, since one entry may list several ranges
        sizes[pdb] += int(_right) - int(_left)
print(sizes)
Output:
defaultdict(<class 'int'>, {'6LVN': 35, '6LXT': 122, '6LXV': 616})
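The question's data has no single root element and its code guards against entries without properties, so here is a rough sketch of how the loop above could cover both cases. `chain_sizes` is a hypothetical function name, the `<r>` wrapper is added only so the fragment parses, and the `0XXX` entry is made-up sample data with no chains property.

```python
import xml.etree.ElementTree as ET


def chain_sizes(fragment):
    """Map each PDB id to the summed length of its chain ranges."""
    # Wrap the fragment in a synthetic root so several top-level elements parse
    root = ET.fromstring(f"<r>{fragment}</r>")
    sizes = {}
    for ref in root.findall(".//dbReference[@type='PDB']"):
        chains = ref.find("property[@type='chains']")
        # Skip references that have no usable chains property
        if chains is None or not chains.get('value'):
            continue
        total = 0
        for part in chains.get('value').split(','):
            first, last = part.split('=')[1].split('-')
            total += int(last) - int(first)
        sizes[ref.get('id')] = total
    return sizes


sample = '''<dbReference type="PDB" id="6LXT">
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="0XXX">
<property type="method" value="NMR"/>
</dbReference>'''
print(chain_sizes(sample))  # {'6LXT': 122}
```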
CodePudding user response:
I used bs4 to find the values, a regex to get the intervals, then built-in functions to sum the differences of each interval.
data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''
from bs4 import BeautifulSoup
import re
xml_file = BeautifulSoup(data, 'lxml')
output = {}
for tag in xml_file.find_all(type="chains", value=True):
    # each match is a "first-last" interval such as "910-988"
    interval = re.findall(r'([0-9]+-[0-9]+)', tag['value'])
    output[tag.parent['id']] = sum(map(lambda p: abs(int(p[1])-int(p[0])), (map(lambda p: p.split('-'), interval))))
print(output)
Output
{'6LVN': 35, '6LXT': 122, '6LXV': 616}
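As a possible simplification of the regex step, two capture groups make re.findall return (first, last) tuples directly, so the extra split on '-' is not needed. This is only a sketch on a plain string, using one of the chains values from the question; the variable names are illustrative.

```python
import re

value = "A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"

# With two groups, findall yields one (first, last) tuple per interval
pairs = re.findall(r'(\d+)-(\d+)', value)
print(pairs)  # [('210', '488'), ('510', '688'), ('800', '960')]

total = sum(abs(int(last) - int(first)) for first, last in pairs)
print(total)  # 616
```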