My XML looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&amp;1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&amp;1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>
I want to parse certain fields into a dataframe.
Expected Output
timestamp id new_id level name
20220113 11 12 1&1 Alpha
20220113 12 31 1&1 Beta
The "name" nested inside the "visits" tag should not be included. I only want the outer "name" tag.
timestamp = soup.find('main_heading').get('timestamp')
df['timestamp'] = timestamp
This solves one part.
The rest I can do like this:
typ = []
for i in soup.find_all('typ'):
    typ.append(i.text)
But I don't want to write a separate for loop for every new field.
CodePudding user response:
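Iterate over the offers and look up the enclosing main_heading for each one with find_previous():
for e in soup.select('offer'):
    data.append({
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),  # the attribute in the XML is new_id
        'level': e.level.text,
        'typ': e.typ.text,
        # e.name is the tag's own name, so select the first <name> element instead
        'name': e.select_one('name').text
    })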
Alternatively, to be more generic and exclude only certain elements:
for e in soup.select('offer'):
    d = {
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),  # the attribute in the XML is new_id
    }
    # copy every direct child tag except <visits>; text nodes have c.name == None
    d.update({c.name: c.text for c in e.children if c.name is not None and 'visits' not in c.name})
    data.append(d)
Example
from bs4 import BeautifulSoup
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&amp;1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&amp;1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>
'''
soup = BeautifulSoup(xml,'xml')
data = []
for e in soup.select('offer'):
    data.append({
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),  # the attribute in the XML is new_id
        'level': e.level.text,
        'typ': e.typ.text,
        # first <name> in document order is the outer one, not the one in <visits>
        'name': e.select_one('name').text
    })
pd.DataFrame(data)
Output
  | timestamp | id | new_id | level | typ | name
---|---|---|---|---|---|---
0 | 20220113 | 11 | 12 | 1&1 | Green | Alpha
1 | 20220113 | 12 | 31 | 1&1 | Yellow | Beta
CodePudding user response:
pandas has .read_xml(). You can use xpath= to pass a custom XPath expression that specifies what to extract, for example the <offer> and <main_heading> tags:
>>> pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
timestamp details id new_id level typ name visits
0 20220113.0 NaN NaN NaN None None None NaN
1 NaN NaN 11.0 12.0 1&1 Green Alpha NaN
2 NaN NaN 12.0 31.0 1&1 Yellow Beta NaN
From there you could .ffill() the timestamp, drop the details/visits columns, and .dropna() to remove the leftover main_heading row:
>>> (pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
... .ffill()
... .drop(columns=["details", "visits"])
... .dropna()
... )
timestamp id new_id level typ name
1 20220113.0 11.0 12.0 1&1 Green Alpha
2 20220113.0 12.0 31.0 1&1 Yellow Beta
CodePudding user response:
No need for an external XML library. ElementTree from the standard library is enough for the parsing; pandas is only used to build the dataframe.
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&amp;1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&amp;1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>'''
data = []
root = ET.fromstring(xml)
timestamp = root.attrib.get('timestamp')
for offer in root.findall('.//offer'):
    temp = {'timestamp': timestamp}
    # attributes of <offer>
    for attr in ['id', 'new_id']:
        temp[attr] = offer.attrib.get(attr)
    # find() only looks at direct children, so the <name> inside <visits> is skipped
    for ele in ['level', 'name']:
        temp[ele] = offer.find(ele).text
    data.append(temp)
df = pd.DataFrame(data)
print(df)
Output
timestamp id new_id level name
0 20220113 11 12 1&1 Alpha
1 20220113 12 31 1&1 Beta
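If you'd rather not list every field by hand, the same loop can be made generic. This is only a sketch: the helper name offers_to_rows and its skip parameter are made up for illustration, and it assumes every direct child of <offer> that you keep holds plain text. It copies all attributes and all direct children except the tags named in skip, so the nested <name> inside <visits> is never visited.

```python
import xml.etree.ElementTree as ET

XML = '''<main_heading timestamp="20220113">
<details>
<offer id="11" new_id="12">
<level>1&amp;1</level>
<typ>Green</typ>
<name>Alpha</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
<offer id="12" new_id="31">
<level>1&amp;1</level>
<typ>Yellow</typ>
<name>Beta</name>
<visits>
<name>DONT INCLUDE</name>
</visits>
</offer>
</details>
</main_heading>'''


def offers_to_rows(xml_text, skip=('visits',)):
    """Hypothetical helper: build one dict per <offer> from its attributes
    and its direct children, ignoring any child tag listed in `skip`."""
    root = ET.fromstring(xml_text)
    rows = []
    for offer in root.iter('offer'):
        row = {'timestamp': root.get('timestamp')}
        row.update(offer.attrib)        # id, new_id
        for child in offer:             # direct children only
            if child.tag not in skip:
                row[child.tag] = child.text
        rows.append(row)
    return rows


rows = offers_to_rows(XML)
print(rows)
```

Passing the result to pd.DataFrame(rows) gives the dataframe. To leave out typ as well, call offers_to_rows(XML, skip=('visits', 'typ')).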
CodePudding user response:
For the sake of completeness (and for future visitors), here's another one: since the input is XML and the final output is a dataframe, it's probably simplest to use pandas.read_xml:
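from io import StringIO
import pandas as pd
# xml is the XML string from the question; it is wrapped in StringIO because
# newer pandas versions deprecate passing a literal XML string to read_xml
df = pd.read_xml(StringIO(xml), xpath='//offer')
ts = pd.read_xml(StringIO(xml), xpath='//main_heading')['timestamp'][0]
df.insert(0, 'timestamp', ts)
print(df.drop(['typ', 'visits'], axis=1))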
And that should get you your expected output.