I am working with XML files and I want to access the author id and the text of each file. I wrote the following code, but I am not sure about some of the lines.
from bs4 import BeautifulSoup

for filepath in dataset_filepaths:
    with open(filepath) as f:
        soup = BeautifulSoup(f, "lxml")
    body = soup.body
    # To select the author id (this is not working!)
    author = body.select('author_id')[0].get_text().strip()
    # To select the text
    soup_text = body.select('div')[0]
I have the following questions:
- How can I select the author id?
- To select the text, should I use 'div' or 'ab'?
- How can I select the title of a text?
Thank you so much for your help!
Here is the text:
<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t"/>
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2"/>
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
<lb n="3"/>
πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐
<lb n="4"/>
χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,
<lb n="5"/>
A valid (well-formed) version of the XML is below:
<?xml version="1.0" encoding="UTF-8"?>
<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t" />
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2" />
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
<lb n="3" />
πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐
<lb n="4" />
χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,
<lb n="5" />
</ab>
</div>
CodePudding user response:
You can do it this way.
Since you need author_id, select the <div>. As author_id is an attribute of that tag, you can extract it like this:
d = soup.find('div')
author_id = d['author_id']
For the title, select the <title> tag using .find() and print its text.
from bs4 import BeautifulSoup
s = """<?xml version="1.0" encoding="UTF-8"?>
<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t" />
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2" />
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
<lb n="3" />
πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐
<lb n="4" />
χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,
<lb n="5" />
</ab>
</div>"""
soup = BeautifulSoup(s, 'xml')
d = soup.find('div')
author_id = d['author_id']
title = d.find('title').text
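# stripped_strings yields every non-empty text node inside <ab>;
# [1:] skips the first one, which is the title text.
txt = list(d.ab.stripped_strings)[1:]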
print(f'Author_ID: {author_id}\nTitle: {title}\nText: {txt}')
Output:
Author_ID: 0093
Title: ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ
Text: ['Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν', 'πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐', 'χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,']
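For reference, the selector in the question most likely fails because select('author_id') looks for an element named <author_id>, while author_id is an attribute of <div>. The sketch below shows the difference using a CSS attribute selector. It is only an illustration: the shortened sample and the variable names (sample, no_match, div) are assumptions, and it uses the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the XML in the question.
sample = """<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t" />
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2" />
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
</ab>
</div>"""

soup = BeautifulSoup(sample, "html.parser")

# 'author_id' as a selector matches <author_id> elements; there are none,
# so indexing the result with [0] would raise an IndexError.
no_match = soup.select("author_id")

# 'div[author_id]' matches a <div> that carries an author_id attribute.
div = soup.select_one("div[author_id]")
author_id = div["author_id"]

# The title is an ordinary child element, so a tag selector is enough.
title = soup.select_one("div title").get_text(strip=True)

print(len(no_match))
print(author_id)
print(title)
```

Attribute values are read with dictionary-style access on the tag, exactly as in d['author_id'] above; select_one is just another way to reach the same <div>.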
CodePudding user response:
Something like the following should work:
from bs4 import BeautifulSoup
from bs4.element import NavigableString
html = '''<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t"/>
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2"/>
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
<lb n="3"/>
πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐
<lb n="4"/>
χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,
<lb n="5"/>'''
soup = BeautifulSoup(html, 'html.parser')
author_id = soup.div.attrs['author_id']
print('------------------')
print(f'author_id: {author_id}')
print('------------------')
title = soup.div.title.text
print(f'title: {title}')
print('------------------')
text = []
for child in soup.div.ab.children:
    # Keep only the bare text nodes directly under <ab>;
    # this skips the <lb> and <title> tags.
    if isinstance(child, NavigableString):
        stripped = child.strip()
        if stripped:
            text.append(stripped)
print(f'text: {text}')
Output:
------------------
author_id: 0093
------------------
title: ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ
------------------
text: ['Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν', 'πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐', 'χουστέρα, καὶ ὥσ γε δή τινέσ φασιν, ἀτακτοτέρα,']
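To tie this back to the loop over dataset_filepaths in the question, the same steps could be wrapped in a small helper and applied per file. This is a rough sketch under stated assumptions, not a definitive implementation: extract_fields is a hypothetical name, the files are assumed to be UTF-8 encoded, and each file is assumed to contain one <div> with one <ab> inside it. A temporary file stands in for the real dataset.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def extract_fields(filepath):
    """Return (author_id, title, lines) for one file (hypothetical helper)."""
    # Assumption: the files are UTF-8 encoded.
    with open(filepath, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    div = soup.find("div")
    title_tag = div.find("title")
    title = title_tag.get_text(strip=True) if title_tag else None
    # Bare text nodes directly under <ab> are the lines of the text.
    lines = [
        child.strip()
        for child in div.ab.children
        if isinstance(child, NavigableString) and child.strip()
    ]
    return div["author_id"], title, lines


# Shortened sample based on the XML in the question.
sample = """<div author_id="0093" work_id="006" work_short="Metaph">
<ab l0="0093" l1="006" l2="Metaph" l8="4a" l9="1t">
<lb n="1t" />
<title>ΘΕΟΦΡΑΣΤΟΥ ΤΩΝ ΜΕΤΑ ΤΑ ΦΥΣΙΚΑ</title>
<lb n="2" />
Πῶσ ἀφορίσαι δεῖ καὶ ποίοισ τὴν ὑπὲρ τῶν
<lb n="3" />
πρώτων θεωρίαν; ἡ γὰρ δὴ τῆσ φύσεωσ πολυ‐
</ab>
</div>"""

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.xml"
    path.write_text(sample, encoding="utf-8")
    dataset_filepaths = [path]  # stand-in for the real list of files
    results = [extract_fields(p) for p in dataset_filepaths]

author_id, title, lines = results[0]
print(f"author_id: {author_id}")
print(f"title: {title}")
print(f"lines: {lines}")
```

On the 'div' versus 'ab' question: in this sample the attributes such as author_id live on <div>, while the text lines are direct children of <ab>, so the sketch reads the id from <div> and the text from <ab>.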