I need to parse some information about writers from Wikidata with BeautifulSoup.
Page: https://www.wikidata.org/wiki/Q39829
Problem
I need to parse the field "child" from the page. I expect to get 3 names in the result, but instead I get the 3 names plus 2 extra values.
Code
children_html = soup.find('div', id='P40').find_all('div', class_='wikibase-snakview-variation-valuesnak')
children_list = [child.text.strip() for child in children_html]
print(children_list)
The result is:
['Joe Hill', 'Owen King', 'Naomi King', 'https://books.google.de/books?id=aPBbAgAAQBAJ', '81']
Question
Is there any way to get only the names in the result:
['Joe Hill', 'Owen King', 'Naomi King']
The code should also work for other writers' pages, which may list fewer or more children.
CodePudding user response:
You are close to your goal. Simply change the class to the more specific wikibase-statementview-mainsnak, which wraps only the main value of each statement, so values that belong to references are skipped:
soup.find('div', id='P40').find_all('div', class_='wikibase-statementview-mainsnak')
As an alternative, you could use CSS selectors as a shorthand:
soup.select('#P40 .wikibase-statementview-mainsnak')
Either one, fed into the same list comprehension as before, will give you:
['Joe Hill', 'Owen King', 'Naomi King']
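To see why the narrower class helps, here is a minimal sketch on a simplified, made-up fragment of markup. The real Wikidata page is nested more deeply; MOCK_HTML and the exact wrapper around the reference values are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the Wikidata markup.
# It only mimics the relevant idea: the value class appears both inside
# the main value of a statement and inside its references.
MOCK_HTML = """
<div id="P40">
  <div class="wikibase-statementview">
    <div class="wikibase-statementview-mainsnak">
      <div class="wikibase-snakview-variation-valuesnak">Joe Hill</div>
    </div>
    <div class="wikibase-statementview-references">
      <div class="wikibase-snakview-variation-valuesnak">https://books.google.de/books?id=aPBbAgAAQBAJ</div>
      <div class="wikibase-snakview-variation-valuesnak">81</div>
    </div>
  </div>
  <div class="wikibase-statementview">
    <div class="wikibase-statementview-mainsnak">
      <div class="wikibase-snakview-variation-valuesnak">Owen King</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(MOCK_HTML, "html.parser")

# Broad class: matches main values AND the values inside references.
broad = [el.text.strip() for el in soup.select("#P40 .wikibase-snakview-variation-valuesnak")]

# Narrow class: matches only the main value of each statement.
narrow = [el.text.strip() for el in soup.select("#P40 .wikibase-statementview-mainsnak")]

print(broad)   # ['Joe Hill', 'https://books.google.de/books?id=aPBbAgAAQBAJ', '81', 'Owen King']
print(narrow)  # ['Joe Hill', 'Owen King']
```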
Be aware: to avoid running into NoneType errors, you should always check whether an element exists before calling methods on it:
if soup.find('div', id='P40'):
    children_html = soup.find('div', id='P40').find_all('div', class_='wikibase-statementview-mainsnak')
    children_list = [child.text.strip() for child in children_html]
    print(children_list)
else:
    children_list = []
    print('no child found')
Or, in one line that generates an empty list when there are no children:
children_list = [child.text.strip() for child in soup.select('#P40 .wikibase-statementview-mainsnak')]
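If you need this for several writers, the one-liner could be wrapped in a small helper. This is only a sketch: property_values is a hypothetical name, and the two HTML snippets are made-up stand-ins for a page with and without the "child" block. It relies on select() returning an empty list when nothing matches, so no explicit None check is needed.

```python
from bs4 import BeautifulSoup


def property_values(soup, property_id):
    """Return the main-value texts of one property block, or [] if it is absent.

    property_values is a hypothetical helper, not part of BeautifulSoup.
    """
    selector = f"#{property_id} .wikibase-statementview-mainsnak"
    return [el.text.strip() for el in soup.select(selector)]


# Made-up, simplified markup for illustration only.
with_children = BeautifulSoup(
    '<div id="P40">'
    '<div class="wikibase-statementview-mainsnak">Joe Hill</div>'
    '<div class="wikibase-statementview-mainsnak">Owen King</div>'
    '</div>',
    "html.parser",
)
without_children = BeautifulSoup('<div id="P106"></div>', "html.parser")

print(property_values(with_children, "P40"))     # ['Joe Hill', 'Owen King']
print(property_values(without_children, "P40"))  # []
```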