How to get only necessary <div> with BeautifulSoup

Time:07-08

I need to parse some information about writers from Wikidata with BeautifulSoup.

Page: https://www.wikidata.org/wiki/Q39829

Problem

I need to parse the field "child" from the page. I expect to get 3 names in the result, but instead I get the 3 names plus 2 extra values.

Code

children_html = soup.find('div', id='P40').find_all('div', class_='wikibase-snakview-variation-valuesnak')
children_list = [child.text.strip() for child in children_html]
print(children_list)

The result is:

['Joe Hill', 'Owen King', 'Naomi King', 'https://books.google.de/books?id=aPBbAgAAQBAJ', '81']

Question

Is there any way to get only the names in the result:

['Joe Hill', 'Owen King', 'Naomi King']

The code should also work for other writers' pages, where the writer may have fewer or more children.

CodePudding user response:

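You are very close to your goal. Simply change the class to wikibase-statementview-mainsnak, which is more specific because it wraps only the main value of each statement. The class wikibase-snakview-variation-valuesnak presumably also occurs inside the references, which would explain the extra URL and the page number in your result: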

soup.find('div', id='P40').find_all('div', class_='wikibase-statementview-mainsnak')

As an alternative, you could use CSS selectors as a shorthand:

soup.select('#P40 .wikibase-statementview-mainsnak')

After extracting the text from the matched elements as in your code, both will give you:

['Joe Hill', 'Owen King', 'Naomi King']
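To make the difference between the two classes concrete, here is a minimal sketch that runs without fetching the page. The HTML below is a simplified, hand-written stand-in for the Wikidata markup (an assumption, not a copy of the real page), and the wikibase-statementview-references wrapper is only there to illustrate where the extra values may come from:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page (assumption): every statement has a
# main snak with the value, and the last one also carries a reference whose
# entries reuse the generic value-snak class.
SAMPLE_HTML = """
<div id="P40">
  <div class="wikibase-statementview">
    <div class="wikibase-statementview-mainsnak">
      <div class="wikibase-snakview-variation-valuesnak"><a>Joe Hill</a></div>
    </div>
  </div>
  <div class="wikibase-statementview">
    <div class="wikibase-statementview-mainsnak">
      <div class="wikibase-snakview-variation-valuesnak"><a>Owen King</a></div>
    </div>
  </div>
  <div class="wikibase-statementview">
    <div class="wikibase-statementview-mainsnak">
      <div class="wikibase-snakview-variation-valuesnak"><a>Naomi King</a></div>
    </div>
    <div class="wikibase-statementview-references">
      <div class="wikibase-snakview-variation-valuesnak">https://books.google.de/books?id=aPBbAgAAQBAJ</div>
      <div class="wikibase-snakview-variation-valuesnak">81</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Generic class: matches the main values and the reference values.
broad = [el.text.strip() for el in soup.select("#P40 .wikibase-snakview-variation-valuesnak")]

# More specific class: matches only the main value of each statement.
narrow = [el.text.strip() for el in soup.select("#P40 .wikibase-statementview-mainsnak")]

print(broad)
print(narrow)
```

On this sample, the generic class picks up five values and the more specific one only the three names.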

Be aware: to avoid running into NoneType errors, you should always check whether an element exists before calling methods on it:

# Look up the property block once and reuse the result
p40 = soup.find('div', id='P40')
if p40 is not None:
    children_html = p40.find_all('div', class_='wikibase-statementview-mainsnak')
    children_list = [child.text.strip() for child in children_html]
    print(children_list)
else:
    children_list = []
    print('no child found')

Or in one line, which generates an empty list in case there are no children:

children_list = [child.text.strip() for child in soup.select('#P40 .wikibase-statementview-mainsnak')]
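Since the code should work for other writers' pages as well, the one-liner could be wrapped in a small function. This is only a sketch: property_values is a hypothetical helper name, the HTML is a hand-written stand-in rather than the real page, and P22 is used merely as an example of an id that is absent from the sample:

```python
from bs4 import BeautifulSoup


def property_values(soup, prop_id):
    """Return the main value text of every statement under the given id.

    Hypothetical helper: select() yields an empty list when nothing
    matches, so a missing property block simply gives [].
    """
    selector = f"#{prop_id} .wikibase-statementview-mainsnak"
    return [el.text.strip() for el in soup.select(selector)]


# Hand-written stand-in page (assumption): one child, no other property.
html = """
<div id="P40">
  <div class="wikibase-statementview-mainsnak">
    <div class="wikibase-snakview-variation-valuesnak">Joe Hill</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

children = property_values(soup, "P40")
missing = property_values(soup, "P22")  # this id is not in the sample

# find() returns None for a missing element, which is why chaining
# .find_all() onto it without a check raises an AttributeError.
missing_block = soup.find("div", id="P22")

print(children, missing, missing_block)
```

The select() variant needs no explicit check because an unmatched selector produces an empty list rather than None.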