I'm running into a strange problem. I'm using BeautifulSoup to scrape dictionary websites for definitions and their parts of speech, and they have to be scraped in order so that each part of speech is paired with the right definition.
For example, for 'ape' the definition 'A large primate' has to go with noun and 'mimic' has to go with verb. For Merriam-Webster's site I used:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.merriam-webster.com/dictionary/'
word = 'ape'
results = requests.get(url + word)
src = results.content
soup = bs(src, 'lxml')
text = soup.find_all(class_= ['num', 'letter', 'dtText', 'sdsense', 'important-blue-link'])
for tag in text:
    print(tag.text.strip())
This worked great. For each element with class 'num', 'letter', etc., it picked out the correct elements, and print(tag.text.strip()) printed the text inside.
Unfortunately, MW's markup is a nightmare (notice there are way more classes than just part of speech and definition) and the definitions are wordier than what I'm looking for, so I went to dictionary.com. Dictionary.com has much simpler markup and better definitions for my purposes, so I was happy. The problem happens when I try to pass multiple classes into the find_all function. If I run:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.dictionary.com/browse/'
word = 'ape'
results = requests.get(url + word, verify = False)
src = results.content
soup = bs(src, 'lxml')
text = soup.find_all(class_ = 'one-click-content css-nnyc96 e1q3nk1v1')
for tag in text:
    print(tag.text.strip())
I get all the definitions fine, and if I run the same code with
text = soup.find_all(class_ = 'luna-pos')
I get all the parts of speech fine, but if I run the code with
text = soup.find_all(class_ = ['luna-pos','one-click-content css-nnyc96 e1q3nk1v1'])
it returns an empty list. I don't understand why this way of passing multiple classes to find_all() works for one website but not the other. The only thing I can think of is that requests.get() isn't finding dictionary.com's certificates, so I passed verify = False, which produces a small warning, but I can't see why that would affect find_all().
CodePudding user response:
Not sure why you need to combine these two lookups, but you can get the result you want by passing one class name per list entry:
soup.find_all(class_ = ['luna-pos','one-click-content'])
or
soup.select('.luna-pos,.one-click-content')
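To try both forms without hitting the live site, here is a minimal sketch on a small inline HTML string. The markup is hypothetical and only imitates the class names mentioned in the question. It also uses the built-in html.parser instead of lxml, purely to avoid the extra dependency. Both calls return matching tags in document order, which is what keeps each part of speech next to its definition.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names from the question;
# this is NOT the real dictionary.com HTML.
html = """
<html><body>
<section>
  <span class="luna-pos">noun</span>
  <div class="one-click-content css-nnyc96 e1q3nk1v1">a large primate</div>
</section>
<section>
  <span class="luna-pos">verb</span>
  <div class="one-click-content css-nnyc96 e1q3nk1v1">to mimic</div>
</section>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One class name per list entry: a tag matches if it carries any of them.
by_list = [
    tag.get_text(strip=True)
    for tag in soup.find_all(class_=["luna-pos", "one-click-content"])
]

# The same selection written as a CSS selector group.
by_css = [
    tag.get_text(strip=True)
    for tag in soup.select(".luna-pos, .one-click-content")
]

print(by_list)
print(by_css)
```

Using only the stable-looking class name (one-click-content) and dropping the generated-looking ones (css-nnyc96, e1q3nk1v1) is a judgment call: the assumption is that the generated names change more often than the descriptive one.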
Just in case: to get separated, more structured output, change your strategy for selecting elements and process each section on its own:
data = []
for e in soup.select('#top-definitions-section ~ section'):
    data.append({
        'pos': e.select_one('.luna-pos').text,
        'definition': [t.get_text(strip=True) for t in e.select('div[value]')]
    })
data
Output:
[{'pos': 'noun',
'definition': ['Anthropology,Zoology.any member of the superfamily Hominoidea, the two extant branches of which are the lesser apes (gibbons) and the great apes (humans, chimpanzees, gorillas, and orangutans).See alsocatarrhine.',
'(loosely) any primate except humans.',
'an imitator;mimic.',
'Informal.a big, ugly, clumsy person.',
'Disparaging and Offensive.(used as a slur against a member of a racial or ethnic minority group, especially a Black person.)']},
{'pos': 'verb (used with object),',
'definition': ["toimitate;mimic:to ape another's style of writing."]},
{'pos': 'adjective',
'definition': ['Slang. (usually in the phrasego ape)violently emotional:When she threatened to leave him, he went ape.extremely enthusiastic (often followed byoverorfor):They go ape over old rock music.We were all ape for the new movie trailer.']}]
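The per-section approach can also be sketched offline. The HTML below is a hypothetical stand-in for the page layout the selectors assume (a #top-definitions-section element followed by sibling section elements); the real page may differ or change. The None check is an addition that is not in the snippet above: it skips any section without a .luna-pos element instead of raising AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the layout the selectors assume;
# this is NOT the real dictionary.com HTML.
html = """
<html><body><div id="main">
  <div id="top-definitions-section"></div>
  <section>
    <span class="luna-pos">noun</span>
    <div value="1">a large primate</div>
    <div value="2">an imitator</div>
  </section>
  <section>
    <span class="luna-pos">verb</span>
    <div value="1">to mimic</div>
  </section>
  <section><p>related words</p></section>
</div></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
# "~" selects every <section> that is a later sibling of the anchor element.
for section in soup.select("#top-definitions-section ~ section"):
    pos = section.select_one(".luna-pos")
    if pos is None:
        # Section without a part of speech (e.g. unrelated content): skip it.
        continue
    data.append({
        "pos": pos.get_text(strip=True),
        # Each definition is assumed to live in a <div> with a value attribute.
        "definition": [d.get_text(strip=True) for d in section.select("div[value]")],
    })

print(data)
```

Grouping by section keeps each part of speech tied to its own definitions, so the pairing no longer depends on how the tags happen to be interleaved in a flat list.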