I'm trying to scrape data from a website. First I authenticate and start a session; that part works without problems. Now I would like to scrape my test questions. A test has 100 questions, each with a unique URL, but only members have access to them.
import bs4
import requests

with requests.session() as s:
    s.post(loginURL, data=payLoad)  # log in; the session keeps the cookies
    res = s.get(targetURL)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    elems = soup.find_all("div", class_="Question-Container")
    print(elems)
When I run this code, I don't get the data I want. The output looks like this:
[<div >
<div >
<div >
<span><b>Question Id: </b></span><span ></span>
</div>
</div>
<div >
<div ></div>
</div>
<div hideanswer="false"></div>
<div hideanswer="false">
<button >Show Solution</button>
<div ></div>
<div ></div>
</div>
</div>]
The output I want is the data inside those elements. The div tree looks like this: there are a lot of divs. One is for the question ID, and there is another div, QuestionText, but the question text is inside a nested element. There are four options for each question (class=QuestionOptions), and so on. I want to scrape all of them. A screenshot is attached for clarity: Screenshot of nested divs
And this is how it looks on the original website: Original Page to scrape
CodePudding user response:
for i in elems:
    gg = i.find_all('div')
    print(gg)
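If the goal is the text rather than the tags, `stripped_strings` can be easier to read than printing the nested divs. The following is only a small sketch on made-up markup: the class names `QuestionId` and `QuestionText` and the sample values are assumptions for illustration, not taken from the real page.

```python
import bs4

# Hypothetical markup: class names and values are assumptions for illustration.
html = """
<div class="Question-Container">
  <div class="QuestionId"><span><b>Question Id: </b></span><span>101</span></div>
  <div class="QuestionText"><p>What is 2 + 2?</p></div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
for container in soup.find_all("div", class_="Question-Container"):
    # stripped_strings yields every text fragment below the container,
    # with surrounding whitespace removed and empty fragments skipped
    texts = list(container.stripped_strings)
    print(texts)
```

If this prints only the labels (such as "Question Id:") for the real page, the values are probably not present in the HTML that `requests` received.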
CodePudding user response:
Assuming, as you mentioned in the comments, that all the data is present in your soup,
you could go with:
import pandas as pd
...
soup = bs4.BeautifulSoup(res.text, "html.parser")
data = []
for e in soup.select('.Question-Container'):
    d = {
        'question': e.select_one('.qText').text if e.select_one('.qText') else None
    }
    # each .answerText is expected to yield a (label, text) pair, e.g. ('A', 'answer text a')
    d.update(dict(a.stripped_strings for a in e.select('.answerText')))
    data.append(d)
df = pd.DataFrame(data)
The output would look something like this:
| | question | A | B |
|---|---|---|---|
| 0 | my question text | answer text a | answer text b |
...
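The `dict(...)` step in the loop works because `dict()` accepts any iterable of two-item sequences and turns each into a key and a value. A minimal sketch of just that step, with plain lists standing in for `list(tag.stripped_strings)` (the sample values are made up), also shows one way to guard against an option that does not yield exactly two strings, which would otherwise make `dict()` raise a `ValueError`:

```python
# Each inner list stands in for list(tag.stripped_strings) of one '.answerText'
# element; the values are made up for illustration.
answer_strings = [
    ["A", "answer text a"],
    ["B", "answer text b"],
    ["C"],  # an option without text: passing this to dict() would raise ValueError
]

row = {"question": "my question text"}
# Keep only well-formed (label, text) pairs before building the dict.
row.update(dict(pair for pair in answer_strings if len(pair) == 2))
print(row)
```

Each label ("A", "B", ...) becomes a key in `row`, which is why the labels show up as column names once the rows are passed to `pd.DataFrame`.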