How to scrape nested divs with BeautifulSoup?

Time:07-11

I'm trying to scrape data from a website. First I authenticate and start a session, and that part works without problems. What I want to scrape are my test questions: each test has 100 questions at a unique URL that only members can access.

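import bs4
import requests

# loginURL, payLoad and targetURL are defined elsewhere
with requests.Session() as s:
    s.post(loginURL, data=payLoad)  # log in; the session keeps the cookies
    res = s.get(targetURL)          # fetch the members-only test page

res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
elems = soup.find_all("div", class_="Question-Container")
print(elems)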

When I run this code, I don't get the data I want. The output looks like this:

[<div >
<div >
<div >
<span><b>Question Id: </b></span><span ></span>
</div>
</div>
<div >
<div ></div>
</div>
<div  hideanswer="false"></div>
<div  hideanswer="false">
<button >Show Solution</button>
<div ></div>
<div ></div>
</div>
</div>]

The output I want is the data inside those elements. The div tree is deeply nested: one div holds the question ID, and another div, QuestionText, holds the question, although the question text itself sits inside a further nested element. Each question has four options (class=QuestionOptions), and so on. I want to scrape all of them. A screenshot of the nested divs is attached for clarity.

And this is how it looks on the original website (screenshot: original page to scrape).

CodePudding user response:

for i in elems:
    # collect every div nested inside this question container
    gg = i.find_all('div')
    print(gg)
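
Note that find_all('div') returns the outer divs as well as the ones nested inside them, so the same text can show up more than once. A minimal sketch of one way around that, keeping only the innermost divs and reading their text, is below. The sample HTML, the class names outer and inner, and the ID value 101 are made up for illustration; only Question-Container comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the Question-Container class is taken from the
# question; the other class names and the ID value are invented.
SAMPLE_HTML = """
<div class="Question-Container">
  <div class="outer">
    <div class="inner"><span><b>Question Id: </b></span><span>101</span></div>
  </div>
  <div class="outer"><div class="inner"></div></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

texts = []
for container in soup.find_all("div", class_="Question-Container"):
    for div in container.find_all("div"):
        # Skip divs that wrap other divs, so each text is collected once.
        if div.find("div") is not None:
            continue
        text = div.get_text(" ", strip=True)
        if text:  # ignore divs that carry no text at all
            texts.append(text)

print(texts)
```

If the list comes back empty on the real page, the divs in the fetched HTML most likely carry no text, and changing the loop will not help.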

CodePudding user response:

Assuming, as you mentioned in the comments, that all of the data is present in your soup, you could go with:

import pandas as pd

...
soup = bs4.BeautifulSoup(res.text, "html.parser")
data = []
for e in soup.select('.Question-Container'):
    q = e.select_one('.qText')
    d = {'question': q.text if q else None}
    # each .answerText must yield exactly two strings: option label and text
    d.update(dict(opt.stripped_strings for opt in e.select('.answerText')))
    data.append(d)

df = pd.DataFrame(data)

The output would look something like this:

           question              A              B
0  my question text  answer text a  answer text b
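
To see why each option ends up as its own column, here is a small self-contained sketch that runs without logging in. The HTML is an assumption about how the page might be structured: the class names qText and answerText are the ones used above, but the nesting of label and text inside each option is a guess. The helper name parse_questions is hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed markup: each .answerText holds a label ("A") and the answer text
# as two separate strings. The real page may be laid out differently.
SAMPLE_HTML = """
<div class="Question-Container">
  <div class="qText">my question text</div>
  <div class="answerText"><b>A</b> <span>answer text a</span></div>
  <div class="answerText"><b>B</b> <span>answer text b</span></div>
</div>
"""

def parse_questions(html):
    """Return one dict per question container (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.select(".Question-Container"):
        q = container.select_one(".qText")
        row = {"question": q.get_text(strip=True) if q else None}
        for opt in container.select(".answerText"):
            # stripped_strings yields the non-empty text pieces in order.
            parts = list(opt.stripped_strings)
            if len(parts) == 2:  # label and text; skip anything else
                label, text = parts
                row[label] = text
        rows.append(row)
    return rows

rows = parse_questions(SAMPLE_HTML)
print(rows)
```

Unlike the one-line dict(...) version, this skips any option that does not split into exactly two strings instead of raising an error, which may be easier to debug on a real page.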

...
