How do I scrape data from these HTML tags using Python and bs4

Time:02-18

I am able to scrape the data successfully; now I want to write it to a CSV file (comma separated).

I want to extract the elements with the classes "ui-h2", "help__content help__content--small" and "exam-name". I have the code, but as output I want:

  • April 2022, 3 Apr 2022 , UPPSC ACF RFO Mains
  • April 2022, 3 Apr 2022 , MPSC Group C
<div  data-toggle="collapse" data-target="#exam-4" aria-expanded=false>
  <div >April 2022 <span >14 Exams</span></div>
</div>
<div  id="exam-4" data-parent="#exam-month">
  <div >
      <div >
          <div >
              <a href="https://testbook.com/uppsc-acf-rfo" >
                  <div>
                      <span ></span>
                      <span >3 Apr 2022</span>
                      <span >Official</span>
                  </div>
                  <div >
                      <span >
                      <img src="https://blogmedia.testbook.com/blog/wp-content/uploads/2020/06/uttar-pradesh-logo-png-8-5bbbec3b.png" height="30">
                      </span>
                      <span  title="UPPSC ACF RFO Mains">UPPSC ACF RFO Mains</span>
                      <span >
                      Know More <span ></span>
                      </span>
                  </div>
              </a>
          </div>
      </div>
      <div >
          <div >
              <a href="https://testbook.com/mpsc-group-c" >
                  <div>
                      <span ></span>
                      <span >3 Apr 2022</span>
                      <span >Official</span>
                  </div>
                  <div >
                      <span >
                      <img src="https://blogmedia.testbook.com/blog/wp-content/uploads/2020/03/mpsc-logo-1-44a80da2.png" height="30">
                      </span>
                      <span  title="MPSC Group C">MPSC Group C</span>
                      <span >
                      Know More <span ></span>
                      </span>
                  </div>
              </a>
          </div>
      </div>
</div>
</div>
for contents in soup.find_all("div", {"class":"ui-h2"}):
    #print(contents)
    if contents.text is not None:
            #print(contents.text)
            f.write(contents.text + "-")

    for contentspan2 in soup.find_all("span", {"class":"help__content help__content--small"}):
        
        if contentspan2.string is not None:
            #print(contentspan2.string)
            f.write(contentspan2.string + ",")
            

        for contentspan in soup.find_all("span", {"class":"exam-name"}):
        
            if contentspan.string is not None:
                #print(contentspan.string)
                f.write(contentspan.string + "\n")

CodePudding user response:

Construct a list of the rows you want to write to the file, where each element is a dictionary that maps a column name to its value. Then let pandas do the work.

So given the html you provided:

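import re

import pandas as pd
from bs4 import BeautifulSoup

# `html` is the markup from the question, held in a string
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('div', {'class':'row'})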

rowList = []
for row in rows:
    cards = row.find_all('div', {'class':re.compile("^ui-card")})
    for card in cards:
        dateStr = card.find('span',{'class':re.compile("^help__content")}).text.strip()
        examName = card.find('span', {'class':'exam-name'}).text
        rowList.append({'date':dateStr,
                        'exam':examName})

df = pd.DataFrame(rowList)
df.to_csv('filename.csv', index=False)

Output:

print(df)
         date                 exam
0  3 Apr 2022  UPPSC ACF RFO Mains
1  3 Apr 2022         MPSC Group C
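
The question also asks for the month ("April 2022") in front of each row. Here is a rough sketch of one way to get it, using the standard csv module instead of pandas. It relies on the data-target="#exam-4" attribute of the header pointing at the id="exam-4" panel, which is visible in the posted HTML. The class attributes in SAMPLE_HTML are assumptions: the posted markup lost them, so they are placed where the question's wording suggests they belong, and the markup is trimmed to the relevant tags. The names extract_rows and rows_to_csv are only illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. The class attributes are assumptions:
# the posted HTML has none, so they are placed where the question implies.
SAMPLE_HTML = """
<div data-toggle="collapse" data-target="#exam-4" aria-expanded="false">
  <div class="ui-h2">April 2022 <span>14 Exams</span></div>
</div>
<div id="exam-4" data-parent="#exam-month">
  <a href="https://testbook.com/uppsc-acf-rfo">
    <div>
      <span class="help__content help__content--small">3 Apr 2022</span>
      <span>Official</span>
    </div>
    <div>
      <span class="exam-name" title="UPPSC ACF RFO Mains">UPPSC ACF RFO Mains</span>
    </div>
  </a>
  <a href="https://testbook.com/mpsc-group-c">
    <div>
      <span class="help__content help__content--small">3 Apr 2022</span>
      <span>Official</span>
    </div>
    <div>
      <span class="exam-name" title="MPSC Group C">MPSC Group C</span>
    </div>
  </a>
</div>
"""


def extract_rows(html):
    """Return one (month, date, exam) tuple per exam link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Each collapsible header names its panel through data-target.
    for header in soup.find_all("div", attrs={"data-toggle": "collapse"}):
        month_div = header.find("div", class_="ui-h2")
        # Take only the first text node so the "14 Exams" counter is skipped.
        month = next(month_div.stripped_strings)
        # data-target looks like "#exam-4"; drop the "#" to get the panel id.
        panel = soup.find(id=header["data-target"].lstrip("#"))
        for card in panel.find_all("a"):
            date = card.find("span", class_="help__content").get_text(strip=True)
            exam = card.find("span", class_="exam-name").get_text(strip=True)
            rows.append((month, date, exam))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["month", "date", "exam"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows), end="")
```

To write a file instead of a string, the same csv.writer calls should work on a handle opened with open("exams.csv", "w", newline=""). Looping over the headers first is what keeps each month attached to its own exams, which the three nested find_all loops in the question cannot do because each one searches the whole document.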