I want to scrape the name, position, and type (online/in person) from elements like these:
<div data-user="178">
<div >
<div >
<img src="https://secure.gravatar.com/avatar/99574b52aaa5ecb0bea650602fecfbd7?s=100&d=mm&r=g" alt="Dina Abdelma">
</div>
</div>
<div >
<div class="cse-ul--name">Dina Abdelma</div>
<div class="cse-ul--position">Head of SMEs, MDI</div>
<div class="cse-ul--role">Online</div>
</div>
<div >
<div ></div>
<a >
Message </a>
<a href="#" >Schedule Meeting</a> </div>
</div>
I got past the login, but I cannot scrape all of the data; I only got the first letter of one name.
Also, the number in data-user is random for each element; nothing else changes. I want to scrape the data from these three elements and put it into an array or an Excel file:
<div class="cse-ul--name">Dina Abdelma</div>
<div class="cse-ul--position">Head of SMEs, MDI</div>
<div class="cse-ul--role">Online</div>
This is my current code to log in to the webpage (unrelated to the problem; it works):
await page.waitForSelector('#username')
await page.type('#username', login)
await page.type('#password', password)
await page.click('#ur-frontend-form > form > div > div > div > input')
await page.waitForSelector('#cse-main > div > div > section.cse-section.cse-section--links > div > a:nth-child(2)')
await page.click('#cse-main > div > div > section.cse-section.cse-section--links > div > a:nth-child(2)')
await page.waitForSelector('#cse-main > div.cse-page.cse-page--networking.cse-global-bg > section.cse-section.cse-section--userslist > div > div.cse-userslist-button > a')
await page.click('#cse-main > div.cse-page.cse-page--networking.cse-global-bg > section.cse-section.cse-section--userslist > div > div.cse-userslist-button > a')
EDIT:
var names = await page.$$eval('.cse-ul--name',
  elements => elements.map(item => item.textContent))
This works, but it doesn't scrape all of the data, only the data that is currently visible.
CodePudding user response:
You can use Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser') # html = the given html page from your question
# looks for a div, class='cse-ul--name' and decodes the contents of it
print(soup.find('div', 'cse-ul--name').decode_contents())
# looks for a div, class='cse-ul--position' and decodes the contents of it
print(soup.find('div', 'cse-ul--position').decode_contents())
# looks for a div, class='cse-ul--role' and decodes the contents of it
print(soup.find('div', 'cse-ul--role').decode_contents())
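Note that find only returns the first match of each class, so the snippet above prints a single person. Below is a minimal sketch of collecting every card into rows and saving them as a CSV file that Excel can open. It swaps Beautiful Soup for Python's standard-library html.parser and csv modules so it runs without extra installs. It assumes that each card is an element with a data-user attribute and that the three fields carry the classes cse-ul--name, cse-ul--position, and cse-ul--role, as used in this thread. CardParser, extract_rows, write_csv, and the file name attendees.csv are hypothetical names chosen for illustration.

```python
import csv
from html.parser import HTMLParser

# Assumed class names, taken from the selectors used in this thread.
FIELDS = {
    "cse-ul--name": "name",
    "cse-ul--position": "position",
    "cse-ul--role": "type",
}


class CardParser(HTMLParser):
    """Collects one dict per element that carries a data-user attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per card, in document order
        self.current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-user" in attrs:
            # A new card starts; the random data-user value itself is ignored.
            self.rows.append({"name": "", "position": "", "type": ""})
        classes = (attrs.get("class") or "").split()
        self.current = next((FIELDS[c] for c in classes if c in FIELDS), None)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        # Text is only kept while inside one of the three field elements.
        if self.current and self.rows:
            self.rows[-1][self.current] += data


def extract_rows(html):
    """Return a list of {'name', 'position', 'type'} dicts, one per card."""
    parser = CardParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in row.items()} for row in parser.rows]


def write_csv(rows, path="attendees.csv"):
    """Write the rows to a CSV file (hypothetical default file name)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "position", "type"])
        writer.writeheader()
        writer.writerows(rows)


sample = """
<div data-user="178">
  <div class="cse-ul--name">Dina Abdelma</div>
  <div class="cse-ul--position">Head of SMEs, MDI</div>
  <div class="cse-ul--role">Online</div>
</div>
"""
rows = extract_rows(sample)
print(rows)
```

Calling write_csv(rows) would then save the result. The parser only sees the HTML it is given, so if the page loads people lazily, the HTML has to be captured after all entries have been rendered (for example from Puppeteer's page.content()). This sketch also assumes the field elements contain plain text with no nested tags.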