Scraping: No attribute find_all for <p>

Time:09-10

Good morning :)

I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313

The text I am trying to get seems to be located inside some <p> tags and separated by <br>. For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().

My code is below (it is a very simple script with no loop yet; I just want to identify where the error comes from):

from selenium import webdriver
import time
from bs4 import BeautifulSoup


options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)

url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")

column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")

Is there anything obvious I am not understanding here?

Thanks a lot in advance for your help!

CodePudding user response:

You are calling find_all() on a ResultSet, i.e. on a list of elements, which is incorrect: find_all() returns a ResultSet, and a ResultSet has no find() or find_all() methods of its own. You need to iterate over the results and call find() on each individual element. The correct way is as follows; hope it works for you.
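To make the error itself concrete, here is a minimal sketch with made-up HTML (the real page's markup may differ): find_all() gives you a list-like ResultSet, and only the individual Tag objects inside it support find()/find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching divs, just to illustrate the types
html = "<div class='col-sm-12'><p>Alice</p></div><div class='col-sm-12'><p>Bob</p></div>"
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div", class_="col-sm-12")  # ResultSet: a list of Tags
# divs.find("p")  # would raise: ResultSet object has no attribute 'find'
first = divs[0].find("p")  # index into the list first; a Tag supports find()
print(first.get_text())  # Alice
```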

columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
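Since the original goal was text separated by <br> tags, note that get_text(strip=True) will run the fragments together. A minimal sketch of getting one string per line instead, using stripped_strings (the markup here is made up; the real roster page may be structured differently):

```python
from bs4 import BeautifulSoup

# Hypothetical roster-like markup: several names in one <p>, separated by <br>
html = "<div class='col-sm-12'><p>SMITH, JANE<br>DOE, JOHN<br>Chairperson</p></div>"
soup = BeautifulSoup(html, "html.parser")

for column in soup.find_all("div", class_="col-sm-12"):
    p = column.find("p")
    # stripped_strings yields each text fragment between the <br> tags
    lines = list(p.stripped_strings)
    print(lines)  # ['SMITH, JANE', 'DOE, JOHN', 'Chairperson']
```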

Full working code as an example:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver") # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)

url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)

content = driver.page_source
soup = BeautifulSoup(content,"html.parser")

columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)

Output:

Notice of NIH Policy to All Applicants: Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined in NOT-OD-22-044, including removal of the application from immediate review.