I've seen a couple of posts with this same question, but their scripts usually wait until one of the elements (buttons) is clickable. Here is the table I'm trying to scrape:
https://ropercenter.cornell.edu/presidential-approval/highslows
On the first couple of tries, my code returned all the rows except both Polling Organization columns. Without my changing anything, it now scrapes only the table headers and the tbody tag (no table rows).
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)
driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns=['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])
Should I use an explicit wait? If so, which condition should I wait for, since the dynamic table is not interactive?
Also, why did the output of my code change after running it multiple times?
CodePudding user response:
This may feel like cheating, but an easier solution that solves your problem in a different way is to look at what the frontend does (using the browser's developer tools). You will find that it calls an API that returns JSON, so Selenium is not really needed: requests and pandas are enough.
import requests
import pandas as pd
url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"
data = requests.get(url).json()
df = pd.json_normalize(data)  # pd.io.json.json_normalize is the older, deprecated spelling
>>> df
president.id president.active president.surname president.givenname president.shortname ... low.approve low.disapprove low.noOpinion low.sampleSize low.presidentName
0 e9c0d19b-dfe9-49cf-9939-d06a0f256e57 True Biden Joe None ... 33 53 13 1313.0 Joe Biden
1 bc9855d5-8e97-4448-b62e-1fb2865c79e6 True Trump Donald None ... 29 68 3 5360.0 Donald Trump
2 1c49881f-0f0c-4a53-9b2c-0dd6540f88e4 True Obama Barack None ... 37 57 5 1017.0 Barack Obama
3 ceda6415-5975-404d-8049-978758a7d1f8 True Bush George W. W. Bush ... 19 77 4 1100.0 George W. Bush
4 4f7344de-a7bd-4bc6-9147-87963ae51095 True Clinton Bill None ... 36 50 14 800.0 Bill Clinton
5 116721f1-f947-4c14-b0b5-d521ed5a4c8b True Bush George H.W. H.W. Bush ... 29 60 11 1001.0 George H.W. Bush
6 43720f8f-0b9f-43b0-8c0d-63da059e7a57 True Reagan Ronald None ... 35 56 9 1555.0 Ronald Reagan
7 7aa76fd3-e1bc-4e9a-b13c-463a64e0c864 True Carter Jimmy None ... 28 59 13 1542.0 Jimmy Carter
8 6255dd77-531d-46c6-bb26-627e2a4b3654 True Ford Gerald None ... 37 39 24 1519.0 Gerald Ford
9 f1a23b06-4200-41e6-b137-dd46260ac4d8 True Nixon Richard None ... 23 55 22 1589.0 Richard Nixon
10 772aabfd-289b-4f10-aaae-81a82dd3dbc6 True Johnson Lyndon B. None ... 35 52 13 1526.0 Lyndon B. Johnson
11 d849b5a8-f711-4ac9-9728-c3915e17bb6a True Kennedy John F. None ... 56 30 14 1550.0 John F. Kennedy
12 e22fd64a-cf20-4bc4-8db6-b4e71dc4483d True Eisenhower Dwight D. None ... 48 36 16 NaN Dwight D. Eisenhower
13 ab0bfa04-61da-49d1-8069-6992f6124f17 True Truman Harry S. None ... 22 65 13 NaN Harry S. Truman
14 11edf04f-9d8d-4678-976d-b9339b46705d True Roosevelt Franklin D. None ... 48 43 8 NaN Franklin D. Roosevelt
[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
'president.givenname', 'president.shortname', 'president.fullname',
'president.number', 'president.terms', 'president.ratings',
'president.termCount', 'president.ratingCount', 'high.id',
'high.active', 'high.organization.id', 'high.organization.active',
'high.organization.name', 'high.organization.ratingCount',
'high.pollingStart', 'high.pollingEnd', 'high.updated',
'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
'low.organization.id', 'low.organization.active',
'low.organization.name', 'low.organization.ratingCount',
'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
'low.presidentName'],
dtype='object')
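Most of those 41 columns are internal IDs and flags. As a rough sketch of how the frame could be trimmed to something resembling the table on the page, the helper below keeps a handful of the columns listed above and renames them. The function name and the short column labels are hypothetical choices made for illustration, and the nested JSON layout is inferred from the dotted column names rather than from any API documentation.

```python
import pandas as pd

# Keys come from the df.columns listing above; the labels on the right are
# illustrative names chosen here, not anything the API itself returns.
COLUMNS = {
    "high.presidentName": "President",
    "high.approve": "Highest %",
    "high.organization.name": "High: Organization",
    "high.pollingStart": "High: Start",
    "high.pollingEnd": "High: End",
    "low.approve": "Lowest %",
    "low.organization.name": "Low: Organization",
    "low.pollingStart": "Low: Start",
    "low.pollingEnd": "Low: End",
}


def summarize_highlow(records):
    """Flatten the JSON records and keep only the columns named in COLUMNS."""
    df = pd.json_normalize(records)
    # reindex fills a missing column with NaN instead of raising a KeyError,
    # which is more forgiving if the API response changes shape.
    return df.reindex(columns=list(COLUMNS)).rename(columns=COLUMNS)
```

With the snippet above, this would be called as summarize_highlow(requests.get(url).json()).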
CodePudding user response:
You can easily pull the table data using Selenium with pandas this way:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://ropercenter.cornell.edu/presidential-approval/highslows'
driver.get(url)
table = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser and stop the driver process
data = pd.read_html(str(table), header=0)[0]  # first table on the page
# data.to_csv('table_data.csv', index=False)  # uncomment to save to a file
print(data)
Output:
President ... Polling Organization & Dates.1
0 Joe Biden ... Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1 Donald Trump ... PewJan 8th, 2021 - Jan 12th, 2021
2 Barack Obama ... Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3 George W. Bush ... American Research GroupFeb 16th, 2008 - Feb 19...
4 Bill Clinton ... Yankelovich Partners / TIME / CNNMay 26th, 199...
5 George H.W. Bush ... Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6 Ronald Reagan ... Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7 Jimmy Carter ... Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8 Gerald Ford ... Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9 Richard Nixon ... Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10 Lyndon B. Johnson ... Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11 John F. Kennedy ... Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12 Dwight D. Eisenhower ... Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13 Harry S. Truman ... Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14 Franklin D. Roosevelt ... Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]
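In this output the organization name and the polling dates run together in one cell (for example "PewJan 8th, 2021 - Jan 12th, 2021"). If they are needed separately, one possible approach is a regular expression that splits at the first month abbreviation followed by a day number. This is only a sketch: the function name is hypothetical, and the pattern assumes every date looks like the ones visible above (the months Oct, Nov and Dec do not appear in the output and are assumed to follow the same format).

```python
import re

# Month abbreviations as they appear in the output above; Oct, Nov and Dec
# are not shown there and are assumed to follow the same pattern.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_FUSED = re.compile(
    rf"^(?P<org>.+?)(?P<dates>(?:{_MONTHS}) \d{{1,2}}(?:st|nd|rd|th), \d{{4}} - .+)$"
)


def split_org_and_dates(cell):
    """Split a fused 'OrganizationMon Dth, YYYY - ...' cell into (org, dates)."""
    match = _FUSED.match(cell)
    if match is None:
        # Leave cells that do not fit the assumed pattern untouched.
        return cell, None
    return match.group("org").strip(), match.group("dates")
```

The trailing "..." in the printed frame is only pandas truncating the display; the underlying cells hold the full strings, so the helper could be applied with data["Polling Organization & Dates"].map(split_org_and_dates).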
CodePudding user response:
To extract the table contents using only Selenium, GeckoDriver and Firefox, you need to induce WebDriverWait for visibility_of_element_located() and then pass the table's outerHTML to pandas' read_html(). You can use the following locator strategy:
Code Block:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

s = Service('C:\\BrowserDrivers\\geckodriver.exe')
driver = webdriver.Firefox(service=s)
driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
tabledf = pd.read_html(tabledata)
print(tabledf)
Console Output:
[ President Highest % ... Lowest % Polling Organization & Dates.1
0 Joe Biden 63% ... 33% Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1 Donald Trump 49% ... 29% PewJan 8th, 2021 - Jan 12th, 2021
2 Barack Obama 76% ... 37% Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3 George W. Bush 92% ... 19% American Research GroupFeb 16th, 2008 - Feb 19...
4 Bill Clinton 73% ... 36% Yankelovich Partners / TIME / CNNMay 26th, 199...
5 George H.W. Bush 89% ... 29% Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6 Ronald Reagan 68% ... 35% Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7 Jimmy Carter 75% ... 28% Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8 Gerald Ford 71% ... 37% Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9 Richard Nixon 70% ... 23% Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10 Lyndon B. Johnson 80% ... 35% Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11 John F. Kennedy 83% ... 56% Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12 Dwight D. Eisenhower 78% ... 48% Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13 Harry S. Truman 87% ... 22% Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14 Franklin D. Roosevelt 84% ... 48% Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
[15 rows x 5 columns]]