Scraping non-interactable table from dynamic webpage

Time:04-15

I've seen a couple of posts with this same question, but their scripts usually wait until one of the elements (buttons) is clickable. Here is the table I'm trying to scrape:

https://ropercenter.cornell.edu/presidential-approval/highslows

On the first couple of tries, my code returned all the rows except both Polling Organization columns. Without my changing anything, it now scrapes only the table headers and the tbody tag (no table rows).

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)

driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns = ['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])

Should I use explicit wait? If so, which condition should I wait for since the dynamic table is not interactive?

Also, why did the output of my code change after running it multiple times?

CodePudding user response:

This may feel like cheating, but an easier way to solve your problem is to look at what the frontend does (using the browser's developer tools). You will find that it calls an API which returns JSON, so Selenium is not really needed. requests and pandas are enough.

import requests
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"

data = requests.get(url).json()
df = pd.json_normalize(data)  # replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
>>> df
                            president.id  president.active president.surname president.givenname president.shortname  ... low.approve  low.disapprove low.noOpinion low.sampleSize      low.presidentName
0   e9c0d19b-dfe9-49cf-9939-d06a0f256e57              True             Biden                 Joe                None  ...          33              53            13         1313.0              Joe Biden
1   bc9855d5-8e97-4448-b62e-1fb2865c79e6              True             Trump              Donald                None  ...          29              68             3         5360.0           Donald Trump
2   1c49881f-0f0c-4a53-9b2c-0dd6540f88e4              True             Obama              Barack                None  ...          37              57             5         1017.0           Barack Obama
3   ceda6415-5975-404d-8049-978758a7d1f8              True              Bush           George W.             W. Bush  ...          19              77             4         1100.0         George W. Bush
4   4f7344de-a7bd-4bc6-9147-87963ae51095              True           Clinton                Bill                None  ...          36              50            14          800.0           Bill Clinton
5   116721f1-f947-4c14-b0b5-d521ed5a4c8b              True              Bush         George H.W.           H.W. Bush  ...          29              60            11         1001.0       George H.W. Bush
6   43720f8f-0b9f-43b0-8c0d-63da059e7a57              True            Reagan              Ronald                None  ...          35              56             9         1555.0          Ronald Reagan
7   7aa76fd3-e1bc-4e9a-b13c-463a64e0c864              True            Carter               Jimmy                None  ...          28              59            13         1542.0           Jimmy Carter
8   6255dd77-531d-46c6-bb26-627e2a4b3654              True              Ford              Gerald                None  ...          37              39            24         1519.0            Gerald Ford
9   f1a23b06-4200-41e6-b137-dd46260ac4d8              True             Nixon             Richard                None  ...          23              55            22         1589.0          Richard Nixon
10  772aabfd-289b-4f10-aaae-81a82dd3dbc6              True           Johnson           Lyndon B.                None  ...          35              52            13         1526.0      Lyndon B. Johnson
11  d849b5a8-f711-4ac9-9728-c3915e17bb6a              True           Kennedy             John F.                None  ...          56              30            14         1550.0        John F. Kennedy
12  e22fd64a-cf20-4bc4-8db6-b4e71dc4483d              True        Eisenhower           Dwight D.                None  ...          48              36            16            NaN   Dwight D. Eisenhower
13  ab0bfa04-61da-49d1-8069-6992f6124f17              True            Truman            Harry S.                None  ...          22              65            13            NaN        Harry S. Truman
14  11edf04f-9d8d-4678-976d-b9339b46705d              True         Roosevelt         Franklin D.                None  ...          48              43             8            NaN  Franklin D. Roosevelt

[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
       'president.givenname', 'president.shortname', 'president.fullname',
       'president.number', 'president.terms', 'president.ratings',
       'president.termCount', 'president.ratingCount', 'high.id',
       'high.active', 'high.organization.id', 'high.organization.active',
       'high.organization.name', 'high.organization.ratingCount',
       'high.pollingStart', 'high.pollingEnd', 'high.updated',
       'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
       'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
       'low.organization.id', 'low.organization.active',
       'low.organization.name', 'low.organization.ratingCount',
       'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
       'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
       'low.presidentName'],
      dtype='object')
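
To get from the 41 flattened columns back to something like the table on the page, one option is to keep only a handful of them. The sketch below is a minimal example under assumptions: the column names come from the df.columns listing above, while the short labels, the helper name build_highs_lows, and the sample record (including its date strings) are made up for illustration so the snippet runs without a network call.

```python
import pandas as pd

# Keys are column names from the df.columns listing; the labels are illustrative.
COLUMNS = {
    "president.fullname": "President",
    "high.approve": "Highest %",
    "high.organization.name": "High: organization",
    "high.pollingStart": "High: start",
    "high.pollingEnd": "High: end",
    "low.approve": "Lowest %",
    "low.organization.name": "Low: organization",
    "low.pollingStart": "Low: start",
    "low.pollingEnd": "Low: end",
}


def build_highs_lows(records):
    """Flatten nested records and keep only the columns listed in COLUMNS."""
    df = pd.json_normalize(records)
    return df[list(COLUMNS)].rename(columns=COLUMNS)


# Made-up record with the same nesting as the API response (values are placeholders).
sample = [{
    "president": {"fullname": "Example President"},
    "high": {"approve": 70, "organization": {"name": "Example Poll A"},
             "pollingStart": "2000-01-01", "pollingEnd": "2000-01-03"},
    "low": {"approve": 30, "organization": {"name": "Example Poll B"},
            "pollingStart": "2001-02-01", "pollingEnd": "2001-02-04"},
}]

table = build_highs_lows(sample)
print(table)
```

With the real response, build_highs_lows(requests.get(url).json()) should give one row per president, provided the API still returns the fields listed above.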

CodePudding user response:

You can easily pull the table data using Selenium with pandas this way.

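    from io import StringIO

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from bs4 import BeautifulSoup
    from webdriver_manager.chrome import ChromeDriverManager


    # Selenium 4 takes the driver path through a Service object
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    url = 'https://ropercenter.cornell.edu/presidential-approval/highslows'
    driver.get(url)
    table = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    # Wrap the HTML in StringIO; newer pandas versions deprecate passing a literal HTML string
    data = pd.read_html(StringIO(str(table)), header=0)[0]
    # data.to_csv('table_data.csv', index=False)  # to save into a file
    print(data)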

Output:

               President  ...             Polling Organization & Dates.1
0               Joe Biden  ...  Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
1            Donald Trump  ...                  PewJan 8th, 2021 - Jan 12th, 2021
2            Barack Obama  ...  Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
3          George W. Bush  ...  American Research GroupFeb 16th, 2008 - Feb 19...
4            Bill Clinton  ...  Yankelovich Partners / TIME / CNNMay 26th, 199...
5        George H.W. Bush  ...  Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
6           Ronald Reagan  ...  Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
7            Jimmy Carter  ...  Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
8             Gerald Ford  ...  Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
9           Richard Nixon  ...   Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
10      Lyndon B. Johnson  ...  Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
11        John F. Kennedy  ...  Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
12   Dwight D. Eisenhower  ...  Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
13        Harry S. Truman  ...  Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
14  Franklin D. Roosevelt  ...  Gallup OrganizationAug 18th, 1939 - Aug 24th, ...

[15 rows x 5 columns]
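
In this output the organization name and the polling dates run together in one cell (for example "PewJan 8th, 2021 - Jan 12th, 2021"). One possible way to separate them is a regular expression. This sketch assumes the dates always start with a three-letter month abbreviation followed by a day with an ordinal suffix, as in the rows shown; split_org_and_dates is a hypothetical helper name, not something from the answer.

```python
import pandas as pd

# Assumption: dates begin with a three-letter month, as in the output above.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
PATTERN = (
    rf"^(?P<organization>.*?)"
    rf"(?P<dates>(?:{MONTHS}) \d{{1,2}}(?:st|nd|rd|th), \d{{4}} - .+)$"
)


def split_org_and_dates(column):
    """Split a fused 'organization + dates' column into two columns."""
    return column.str.extract(PATTERN)


# Cell values copied from the untruncated rows of the output above.
fused = pd.Series([
    "PewJan 8th, 2021 - Jan 12th, 2021",
    "Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011",
    "Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992",
])

parts = split_org_and_dates(fused)
print(parts)
```

Cells that do not match the pattern come back as NaN in both columns, which makes it easy to spot rows where the assumption about the date format does not hold.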

CodePudding user response:

Using only Selenium and GeckoDriver, you can extract the table contents by inducing WebDriverWait for visibility_of_element_located() and then reading the element's outerHTML into a pandas DataFrame. You can use the following locator strategy:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    import pandas as pd
    
    s = Service('C:\\BrowserDrivers\\geckodriver.exe')
    driver = webdriver.Firefox(service=s)
    driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
    tabledf = pd.read_html(tabledata)
    print(tabledf)
    
  • Console Output:

    [                President Highest %  ... Lowest %                     Polling Organization & Dates.1
    0               Joe Biden       63%  ...      33%  Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
    1            Donald Trump       49%  ...      29%                  PewJan 8th, 2021 - Jan 12th, 2021
    2            Barack Obama       76%  ...      37%  Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
    3          George W. Bush       92%  ...      19%  American Research GroupFeb 16th, 2008 - Feb 19...
    4            Bill Clinton       73%  ...      36%  Yankelovich Partners / TIME / CNNMay 26th, 199...
    5        George H.W. Bush       89%  ...      29%  Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
    6           Ronald Reagan       68%  ...      35%  Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
    7            Jimmy Carter       75%  ...      28%  Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
    8             Gerald Ford       71%  ...      37%  Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
    9           Richard Nixon       70%  ...      23%   Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
    10      Lyndon B. Johnson       80%  ...      35%  Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
    11        John F. Kennedy       83%  ...      56%  Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
    12   Dwight D. Eisenhower       78%  ...      48%  Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
    13        Harry S. Truman       87%  ...      22%  Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
    14  Franklin D. Roosevelt       84%  ...      48%  Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
    
    [15 rows x 5 columns]]
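
On the original question of which condition to wait for when nothing is clickable: implicitly_wait() only affects find_element calls, so reading driver.page_source right after it can capture the table before its rows are rendered, which would explain why the output varied between runs. Besides waiting for the table to be visible, another option is to wait until the rows themselves exist. The following is a minimal sketch of such a custom wait condition; the name table_has_rows and the selector "table tbody tr" are assumptions for illustration, and 15 is simply the row count seen in the output above.

```python
def table_has_rows(css_selector, minimum=1):
    """Build a wait condition that is truthy once enough rows are present."""
    def condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        rows = driver.find_elements("css selector", css_selector)
        # WebDriverWait keeps polling while the condition returns a falsy value
        return rows if len(rows) >= minimum else False
    return condition


# Intended use with a real driver (not executed here):
#   from selenium.webdriver.support.ui import WebDriverWait
#   rows = WebDriverWait(driver, 20).until(table_has_rows("table tbody tr", 15))
```

WebDriverWait.until() calls the condition with the driver repeatedly and returns its first truthy result, so once it returns the row elements the page source should contain the full table.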
    