I am working on a project to analyze the SuperCluster Astronaut Database. I am trying to scrape the data for each astronaut into a nice, clean pandas dataframe. There is plenty of descriptive information about each astronaut available directly on the listing page. However, when you click on an astronaut, more information is revealed: a couple of paragraphs of their biography. I would like to scrape that as well, but I need to automate the action of clicking the link and then scraping the data from the page I am routed to.
Here is my attempt at that so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)

bio_data = []
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()

tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
    data.append([name, bio_data])

cols = ['name', 'bio']
df = pd.DataFrame(data, columns=cols)
print(df)
I'm getting an error that reads:
InvalidSessionIdException: Message: invalid session id
Not sure how to resolve this issue. Can someone help point me in the right direction? Any help would be appreciated!
CodePudding user response:
InvalidSessionIdException
InvalidSessionIdException occurs when the given session id is not in the list of active sessions, which indicates that the session either does not exist or is no longer active.
This use case
In the posted code, driver.close() is called right after the page source is captured, yet the loop below it still calls driver.find_element_by_css_selector(), so those calls run against a session that has already been terminated. It is also possible that the Selenium-driven, ChromeDriver-initiated google-chrome headless browsing context is getting detected as a bot and the session is getting terminated.
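As a quick check, you could keep the session alive until every driver interaction is done, i.e. close the browser only after the loop. A minimal sketch of that ordering, reusing the question's own setup (selectors and the rest of the scraping logic unchanged):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get('https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order')
time.sleep(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

# ... do all driver.find_element_* / .click() work here,
# while the session is still active ...

driver.quit()  # end the session only once scraping is finished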
CodePudding user response:
Each link leads to the astronaut's individual page, which contains the bio data. So there is no need to click anything: collect each astronaut's URL from the listing page, then send a request to each individual page and scrape the bio from there.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)
time.sleep(5)

Name = []
bio = []

soup = BeautifulSoup(driver.page_source, 'lxml')

# names come straight from the listing page
for name in soup.select('.bau.astronaut_cell__title.bold.mr05'):
    Name.append(name.text)
    #print(name.text)

# each cell links to the astronaut's individual page (anchor class taken from the question's selector)
urls = soup.select('a.super_card__link_grid')
for url in urls:
    abs_url = 'https://www.supercluster.com' + url.get('href')
    print(abs_url)
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    driver.maximize_window()
    driver.get(abs_url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    # the first eight h4 blocks on the detail page hold the bio details
    for astro in soup.select('div.h4')[0:8]:
        bio.append(astro.text)

df = pd.DataFrame(data=list(zip(Name, bio)), columns=['name', 'bio'])
print(df)
Output:
name bio
0 Nield, George b. Jul 31, 1950
1 Kitchen, Jim Human
2 Lai, Gary Male
3 Hagle, Marc President Commercial Space Technologies
4 Hagle, Sharon b. Jul 31, 1950
.. ... ...
295 Wilcutt, Terrence Lead Operations Engineer
296 Linenger, Jerry b. Oct 1, 1975
297 Mukai, Chiaki Human
298 Thomas, Donald Male
299 Chiao, Leroy People's Liberation Army Air Force Data Missin...
[300 rows x 2 columns]
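If launching a new headless Chrome for every astronaut feels heavy, a lighter alternative is to fetch each detail page with requests and parse it with BeautifulSoup; only the listing page would still need Selenium. This is a rough sketch under the assumption that the bio text is present in each page's static HTML and does not need JavaScript to render:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # plain requests are sometimes blocked without a user agent

bio = []
for url in urls:  # the same list of <a> tags collected from the listing page above
    abs_url = 'https://www.supercluster.com' + url.get('href')
    resp = requests.get(abs_url, headers=headers, timeout=30)
    page = BeautifulSoup(resp.text, 'html.parser')
    # same selection as the Selenium version: first eight h4 blocks on the detail page
    bio.append(' '.join(h.text for h in page.select('div.h4')[0:8]))
Joining the eight fields into a single string per page also keeps bio the same length as Name, so zip(Name, bio) pairs each astronaut with their own details, one row per astronaut.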