I am working on a project to analyze the SuperCluster Astronaut Database. I am trying to scrape the data for each astronaut into a nice, clean pandas dataframe. There is plenty of descriptive information about each astronaut available directly on the listing page. However, when you click on an astronaut, more information is revealed: a couple of paragraphs of their biography. I would like to scrape that as well, but I need to automate the action of clicking the link and then scraping the data from the page I am routed to.
Here is my attempt at that so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)
time.sleep(10)

bio_data = []
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()

tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    for i in name:
        btn = driver.find_element_by_css_selector('cb.super_card__link_grid').click()
        bio = item.select_one('px1.pb1').get_text()
        bio_data.append([bio])
    data.append([name, bio_data])

cols = ['name', 'bio']
df = pd.DataFrame(data, columns=cols)
print(df)
I'm getting an error that reads:
InvalidSessionIdException: Message: invalid session id
Not sure how to resolve this issue. Can someone help point me in the right direction? Any help would be appreciated!
CodePudding user response:
InvalidSessionIdException
InvalidSessionIdException occurs when the given session id is not in the list of active sessions, which indicates that the session either does not exist or is no longer active.
This use case
In the posted code, driver.close() is called right after the page source is captured, yet the loop below it still calls driver.find_element_by_css_selector(), so those calls run against a session that has already been terminated. It is also possible that the Selenium-driven, ChromeDriver-initiated google-chrome headless browsing context is getting detected as a bot and the session is getting terminated.
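As a quick check, you could keep the session alive until every driver interaction is done, i.e. close the browser only after the loop. A minimal sketch of that ordering, reusing the question's own setup (selectors and the rest of the scraping logic unchanged):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get('https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order')
time.sleep(10)

soup = BeautifulSoup(driver.page_source, 'lxml')

# ... do all driver.find_element_* / .click() work here,
# while the session is still active ...

driver.quit()  # end the session only once scraping is finished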
CodePudding user response:
Each link leads to the astronaut's individual page, which contains the bio data. So there is no need to click anything: collect each astronaut's URL from the listing page, then send a request to each individual page and scrape the bio from there.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch order'

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.maximize_window()
driver.get(url)
time.sleep(5)

Name = []
bio = []

soup = BeautifulSoup(driver.page_source, 'lxml')

# names come straight from the listing page
for name in soup.select('.bau.astronaut_cell__title.bold.mr05'):
    Name.append(name.text)
    #print(name.text)

# each cell links to the astronaut's individual page (anchor class taken from the question's selector)
urls = soup.select('a.super_card__link_grid')
for url in urls:
    abs_url = 'https://www.supercluster.com' + url.get('href')
    print(abs_url)
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    driver.maximize_window()
    driver.get(abs_url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    # the first eight h4 blocks on the detail page hold the bio details
    for astro in soup.select('div.h4')[0:8]:
        bio.append(astro.text)

df = pd.DataFrame(data=list(zip(Name, bio)), columns=['name', 'bio'])
print(df)
Output:
name bio
0 Nield, George b. Jul 31, 1950
1 Kitchen, Jim Human
2 Lai, Gary Male
3 Hagle, Marc President Commercial Space Technologies
4 Hagle, Sharon b. Jul 31, 1950
.. ... ...
295 Wilcutt, Terrence Lead Operations Engineer
296 Linenger, Jerry b. Oct 1, 1975
297 Mukai, Chiaki Human
298 Thomas, Donald Male
299 Chiao, Leroy People's Liberation Army Air Force Data Missin...
[300 rows x 2 columns]
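If launching a new headless Chrome for every astronaut feels heavy, a lighter alternative is to fetch each detail page with requests and parse it with BeautifulSoup; only the listing page would still need Selenium. This is a rough sketch under the assumption that the bio text is present in each page's static HTML and does not need JavaScript to render:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # plain requests are sometimes blocked without a user agent

bio = []
for url in urls:  # the same list of <a> tags collected from the listing page above
    abs_url = 'https://www.supercluster.com' + url.get('href')
    resp = requests.get(abs_url, headers=headers, timeout=30)
    page = BeautifulSoup(resp.text, 'html.parser')
    # same selection as the Selenium version: first eight h4 blocks on the detail page
    bio.append(' '.join(h.text for h in page.select('div.h4')[0:8]))
Joining the eight fields into a single string per page also keeps bio the same length as Name, so zip(Name, bio) pairs each astronaut with their own details, one row per astronaut.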