I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name, country])
df = pd.DataFrame(data)
df
df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. Function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? Seems like it's a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
CodePudding user response:
The page is loaded dynamically with JavaScript, so requests can't get to the data directly. The data is fetched from another address and arrives in JSON format. You can get to it this way:
import json
import requests

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the data into a pandas DataFrame or whatever structure you need.
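For instance, a minimal sketch of the DataFrame step, using the field names shown in the sample output above (a small inline sample stands in for the live request; the real feed may contain more fields):

```python
import pandas as pd

# Stand-in for: data = requests.get(url).json()
data = {
    "astronauts": [
        {"astroNumber": 1, "firstName": "Yuri", "lastName": "Gagarin", "rank": "Colonel"},
        {"astroNumber": 10, "firstName": "Walter", "lastName": "Schirra", "rank": "Captain"},
    ]
}

# Each dict in the list becomes one row; keys become columns.
df = pd.DataFrame(data["astronauts"])
print(df[["astroNumber", "lastName", "rank"]])
```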
CodePudding user response:
The url's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data on its own. You can pair it with an automation tool such as Selenium, which is what I do here. Please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the JavaScript-rendered content time to load
soup = BeautifulSoup(driver.page_source, 'lxml')
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    data.append([name, country])
cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)
print(df)
driver.quit()
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
67 Hopkins, Michael United States of America
68 Ryazansky, Sergey Russia
69 Yaping, Wang China
70 Xiaoguang, Zhang China
71 Parmitano, Luca Italy
[72 rows x 2 columns]