I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')
for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name, country])
df = pd.DataFrame(data)
df
df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. Function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? Seems like it's a different beast every time I try to tackle these kinds of problems.
Any help would be appreciated! Thank you!
CodePudding user response:
The page is loaded dynamically with JavaScript, so requests can't get to the data directly. The data is fetched from another address and arrives in JSON format. You can get to it this way:
import json
import requests

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)
Once you have it loaded, you can iterate through it and retrieve relevant information. For example:
for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])
Output:
1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General
etc.
You can then load the data into a pandas DataFrame or whatever structure you need.
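For instance, a minimal sketch of the DataFrame step, using the field names shown in the sample output above (a small inline sample stands in for the live request; the real feed may contain more fields):

```python
import pandas as pd

# Stand-in for: data = requests.get(url).json()
data = {
    "astronauts": [
        {"astroNumber": 1, "firstName": "Yuri", "lastName": "Gagarin", "rank": "Colonel"},
        {"astroNumber": 10, "firstName": "Walter", "lastName": "Schirra", "rank": "Captain"},
    ]
}

# Each dict in the list becomes one row; keys become columns.
df = pd.DataFrame(data["astronauts"])
print(df[["astroNumber", "lastName", "rank"]])
```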
CodePudding user response:
The url's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data on its own. You can pair it with an automation tool such as Selenium, which is what I do here. Please just run the code.
Script:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
data = []
url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch order'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the JavaScript-rendered content time to load
soup = BeautifulSoup(driver.page_source, 'lxml')
tags = soup.select('.astronaut_cell.x')
for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    data.append([name, country])
cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)
print(df)
driver.quit()
Output:
name country
0 Bess, Cameron United States of America
1 Bess, Lane United States of America
2 Dick, Evan United States of America
3 Taylor, Dylan United States of America
4 Strahan, Michael United States of America
.. ... ...
67 Hopkins, Michael United States of America
68 Ryazansky, Sergey Russia
69 Yaping, Wang China
70 Xiaoguang, Zhang China
71 Parmitano, Luca Italy
[72 rows x 2 columns]