Can't find a table using Beautiful soup

Time:11-25

I'm new to using Beautiful Soup for web scraping. I'm trying to extract a table from https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all, but it's not working and I can't figure out why. Here's what I did:

import requests
from bs4 import BeautifulSoup

url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"

r = requests.get(url)  # fetch the HTML
soup = BeautifulSoup(r.content)  # parse the text as HTML
table = soup.find("table",{"id":"theDataTable","class":"display dataTable no-footer"}) 

It can't find the table. Why is that?

CodePudding user response:

The table's data is inside a <script> tag, assigned to the JavaScript variable tableData1. You need to pull it out of the script and parse it as JSON.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

url = 'https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script')
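for script in scripts:
    if 'var tableData1' in str(script):
        # Keep everything after the assignment
        jsonStr = str(script).split('var tableData1 = ', 1)[-1]

        # Trim trailing JavaScript one ';' at a time until the rest parses as JSON
        while True:
            try:
                jsonData = json.loads(jsonStr)
                break
            except json.JSONDecodeError:
                if ';' not in jsonStr:
                    raise  # nothing left to trim, so give up instead of looping forever
                jsonStr = jsonStr.rsplit(';', 1)[0]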

df = pd.DataFrame(jsonData)
df.columns = ['Conditions','Studies']
df['Conditions'] = [BeautifulSoup(x, 'html.parser').text for x in list(df['Conditions'])]

Output:

                                             Conditions Studies
0                                ACTH Syndrome, Ectopic       8
1                      ACTH-Secreting Pituitary Adenoma      62
2     ACTH-independent Macronodular Adrenal Hyperplasia       2
3                              ADCY5-related Dyskinesia       2
4                                         ADNP Syndrome       2
                                                ...     ...
5653                46, XX Disorders of Sex Development      58
5654                                    47 XXX Syndrome       2
5655                                   47, XYY Syndrome       3
5656                            5-Nucleotidase Syndrome       1
5657                                       5q- Syndrome       1

[5658 rows x 2 columns]
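
The trim-and-retry loop can also be replaced with json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows it. This is only a minimal sketch: it assumes the page embeds the data as `var tableData1 = [...]`, and the sample_html string and the extract_js_array helper are made up for illustration.

```python
import json

# Hypothetical stand-in for the page source; the real page is much larger.
sample_html = """
<script>
var tableData1 = [["<a href='/x'>ACTH Syndrome, Ectopic</a>", "8"], ["<a href='/y'>ADNP Syndrome</a>", "2"]];
var other = 1;
</script>
"""


def extract_js_array(source, var_name):
    """Return the JSON value assigned to `var_name` in `source` (hypothetical helper)."""
    marker = f"var {var_name} = "
    # str.index raises ValueError if the variable is not in the source
    start = source.index(marker) + len(marker)
    # raw_decode returns (value, end_index); trailing JavaScript such as ';' is ignored
    value, _end = json.JSONDecoder().raw_decode(source, start)
    return value


rows = extract_js_array(sample_html, "tableData1")
print(len(rows))
```

On the real page you would pass response.text instead of sample_html, then build the DataFrame from the returned list as before.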

CodePudding user response:

Here is working code:

import requests
from bs4 import BeautifulSoup

url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"

r = requests.get(url) # Fetch the page
soup = BeautifulSoup(r.content, "html.parser") # Parse the page in HTML
table = soup.find("table", { "id": "theDataTable", "class": ["display", "dataTable", "no-footer"]})

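The problem in your code is the soup.find(...) call, where you passed "class":"display dataTable no-footer" rather than "class": ["display", "dataTable", "no-footer"].

BeautifulSoup compares a single string with the exact value of the class attribute, whereas a list matches any tag that carries at least one of the listed classes. The dataTable and no-footer classes are normally added in the browser by the DataTables JavaScript library, so the HTML that requests receives probably lacks them. In that case the exact string cannot match, while the list still matches on "display".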
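
The difference between passing a string and passing a list is easy to check on a small fragment. This is a minimal sketch on a made-up HTML snippet (the two tables and their ids are hypothetical, not taken from the page): a string is compared with the full class attribute, while a list matches a tag that has any one of the listed classes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one table as the server might send it,
# one as it might look after JavaScript has added extra classes.
html = """
<table id="static" class="display"></table>
<table id="rendered" class="display dataTable no-footer"></table>
"""
soup = BeautifulSoup(html, "html.parser")

# A single string must equal the whole class attribute value
exact = soup.find_all("table", {"class": "display dataTable no-footer"})
# A list matches tags that carry at least one of the listed classes
any_of = soup.find_all("table", {"class": ["display", "dataTable", "no-footer"]})

exact_ids = [t["id"] for t in exact]
any_ids = [t["id"] for t in any_of]
print(exact_ids)
print(any_ids)
```

Only the "rendered" table matches the string form, while both tables match the list form.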

You will notice that I also added "html.parser" as the second argument to the BeautifulSoup(...) constructor. It is not mandatory, but passing it avoids the GuessedAtParserWarning: No parser was explicitly specified,[...] warning that BeautifulSoup would otherwise emit.

