I'm new to using Beautiful Soup for web scraping. I'm trying to extract a table from https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all but it's not working and I can't figure out why. Here's what I did:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"
r = requests.get(url)  # fetch the HTML
soup = BeautifulSoup(r.content)  # parse this text as HTML
table = soup.find("table",{"id":"theDataTable","class":"display dataTable no-footer"})
It can't find the table. Why is that?
CodePudding user response:
The data is inside a <script> tag. You need to pull it out and parse it.
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The table rows are assigned to a JavaScript variable inside a <script> tag
scripts = soup.find_all('script')
for script in scripts:
    if 'var tableData1' in str(script):
        # Keep everything after the assignment
        jsonStr = str(script).split('var tableData1 = ', 1)[-1]
        # Drop trailing JavaScript at each ';' until the remainder parses as JSON
        while True:
            try:
                jsonData = json.loads(jsonStr)
                break
            except json.JSONDecodeError:
                jsonStr = jsonStr.rsplit(';', 1)[0]

df = pd.DataFrame(jsonData)
df.columns = ['Conditions', 'Studies']
# Strip any HTML markup from the condition names
df['Conditions'] = [BeautifulSoup(x, 'html.parser').text for x in df['Conditions']]
print(df)
Output:
      Conditions                                         Studies
0     ACTH Syndrome, Ectopic                                   8
1     ACTH-Secreting Pituitary Adenoma                        62
2     ACTH-independent Macronodular Adrenal Hyperplasia        2
3     ADCY5-related Dyskinesia                                 2
4     ADNP Syndrome                                            2
...   ...                                                    ...
5653  46, XX Disorders of Sex Development                     58
5654  47 XXX Syndrome                                          2
5655  47, XYY Syndrome                                         3
5656  5-Nucleotidase Syndrome                                  1
5657  5q- Syndrome                                             1
[5658 rows x 2 columns]
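As a possible alternative to trimming at each ';', the standard library's json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The sketch below assumes, as the code above does, that the JavaScript literal is valid JSON. The helper name extract_js_json and the sample markup are hypothetical and only stand in for the real page.

```python
import json
import re


def extract_js_json(page_text, var_name):
    """Return the JSON value assigned to `var_name` in a `var name = ...` statement."""
    match = re.search(rf"var\s+{re.escape(var_name)}\s*=\s*", page_text)
    if match is None:
        raise ValueError(f"{var_name!r} not found in page text")
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where the value ends.
    value, _end = json.JSONDecoder().raw_decode(page_text, match.end())
    return value


# Hypothetical miniature of the page, for illustration only
sample = """
<script>
var tableData1 = [["<a href='/x'>ACTH Syndrome, Ectopic</a>", 8], ["<a href='/y'>ADNP Syndrome</a>", 2]];
var somethingElse = 1;
</script>
"""

rows = extract_js_json(sample, "tableData1")
print(rows)
```

With the real page, the text of the matching script tag would be passed in place of sample, and the result could go to pd.DataFrame as before.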
CodePudding user response:
Here is working code:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/ct2/search/browse?brwse=cond_alpha_all"
r = requests.get(url) # Fetch the page
soup = BeautifulSoup(r.content, "html.parser") # Parse the page in HTML
table = soup.find("table", { "id": "theDataTable", "class": ["display", "dataTable", "no-footer"]})
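What is not working in your code is the soup.find(...) statement, where you wrote "class":"display dataTable no-footer" instead of "class": ["display", "dataTable", "no-footer"].
When BeautifulSoup is given a single string containing several class names, it compares that string with the exact value of the class attribute, so the search fails unless the attribute is written exactly that way in the HTML that requests downloaded. When it is given a list of strings, it matches any tag that has at least one of the listed classes.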
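To see how the different ways of passing class behave, here is a small sketch on hypothetical markup (not the real page): one table with a single class and one with all three. It compares a single string, a list, and a CSS selector passed to select(), which requires every listed class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the ids "served" and "rendered" are made up
html = """
<table id="served" class="display"></table>
<table id="rendered" class="display dataTable no-footer"></table>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Collect the id attribute of each matched tag."""
    return [tag["id"] for tag in tags]


# Single string: compared with the exact value of the class attribute
exact = ids(soup.find_all("table", {"class": "display dataTable no-footer"}))
# List: matches tags that have at least one of the listed classes
any_of = ids(soup.find_all("table", {"class": ["display", "dataTable", "no-footer"]}))
# CSS selector: matches tags that have all of the listed classes
all_of = ids(soup.select("table.display.dataTable.no-footer"))

print(exact)   # ['rendered']
print(any_of)  # ['served', 'rendered']
print(all_of)  # ['rendered']
```

Note that the list form also matches the table that only has the class display, so a successful find does not by itself mean the tag has all three classes.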
You will notice that I also added "html.parser" as the second argument to the BeautifulSoup(...) constructor. It is not mandatory, but it is better to specify it to avoid the GuessedAtParserWarning: No parser was explicitly specified,[...] warning that BeautifulSoup would otherwise emit.