I am using the following code to get a table from this page. However, I want to include an additional column that parses out the URL from the 'href' tag in the final column called Volume Page. This is my first attempt at using python for web scraping, so any assistance would be appreciated.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications'
r = requests.get(url)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
cols = row.find_all('td')
cols = [item.text.strip() for item in cols]
output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['Last Name', 'First Name','Effective Date','End Term','FR Date', 'Volume Page'])
df = df.iloc[1:]
print(a)
print(df)
CodePudding user response:
Another version:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select("table")[-1]
df = pd.read_html(str(table))[0]
df["URL"] = [tr.a["href"] if tr.a else "-" for tr in table.select("tr:has(td)")]
print(df.to_markdown())
Prints:
Last Name | First & Middle Names | Effective Date | End/Term of Debarment | FR Date.txt (MM/DD/YY) | Volume Page.pdf | URL | |
---|---|---|---|---|---|---|---|
0 | Akhigbe | Ehigiator O. | 12/17/2010 | 25 Year% | 12/17/2010 | 75 FR 79005 | https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order |
1 | Albanese | Anthony W. | 11/23/2009 | Permanent^ | 11/23/2009 | 74 FR 61151 | https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order |
2 | Aminzada | Mirwaiss | 06/22/2015 | Permanent^ | 06/22/2015 | 80 FR 35652 | https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order |
3 | Anthony | James Michael | 11/07/1997 | Permanent^ | 11/07/1997 | 62 FR 60249 | https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf |
4 | Azeem | Mohammed | 04/26/1993 | Permanent^ | 04/26/1993 | 58 FR 21982~ | - |
5 | nan | nan | nan | FR Correction | 05/05/1993 | 58 FR 26814~ | - |
6 | Babich | Michael L. | 11/27/2020 | Permanent^ | 11/27/2020 | 85 FR 76084 | https://www.federalregister.gov/documents/2020/11/27/2020-26226/michael-l-babich-final-debarment-order |
7 | Bae | Kun Chae | 12/30/1993 | Permanent^* | 12/30/1993 | 58 FR 69368~ |
...
CodePudding user response:
Simply append()
the url if available to your cols
:
...
for row in rows:
cols = row.find_all('td')
cols = [item.text.strip() for item in cols]
if row.a:
cols.append(row.a.get('href'))
else:
cols.append('None')
output.append(cols)
...
Output:
Last Name | First Name | Effective Date | End Term | FR Date | Volume Page | URL | |
---|---|---|---|---|---|---|---|
1 | Akhigbe | Ehigiator O. | 12/17/2010 | 25 Year% | 12/17/2010 | 75 FR 79005 | https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order |
2 | Albanese | Anthony W. | 11/23/2009 | Permanent^ | 11/23/2009 | 74 FR 61151 | https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order |
3 | Aminzada | Mirwaiss | 06/22/2015 | Permanent^ | 06/22/2015 | 80 FR 35652 | https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order |
4 | Anthony | James Michael | 11/07/1997 | Permanent^ | 11/07/1997 | 62 FR 60249 | https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf |
5 | Azeem | Mohammed | 04/26/1993 | Permanent^ | 04/26/1993 | 58 FR 21982~ | None |
6 | FR Correction | 05/05/1993 | 58 FR 26814~ | None |