Home > Blockchain >  Append URL from HREF tag to additional column python
Append URL from HREF tag to additional column python

Time:04-22

I am using the following code to get a table from this page. However, I want to include an additional column that parses out the URL from the 'href' tag in the final column called Volume Page. This is my first attempt at using python for web scraping, so any assistance would be appreciated.

    import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications'
r = requests.get(url)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])

df = pd.DataFrame(output, columns = ['Last Name', 'First Name','Effective Date','End Term','FR Date', 'Volume Page'])
df = df.iloc[1:]
print(a)
print(df)

CodePudding user response:

Another version:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select("table")[-1]
df = pd.read_html(str(table))[0]

df["URL"] = [tr.a["href"] if tr.a else "-" for tr in table.select("tr:has(td)")]

print(df.to_markdown())

Prints:

Last Name First & Middle Names Effective Date End/Term of Debarment FR Date.txt (MM/DD/YY) Volume Page.pdf URL
0 Akhigbe Ehigiator O. 12/17/2010 25 Year% 12/17/2010 75 FR 79005 https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order
1 Albanese Anthony W. 11/23/2009 Permanent^ 11/23/2009 74 FR 61151 https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order
2 Aminzada Mirwaiss 06/22/2015 Permanent^ 06/22/2015 80 FR 35652 https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order
3 Anthony James Michael 11/07/1997 Permanent^ 11/07/1997 62 FR 60249 https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf
4 Azeem Mohammed 04/26/1993 Permanent^ 04/26/1993 58 FR 21982~ -
5 nan nan nan FR Correction 05/05/1993 58 FR 26814~ -
6 Babich Michael L. 11/27/2020 Permanent^ 11/27/2020 85 FR 76084 https://www.federalregister.gov/documents/2020/11/27/2020-26226/michael-l-babich-final-debarment-order
7 Bae Kun Chae 12/30/1993 Permanent^* 12/30/1993 58 FR 69368~

...

CodePudding user response:

Simply append() the url if available to your cols:

...
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    if row.a:
        cols.append(row.a.get('href'))
    else:
        cols.append('None')
    output.append(cols)
...

Output:

Last Name First Name Effective Date End Term FR Date Volume Page URL
1 Akhigbe Ehigiator O. 12/17/2010 25 Year% 12/17/2010 75 FR 79005 https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order
2 Albanese Anthony W. 11/23/2009 Permanent^ 11/23/2009 74 FR 61151 https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order
3 Aminzada Mirwaiss 06/22/2015 Permanent^ 06/22/2015 80 FR 35652 https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order
4 Anthony James Michael 11/07/1997 Permanent^ 11/07/1997 62 FR 60249 https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf
5 Azeem Mohammed 04/26/1993 Permanent^ 04/26/1993 58 FR 21982~ None
6 FR Correction 05/05/1993 58 FR 26814~ None
  • Related