Append URL from HREF tag to additional column python-CodePudding

I am using the following code to get a table from this page. However, I want to include an additional column that parses out the URL from the 'href' tag in the final column called Volume Page. This is my first attempt at using python for web scraping, so any assistance would be appreciated.

    import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications'
r = requests.get(url)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])

df = pd.DataFrame(output, columns = ['Last Name', 'First Name','Effective Date','End Term','FR Date', 'Volume Page'])
df = df.iloc[1:]
print(a)
print(df)

CodePudding user response：

Another version:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/compliance-actions-and-activities/fda-debarment-list-drug-product-applications"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select("table")[-1]
df = pd.read_html(str(table))[0]

df["URL"] = [tr.a["href"] if tr.a else "-" for tr in table.select("tr:has(td)")]

print(df.to_markdown())

Prints:

	Last Name	First & Middle Names	Effective Date	End/Term of Debarment	FR Date.txt (MM/DD/YY)	Volume Page.pdf	URL
0	Akhigbe	Ehigiator O.	12/17/2010	25 Year%	12/17/2010	75 FR 79005	https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order
1	Albanese	Anthony W.	11/23/2009	Permanent^	11/23/2009	74 FR 61151	https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order
2	Aminzada	Mirwaiss	06/22/2015	Permanent^	06/22/2015	80 FR 35652	https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order
3	Anthony	James Michael	11/07/1997	Permanent^	11/07/1997	62 FR 60249	https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf
4	Azeem	Mohammed	04/26/1993	Permanent^	04/26/1993	58 FR 21982~	-
5	nan	nan	nan	FR Correction	05/05/1993	58 FR 26814~	-
6	Babich	Michael L.	11/27/2020	Permanent^	11/27/2020	85 FR 76084	https://www.federalregister.gov/documents/2020/11/27/2020-26226/michael-l-babich-final-debarment-order
7	Bae	Kun Chae	12/30/1993	Permanent^*	12/30/1993	58 FR 69368~

...

CodePudding user response：

Simply append() the url if available to your cols:

...
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    if row.a:
        cols.append(row.a.get('href'))
    else:
        cols.append('None')
    output.append(cols)
...

Output:

	Last Name	First Name	Effective Date	End Term	FR Date	Volume Page	URL
1	Akhigbe	Ehigiator O.	12/17/2010	25 Year%	12/17/2010	75 FR 79005	https://www.federalregister.gov/documents/2010/12/17/2010-31776/ehigiator-o-akhigbe-debarment-order
2	Albanese	Anthony W.	11/23/2009	Permanent^	11/23/2009	74 FR 61151	https://www.federalregister.gov/documents/2009/11/23/E9-28084/anthony-w-albanese-debarment-order
3	Aminzada	Mirwaiss	06/22/2015	Permanent^	06/22/2015	80 FR 35652	https://www.federalregister.gov/documents/2015/06/22/2015-15162/mirwaiss-aminzada-debarment-order
4	Anthony	James Michael	11/07/1997	Permanent^	11/07/1997	62 FR 60249	https://www.gpo.gov/fdsys/pkg/FR-1997-11-07/pdf/97-29399.pdf
5	Azeem	Mohammed	04/26/1993	Permanent^	04/26/1993	58 FR 21982~	None
6				FR Correction	05/05/1993	58 FR 26814~	None