Beautiful Soup: Extract text at the a anchor after url

Time:03-18

I have some HTML in which each <a> tag's href (the URL) comes before the title text that appears on the page. I am trying to extract both the title and the URL into a data frame. The following code is what I have so far.

import requests
import pandas as pd  # needed for the pd.DataFrame call below
from bs4 import BeautifulSoup

url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find_all("div", class_="file-title")
print(results)

pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])

As it stands, I only have the one column. I would like the results to be in the following format:

Title URL
application URL1
assignee URL2
... ...

I was following this page on Real Python, but I have come to a standstill since I cannot seem to translate their next part to my needs.

Any help with this would be wonderful. Thank you in advance for your help.

EDIT 1: I have made some edits to the original question. I want to expand it to also include, in a second column, the URL that each title is attached to. I have also incorporated the code that was provided in the first answer.

Answer:

Just call .text on the <a> inside each <div> to print the title:

for e in soup.find_all("div", class_="file-title"):
    print(e.a.text)

or with css selector:

for a in soup.select('.file-title a'):
    print(a.text)
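
The two loops above should pick out the same anchors. As a minimal sketch, the snippet below runs both on a small, hypothetical HTML string (it is an assumption that mimics the "file-title" structure, not the real page). It also guards against a <div> that has no <a>, since e.a is None in that case and e.a.text would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the "file-title" structure; not the real page.
html = """
<div class="file-title"><a href="/files/application.zip">application</a></div>
<div class="file-title"><a href="/files/assignee.zip">assignee</a></div>
<div class="file-title">no link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all: e.a is None when the div has no <a>, so guard before reading .text
titles_find = [e.a.text for e in soup.find_all("div", class_="file-title") if e.a]

# CSS selector: only <a> tags inside .file-title match, so no guard is needed
titles_css = [a.text for a in soup.select(".file-title a")]

print(titles_find)
print(titles_css)
```

The CSS selector version is slightly more forgiving here, because a <div> without a link simply produces no match.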

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")

for e in soup.find_all("div", class_="file-title"):
    print(e.a.text)
Output:
application
assignee
botanic
cpc_current
cpc_group
cpc_subgroup
cpc_subsection
figures
...

Or as DataFrame

pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])
Output:
Title
application
assignee
botanic
cpc_current
cpc_group
cpc_subgroup
cpc_subsection
figures
foreigncitation
foreign_priority
government_interest
government_organization
inventor
ipcr
lawyer
location
mainclass
mainclass_current

EDIT

Based on the comment, to get both "Title" and "Url":

data = []
for a in soup.select('.file-title a'):
    data.append({
        'Title':a.text,
        'Url':a['href']
    })
pd.DataFrame(data)
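
If the href values on the page turn out to be relative (that is an assumption; they may already be absolute), they can be resolved against the page URL with urllib.parse.urljoin, which leaves absolute URLs unchanged. A rough sketch on hypothetical inline HTML, with made-up file paths:

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://patentsview.org/download/data-download-tables"

# Hypothetical markup; the real page's href values may differ.
html = """
<div class="file-title"><a href="/files/application.zip"> application </a></div>
<div class="file-title"><a href="https://example.com/assignee.zip">assignee</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = [
    {
        # get_text(strip=True) trims surrounding whitespace from the title
        "Title": a.get_text(strip=True),
        # urljoin resolves relative hrefs and keeps absolute ones as they are
        "Url": urljoin(base_url, a["href"]),
    }
    # a[href] skips any anchor that has no href attribute
    for a in soup.select(".file-title a[href]")
]

df = pd.DataFrame(rows)
print(df)
```

Building the list of dicts first and passing it to pd.DataFrame once keeps the column names ("Title", "Url") in one place.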