I have some HTML where the URL in the a href comes before the title that appears on the page. I am trying to extract both the title and the URL into a data frame. The following code is what I have so far.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find_all("div", class_="file-title")
print(results)
pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])
As it stands, I only have the one column. I would like the results to be in the following format:
Title | URL |
---|---|
application | URL1 |
assignee | URL2 |
... | ... |
I was following this page on Real Python, but I have come to a standstill since I cannot seem to translate their next part into my needs.
Any help with this would be wonderful. Thank you in advance for your help.
EDIT 1: I have made some edits to the original question. I want to expand it to also include, in a second column, the URL that each title is attached to. I have also incorporated the code that was provided in the first answer.
CodePudding user response:
Just call .text on the <a> in each <div> to print your information:
for e in soup.find_all("div", class_="file-title"):
    print(e.a.text)
or with a CSS selector:
for a in soup.select('.file-title a'):
    print(a.text)
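To see that both forms pick up the same titles without hitting the network, here is a minimal sketch run against a small inline HTML snippet. The markup and file paths are made up to mimic the structure discussed above, and the guard for a div without a link is an extra precaution, not something the real page is known to need.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure discussed above
html = """
<div class="file-title"><a href="/files/application.zip">application</a></div>
<div class="file-title"><a href="/files/assignee.zip">assignee</a></div>
<div class="file-title">no link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns the <div> elements, so step down to the first <a> in each;
# e.a is None when a div contains no link, hence the guard
titles_find_all = [
    e.a.text
    for e in soup.find_all("div", class_="file-title")
    if e.a is not None
]

# the CSS selector returns the matching <a> elements directly
titles_select = [a.text for a in soup.select(".file-title a")]

print(titles_find_all)
print(titles_select)
```

Without the guard, e.a.text would raise an AttributeError on the third div, which is one reason the selector form can be a little more forgiving.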
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://patentsview.org/download/data-download-tables'
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
for e in soup.find_all("div", class_="file-title"):
    print(e.a.text)
Output
application
assignee
botanic
cpc_current
cpc_group
cpc_subgroup
cpc_subsection
figures
...
Or as a DataFrame
pd.DataFrame([a.text for a in soup.select('.file-title a')], columns=['Title'])
Output:
Title |
---|
application |
assignee |
botanic |
cpc_current |
cpc_group |
cpc_subgroup |
cpc_subsection |
figures |
foreigncitation |
foreign_priority |
government_interest |
government_organization |
inventor |
ipcr |
lawyer |
location |
mainclass |
mainclass_current |
EDIT
Based on the comment, to get both "Title" and "Url":
data = []
for a in soup.select('.file-title a'):
    data.append({
        'Title': a.text,
        'Url': a['href']
    })
pd.DataFrame(data)
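If some of the href values turn out to be relative (an assumption here, not something checked against the live page), they can be resolved against the page URL with urllib.parse.urljoin. Below is a minimal sketch on made-up inline markup; the file paths and the example.com link are hypothetical.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://patentsview.org/download/data-download-tables"

# Hypothetical markup; the hrefs are invented for illustration
html = """
<div class="file-title"><a href="/files/application.zip">application</a></div>
<div class="file-title"><a href="https://example.com/assignee.zip">assignee</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

data = [
    {
        "Title": a.get_text(strip=True),
        # urljoin resolves relative hrefs and leaves absolute URLs untouched
        "Url": urljoin(base_url, a["href"]),
    }
    # the [href] part skips anchors that have no href attribute
    for a in soup.select(".file-title a[href]")
]

df = pd.DataFrame(data, columns=["Title", "Url"])
print(df)
```

Passing columns explicitly keeps the column order fixed even if the list happens to be empty.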