I am very new to programming and am learning by making a web scraper for my BoardGameGeek account. I am attempting to scrape a table and convert it into a DataFrame.
I am able to pull the column headers in as a list, but the data from the table comes out as one long flat list. Since there are 8 headers but 483 items in that list, building the DataFrame does not work.
My question is: how do I break up the data so that it fits in the DataFrame?
Code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Create an URL object
url = 'https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1'
# Create object page
pages = requests.get(url)
pages.text
# parser-lxml = Change html to Python friendly format
soup = BeautifulSoup(pages.text, 'lxml')
#Access <tbody> tag
table = soup.table
# Obtain information from tag <table>
table1 = soup.find('table', id='collectionitems')
table1
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
info = []
for i in table1.find_all('td'):
    stats = i.text
    info.append(stats)
total = pd.DataFrame(data=headers, columns=info)
CodePudding user response:
pandas.read_html() is a quick and convenient way to scrape data from HTML tables and is sufficient in most cases. Simply select your table by its id attribute:
import pandas as pd
pd.read_html('https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1', attrs={'id':'collectionitems'})[0]
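Note that read_html returns a list of DataFrames, one per matching table, which is why the call ends in [0]. The sketch below runs the same kind of call offline against a small hand-written page. The HTML, its column names and its values are made up for illustration; only the id matches the real page. It assumes a parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page: two tables, only one has the id we want.
html = """
<table id="other">
  <tr><th>x</th></tr>
  <tr><td>ignored</td></tr>
</table>
<table id="collectionitems">
  <tr><th>Name</th><th>Rating</th></tr>
  <tr><td>Game A</td><td>7.5</td></tr>
  <tr><td>Game B</td><td>8.0</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table matching attrs.
tables = pd.read_html(StringIO(html), attrs={"id": "collectionitems"})
df = tables[0]
print(df)
```

Here the header row is taken from the th cells, so no manual header handling is needed.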
To answer your question, try to iterate over the cells row by row rather than over all the cells at once:
info = []
for row in table1.select('tr:has(td)'):
    info.append([cell.get_text(strip=True) for cell in row.find_all('td')])
Be aware that there are more th elements in the header row than td elements in each data row, so you may have to slice:
headers = []
for i in table1.find_all('th')[:-1]:
    title = i.get_text(strip=True)
    headers.append(title)
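The numbers in the question point the same way: 483 cells do not divide evenly by 8 headers, but they do divide by 7 (483 = 7 * 69), which is consistent with each row holding one td fewer than there are th elements. As a rough illustration only, assuming every row really has exactly 7 cells (the page itself is not checked here), a flat list could be split into rows like this. The names chunk and flat_cells are hypothetical and the data is a placeholder:

```python
def chunk(items, width):
    """Split a flat list into consecutive rows of `width` items."""
    if len(items) % width != 0:
        raise ValueError(f"{len(items)} items do not fit rows of {width}")
    return [items[i:i + width] for i in range(0, len(items), width)]


# Placeholder data standing in for the 483 scraped <td> values.
flat_cells = [f"cell{i}" for i in range(483)]

remainder = 483 % 8  # non-zero, so 8 columns cannot work
rows = chunk(flat_cells, 7)
print(remainder, len(rows), len(rows[0]))
```

Iterating row by row, as in the loop over tr elements, is still the more robust choice because it does not rely on every row having the same number of cells.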
Also swap the two arguments when creating the DataFrame, so that the rows go to data and the headers go to columns:
pd.DataFrame(data=info, columns=headers)
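In this call, data takes the list of rows and columns takes the labels, and each row needs as many entries as there are labels. A minimal sketch with made-up values (the headers and rows are placeholders, not the real collection):

```python
import pandas as pd

# Hypothetical headers and rows in the shape the two loops produce.
headers = ["Name", "Rating", "Plays"]
info = [
    ["Game A", "7.5", "3"],
    ["Game B", "8.0", "1"],
]

df = pd.DataFrame(data=info, columns=headers)
print(df.shape)  # (number of rows, number of columns)

# A row whose length differs from the number of headers raises ValueError.
try:
    pd.DataFrame(data=[["Game C", "6.0"]], columns=headers)
except ValueError as exc:
    print("mismatch:", exc)
```

The values stay strings because get_text returns text, so numeric columns may need converting afterwards.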
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Create an URL object
url = 'https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1'
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# Obtain information from tag <table>
table1 = soup.find('table', id='collectionitems')
table1
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th')[:-1]:
    title = i.get_text(strip=True)
    headers.append(title)
info = []
for row in table1.select('tr:has(td)'):
    info.append([cell.get_text(strip=True) for cell in row.find_all('td')])
pd.DataFrame(data=info, columns=headers)