Python web-scraping, long list output, doesn't fit in array

Time:08-14

I am very new to programming and am learning by making a web scraper for my BoardGameGeek account. I am attempting to scrape a table and convert it into a DataFrame.

I am able to pull the column headers in as a list, but the data from the table comes out as one long flat list. Since there are 8 headers but 483 items in that list, building the DataFrame fails.

My question is: how do I break up that list of data so it fits into the DataFrame?

Code:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# Create an URL object
url = 'https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1'
# Create object page
pages = requests.get(url)
pages.text

# parser-lxml = Change html to Python friendly format
soup = BeautifulSoup(pages.text, 'lxml')

# Access the first <table> tag in the document
table = soup.table

# Obtain information from tag <table>
table1 = soup.find('table', id='collectionitems')
table1

# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

info = []
for i in table1.find_all('td'):
    stats = i.text
    info.append(stats)


total = pd.DataFrame(data=headers, columns=info)

CodePudding user response:

pandas.read_html() is a quick and convenient way to scrape data from HTML tables and is sufficient in most cases. Simply select your table by its id attribute:

import pandas as pd

pd.read_html('https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1', attrs={'id':'collectionitems'})[0]
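To try the same call without hitting the network, here is a minimal sketch that runs read_html() on an inline HTML string. The HTML and its values are made up for illustration; only the id 'collectionitems' comes from the question. It assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for the real page; only the table id
# 'collectionitems' is taken from the question.
html = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="collectionitems">
  <thead><tr><th>Title</th><th>Rating</th></tr></thead>
  <tbody>
    <tr><td>Catan</td><td>7.1</td></tr>
    <tr><td>Azul</td><td>7.8</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list of DataFrames, one per matching table;
# attrs restricts the match to the table with the given id.
tables = pd.read_html(StringIO(html), attrs={"id": "collectionitems"})
df = tables[0]
print(df)
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions raise when given literal HTML.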

To answer your question, try to iterate over the cells row by row rather than over all the cells at once:

info = []
for row in table1.select('tr:has(td)'):
    info.append([cell.get_text(strip=True) for cell in row.find_all('td')])

Be aware that the header row contains more th elements than each data row contains td elements, so you may have to slice the headers:

headers = []
for i in table1.find_all('th')[:-1]:
    title = i.get_text(strip=True)
    headers.append(title)

Also swap the arguments when creating the DataFrame, so that the rows are passed as data and the headers as columns:

pd.DataFrame(data=info, columns=headers)
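If you already have the flat list from the question and want to break it up directly, a rough sketch is to cut it into chunks of one row each. This assumes every row has the same number of cells, which real pages do not guarantee. The helper name chunk_rows and the sample data are hypothetical. Note that 483 is divisible by 7 (7 x 69) but not by 8, which is consistent with the header row having one more cell than each data row.

```python
def chunk_rows(cells, n_cols):
    """Split a flat list of cell values into rows of n_cols items each."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    if len(cells) % n_cols != 0:
        raise ValueError(
            f"{len(cells)} cells cannot be split evenly into rows of {n_cols}"
        )
    return [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]


# Made-up sample data, not taken from the real page.
headers = ["Title", "Version", "Rating"]
flat = ["Catan", "1st", "7.1", "Azul", "2nd", "7.8"]

rows = chunk_rows(flat, len(headers))

# Pair each row with the headers; a list of rows like this can also be
# passed as data= to pd.DataFrame together with columns=headers.
records = [dict(zip(headers, row)) for row in rows]
print(records)
```

Iterating row by row, as shown above, is more robust, because it does not depend on every row having the same length.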

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create an URL object
url = 'https://boardgamegeek.com/collection/user/kyletravels?want=1&subtype=boardgame&ff=1'

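# Name the parser explicitly, as in the question, to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get(url).text, 'lxml')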

# Obtain information from tag <table>
table1 = soup.find('table', id='collectionitems')
table1

# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th')[:-1]:
    title = i.get_text(strip=True)
    headers.append(title)

info = []
for row in table1.select('tr:has(td)'):
    info.append([cell.get_text(strip=True) for cell in row.find_all('td')])

pd.DataFrame(data=info, columns=headers)