I've cobbled together the following code that scrapes a website table using Beautiful Soup. The script works as intended except for the first two entries. Q1: The first entry is just a pair of empty brackets ([])... how do I omit it? Q2: The second entry has a hidden tab creating whitespace in the second element that I can't get rid of. How do I remove it?
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = "https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077"
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', class_='table table-striped')
df = pd.DataFrame(columns=['col1', 'col2'])
rows = []
for i, row in enumerate(table.find_all('tr')):
    rows.append([el.text.strip() for el in row.find_all('td')])
for row in rows:
    print(row)
Results:
[]
['Size', '12 -inch']
['Impedance (Ohms)', '4, 16']
['Cone Material', 'Mica-Filled IMPP']
['Surround Material', 'Rubber']
['Ideal Sealed Box Volume (cubic feet)', '1']
['Ideal Ported Box Volume (cubic feet)', '1.3']
['Port diameter (inches)', 'N/A']
['Port length (inches)', 'N/A']
['Free-Air', 'No']
['Dual Voice Coil', 'Yes']
['Sensitivity', '84.23 dB at 1 watt']
['Frequency Response', '24 - 200 Hz']
['Max RMS Power Handling', '400']
['Peak Power Handling (Watts)', '800']
['Top Mount Depth (inches)', '3 1/2']
['Bottom Mount Depth (inches)', 'N/A']
['Cutout Diameter or Length (inches)', '11 5/8']
['Vas (liters)', '34.12']
['Fs (Hz)', '32.66']
['Qts', '0.668']
['Xmax (millimeters)', '15.2']
['Parts Warranty', '1 Year']
['Labor Warranty', '1 Year']
CodePudding user response:
strip() only removes leading and trailing whitespace, so a tab in the middle of the text survives. You can clean the results like this if you want.
rows = []
for i, row in enumerate(table.find_all('tr')):
    cells = [
        el.text.strip().replace("\t", "")  # remove tabs
        for el
        in row.find_all('td')
    ]
    # don't add a row with no tds
    if cells:
        rows.append(cells)
I think you can further simplify this with the walrus operator := (Python 3.8 or newer):
rows = [
    [cell.text.strip().replace("\t", "") for cell in cells]
    for row in table.find_all('tr')
    if (cells := row.find_all('td'))
]
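If you'd rather not target tabs specifically, one alternative is to collapse every run of whitespace into a single space. This is only a sketch on made-up cell texts: the `raw_rows` list, the `clean_cell` helper, and the assumption that the cell text is '12\t-inch' are mine, not taken from the page. Note it keeps one space ('12 -inch'), whereas replace("\t", "") would give '12-inch'.

```python
# Hypothetical cell texts, as row.find_all('td') might yield them.
# The embedded "\t" in '12\t-inch' is an assumption about the page's markup.
raw_rows = [
    [],                                # a row with no <td> cells
    ["Size", "12\t-inch"],
    ["Impedance (Ohms)", " 4, 16 "],
]


def clean_cell(text):
    """Collapse each run of whitespace (tabs, newlines, spaces) to one space."""
    return " ".join(text.split())


# Skip rows that have no cells, and normalise whitespace in the rest.
cleaned = [[clean_cell(cell) for cell in row] for row in raw_rows if row]

for row in cleaned:
    print(row)
```

In the real script you would apply the same helper to `el.text` for each `td` instead of to the strings in `raw_rows`.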
CodePudding user response:
Let's simplify, shall we?
import pandas as pd
df = pd.read_html('https://www.crutchfield.com/S-f7IbEJ40aHd/p_13692194/JL-Audio-12TW3-D8.html?tp=64077')[0]
df.columns = ['Property', 'Value', 'Not Needed']
print(df[['Property', 'Value']])
Result in terminal:
                                Property               Value
0                                   Size            12 -inch
1                       Impedance (Ohms)               4, 16
2                          Cone Material    Mica-Filled IMPP
3                      Surround Material              Rubber
4   Ideal Sealed Box Volume (cubic feet)                   1
5   Ideal Ported Box Volume (cubic feet)                 1.3
6                 Port diameter (inches)                 NaN
7                   Port length (inches)                 NaN
8                               Free-Air                  No
9                        Dual Voice Coil                 Yes
10                           Sensitivity  84.23 dB at 1 watt
11                    Frequency Response         24 - 200 Hz
12                Max RMS Power Handling                 400
13           Peak Power Handling (Watts)                 800
14              Top Mount Depth (inches)               3 1/2
15           Bottom Mount Depth (inches)                 NaN
16    Cutout Diameter or Length (inches)              11 5/8
17                          Vas (liters)               34.12
18                               Fs (Hz)               32.66
19                                   Qts               0.668
20                    Xmax (millimeters)                15.2
21                        Parts Warranty              1 Year
22                        Labor Warranty              1 Year
See the pandas documentation for read_html for details.
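One thing to watch with read_html: by default it treats strings such as 'N/A' as missing values, which would explain the NaN entries in the output above where the Beautiful Soup version printed 'N/A'. A minimal sketch on an inline, made-up table (not the real page), assuming pandas and an HTML parser such as lxml are installed; passing keep_default_na=False should keep the original text:

```python
from io import StringIO

import pandas as pd

# Made-up two-row table standing in for the real page (an assumption).
html = """
<table>
  <tr><td>Size</td><td>12 -inch</td></tr>
  <tr><td>Port length (inches)</td><td>N/A</td></tr>
</table>
"""

# Default behaviour: 'N/A' is parsed as a missing value (NaN).
df_default = pd.read_html(StringIO(html))[0]
df_default.columns = ["Property", "Value"]

# keep_default_na=False leaves 'N/A' as a literal string.
df_keep = pd.read_html(StringIO(html), keep_default_na=False)[0]
df_keep.columns = ["Property", "Value"]

print(df_default)
print(df_keep)
```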