I am trying to export tag items from HTML in a TXT file to a CSV. For some reason my code is only taking the last listed item and exporting it to the CSV; it won't scrape the other listed items. Not sure why. I tried multiple solutions but nothing worked.
Here is my code...
import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests
baseurl = 'https://www.soxboxmtl.com'
dataset = []
with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
for name in soup.find_all('div', class_='grid-title'):(name.text)
for link in soup.find_all('a', class_='grid-item-link'):(link['href'])
for price in soup.find_all('div', class_='product-price'):(price.text)

dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})
print(dataset)
df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)
Here is a sample of the HTML data:
<div data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
<a aria-label="Kitsch Paddle Hair Brush" href="/home-bath-body/p/kitsch-paddle-hair-brush">
</a>
<figure data-animation-role="image" data-test="plp-grid-image">
<div >
<img alt="Screenshot 2022-04-19 at 1.31.04 PM.png" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png"/>
<img alt="Screenshot 2022-04-19 at 1.31.24 PM.png" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot 2022-04-19 at 1.31.24 PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot 2022-04-19 at 1.31.24 PM.png"/>
<div >
<span data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section data-animation-role="content">
<div >
<div data-test="plp-grid-title">
Kitsch Paddle Hair Brush
</div>
<div data-test="plp-grid-prices">
<div >
CA$24.00
</div>
</div>
</div>
<div data-test="plp-grid-status">
<div >
Only 2 left in stock
</div>
</div>
</section>
</div>
<div data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
<a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
</a>
<figure data-animation-role="image" data-test="plp-grid-image">
<div >
<img alt="Screenshot 2022-10-17 at 12.03.06 AM.png" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png"/>
<img alt="Screenshot 2022-10-17 at 12.02.56 AM.png" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot 2022-10-17 at 12.02.56 AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot 2022-10-17 at 12.02.56 AM.png"/>
<div >
<span data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section data-animation-role="content">
<div >
<div data-test="plp-grid-title">
PJ Salvage Luxe Plush Embroidered Blanket - Blush
</div>
<div data-test="plp-grid-prices">
<div >
CA$118.00
</div>
</div>
</div>
<div data-test="plp-grid-status">
<div >
Only 1 left in stock
</div>
</div>
CodePudding user response:
It is because the for-loops do run, but they keep overwriting their loop variables, so only the last value of each remains, and that is what is then added to the dataset.
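To see why, here is a minimal, hypothetical sketch (made-up list, not your data) of the same pattern - the loop body does nothing with each value, so after the loop the variable simply holds the last item:

names = ['Brush', 'Blanket', 'Candle']
for name in names:(name)  # an expression like (name.text) is evaluated and thrown away

print(name)  # prints 'Candle' - only this last value would end up in dataset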
Recommendation - Try to simplify: orient yourself to the container element with class grid-item that holds all of the information, iterate over all these containers, and add the data to your dataset inside that loop.
This way you only need a single for-loop, which is easier to control.
The following example uses CSS selectors, as I prefer to work with these:
...
soup = BeautifulSoup(f.read(), "html.parser")

for e in soup.select('.grid-item'):
    dataset.append({
        'Field_01': e.img.get('data-src'),
        'Field_02': e.select_one('.grid-title').get_text(strip=True),
        'Field_03': baseurl + e.a.get('href'),
        'Field_04': e.select_one('.product-price').get_text(strip=True)
    })
but you can use find_all() or find() instead as well. Check also get_text() and its parameters to get rid of line breaks and whitespace - there is a short note on this below the result table.
for e in soup.find_all('div', class_='grid-item'):
    dataset.append({
        'Field_01': e.find('img', class_='grid-item-image').get('data-src'),
        'Field_02': e.find('div', class_='grid-title').get_text(strip=True),
        'Field_03': baseurl + e.find('a', class_='grid-item-link').get('href'),
        'Field_04': e.find('div', class_='product-price').get_text(strip=True)
    })
This will lead to:
| Field_01 | Field_02 | Field_03 | Field_04 |
| --- | --- | --- | --- |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png | Kitsch Paddle Hair Brush | https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush | CA$24.00 |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png | PJ Salvage Luxe Plush Embroidered Blanket - Blush | https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush | CA$118.00 |
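As for get_text(), here is a small, made-up snippet (not from the page above) showing what the strip parameter does with the surrounding line breaks and whitespace:

from bs4 import BeautifulSoup

# Hypothetical markup just to demonstrate the parameter
price = BeautifulSoup('<div class="product-price">\n  CA$24.00\n</div>', 'html.parser').div

print(repr(price.get_text()))            # '\n  CA$24.00\n' - whitespace kept
print(repr(price.get_text(strip=True)))  # 'CA$24.00'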
CodePudding user response:
There are two problems with your current implementation:
Problem 1
Your loops do not actually do anything with the data that bs4 finds. The only thing adding data to your dataset is the single call to dataset.append(), which results in the single line of data you experienced.
Problem 2
Even if the loops were functional, the script would likely fail, because pandas DataFrames require all columns to be the same length. For example, each product has two image tags but only one title, so you would end up with columns of different lengths.
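For illustration, here is a minimal, hypothetical example (made-up values) of that length problem, and of the from_dict/transpose trick used in the solution below to work around it:

import pandas as pd

# Two image URLs but only one title per product:
data = {
    'Field_01': ['img1.png', 'img2.png'],
    'Field_02': ['Kitsch Paddle Hair Brush'],
}

# pd.DataFrame(data) would raise:
# ValueError: All arrays must be of the same length

# from_dict(orient='index') plus transpose() pads the shorter columns with NaN instead:
df = pd.DataFrame.from_dict(data, orient='index').transpose()
print(df)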
Solution
Besides making sure that we're actually appending data correctly, we need to ensure that all columns are formatted correctly and consistently. Rather than searching for any and all information with no relation to each other, we instead search for all parent elements that contain the information relating to our needs.
We then iterate over the list of parent elements. Inside each iteration, we search only that parent element for usable data, then format it for use in a DataFrame. This DataFrame is appended to our list of DataFrames, which is concatenated into a single DataFrame once the iterations are done, and finally exported.
# Find all the grid-items first.
sections = soup.find_all('div', {'class': 'grid-item'}, recursive=True)

# We will append our formatted data to this list, then
# provide it to the DataFrame on creation.
df_items = []

# Format and add the data from each grid-item to the DataFrame.
for section in sections:
    title = section.find('a', {'class': 'grid-item-link'})
    imgs = section.find_all('img')
    price = section.find('div', {'class': 'product-price'})

    data = {
        'Field_01': [img['data-src'] for img in imgs],
        'Field_02': [title['aria-label']],
        'Field_03': [baseurl + title['href']],
        'Field_04': [''.join(price.text.split())],
    }

    # DataFrames require all arrays to be the same length.
    # This automatically fills in any missing cells.
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()

    # Append the DataFrame to our list of DataFrames.
    df_items.append(df)

# Concatenate all dataframes.
result = pd.concat(df_items)

# Export
result.to_csv('data.csv', index=False)