How to scrape HTML from TXT and store all items to CSV?


I am trying to export tag items from HTML stored in a TXT file. For some reason my code is only taking the last line and exporting it to the CSV; it won't scrape the other listed items. I'm not sure why. I tried multiple solutions, but nothing worked.

Here is my code...

import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests


baseurl = 'https://www.soxboxmtl.com'

dataset = []

with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:

        
        soup = BeautifulSoup(f.read(), "html.parser")
        for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
        for name in soup.find_all('div', class_='grid-title'):(name.text)    
        for link in soup.find_all('a', class_='grid-item-link'):(link['href'])  
        for price in soup.find_all('div', class_='product-price'):(price.text)
       
        dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})
        
        print(dataset)

        df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)

Here is a sample of HTML data

<div  data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
    <a aria-label="Kitsch Paddle Hair Brush"  href="/home-bath-body/p/kitsch-paddle-hair-brush">
    </a>
    <figure  data-animation-role="image" data-test="plp-grid-image">
    <div >
    <img alt="Screenshot 2022-04-19 at 1.31.04 PM.png"  data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png"/>
    <img alt="Screenshot 2022-04-19 at 1.31.24 PM.png"  data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot 2022-04-19 at 1.31.24 PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot 2022-04-19 at 1.31.24 PM.png"/>
    <div >
    <span  data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section  data-animation-role="content">
    <div >
    <div  data-test="plp-grid-title">
            Kitsch Paddle Hair Brush
          </div>
    <div  data-test="plp-grid-prices">
    <div >
    CA$24.00
    </div>
    </div>
    </div>
    <div  data-test="plp-grid-status">
    <div >
        Only 2 left in stock
      </div>
    </div>
    </section>
    </div>
    <div  data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
    <a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush"  href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
    </a>
    <figure  data-animation-role="image" data-test="plp-grid-image">
    <div >
    <img alt="Screenshot 2022-10-17 at 12.03.06 AM.png"  data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png"/>
    <img alt="Screenshot 2022-10-17 at 12.02.56 AM.png"  data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot 2022-10-17 at 12.02.56 AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot 2022-10-17 at 12.02.56 AM.png"/>
    <div >
    <span  data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section  data-animation-role="content">
    <div >
    <div  data-test="plp-grid-title">
            PJ Salvage Luxe Plush Embroidered Blanket - Blush
          </div>
    <div  data-test="plp-grid-prices">
    <div >
    CA$118.00
    </div>
    </div>
    </div>
    <div  data-test="plp-grid-status">
    <div >
        Only 1 left in stock
      </div>
    </div>

CodePudding user response:

That happens because the for-loops do run through, but each iteration overwrites the loop variable, so only the last value remains when it is finally added to the dataset.

Recommendation - Try to simplify: orient yourself on the container element with class grid-item that holds all the related information, iterate over all of these containers, and append the data to your dataset inside that loop. That way you only need a single for-loop, which is easier to control.

The following example uses CSS selectors, as I prefer to work with those:

...
soup = BeautifulSoup(f.read(), "html.parser")
for e in soup.select('.grid-item'):
    dataset.append({
        'Field_01':e.img.get('data-src'),
        'Field_02':e.select_one('.grid-title').get_text(strip=True),
        'Field_03':baseurl + e.a.get('href'),
        'Field_04':e.select_one('.product-price').get_text(strip=True)
    })

but you can use find_all() or find() instead as well. Also check get_text() and its parameters to get rid of line breaks and extra whitespace.

for e in soup.find_all('div', class_='grid-item'):
        dataset.append({
            'Field_01':e.find('img', class_='grid-item-image').get('data-src'),
            'Field_02':e.find('div', class_='grid-title').get_text(strip=True),
            'Field_03':baseurl + e.find('a', class_='grid-item-link').get('href'),
            'Field_04':e.find('div', class_='product-price').get_text(strip=True)
        })

This will lead to:

| Field_01 | Field_02 | Field_03 | Field_04 |
| --- | --- | --- | --- |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot 2022-04-19 at 1.31.04 PM.png | Kitsch Paddle Hair Brush | https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush | CA$24.00 |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot 2022-10-17 at 12.03.06 AM.png | PJ Salvage Luxe Plush Embroidered Blanket - Blush | https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush | CA$118.00 |
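As an aside on get_text(strip=True): on a fragment modeled on the price markup above (the class name is assumed, since the classes were stripped from the pasted HTML), it removes the surrounding line breaks and indentation that plain .text keeps:

```python
from bs4 import BeautifulSoup

# Fragment modeled on the price markup above; the class name is assumed.
html = '<div class="product-price">\n    CA$24.00\n    </div>'
price = BeautifulSoup(html, "html.parser").find("div", class_="product-price")

print(repr(price.text))                  # keeps the newlines and indentation
print(repr(price.get_text(strip=True)))  # 'CA$24.00'
```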

CodePudding user response:

There are two problems with your current implementation:

Problem 1

Your loops do not actually do anything with the data that bs4 finds. The only thing adding data to your data set is the single call to dataset.append(), which results in the single line of data you experienced.
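A minimal sketch of that behavior, without any HTML: a bare expression in the loop body is evaluated and immediately discarded, and after the loop the loop variable only holds the last item.

```python
values = ["a", "b", "c"]
collected = []

# Mirrors the question's loops: the expression is evaluated, then thrown away.
for v in values:
    (v.upper())  # no-op: nothing stores the result

# A single append after the loop only sees the last item...
last_only = [v.upper()]

# ...whereas appending inside the loop keeps every item.
for v in values:
    collected.append(v.upper())

print(last_only)   # ['C']
print(collected)   # ['A', 'B', 'C']
```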

Problem 2

Even if the loops were functional, the script would likely fail because pandas DataFrames require columns of consistent length. For example, there are more images than there are titles, so you would end up with columns of varying length.
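This can be reproduced in isolation (the field values here are made up for illustration): the plain DataFrame constructor rejects ragged columns, while building row-wise with from_dict(orient='index') and transposing pads the short column with NaN.

```python
import pandas as pd

ragged = {"Field_01": ["img1.png", "img2.png"], "Field_02": ["Brush"]}

# The plain constructor refuses columns of different lengths.
try:
    pd.DataFrame(ragged)
except ValueError as e:
    print(e)  # e.g. "All arrays must be of the same length"

# Building row-wise and transposing pads the missing cell with NaN instead.
df = pd.DataFrame.from_dict(ragged, orient="index").transpose()
print(df.shape)  # (2, 2)
```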

Solution

Besides making sure that we're actually appending data correctly, we need to ensure that all columns are formatted correctly and consistently. Rather than searching for any and all information with no relation to each other, we instead search for all parent elements that contain the information relating to our needs.

We then iterate over the list of parent elements. Inside each iteration, we search only that parent element for usable data, then format it for use in a DataFrame. This DataFrame is appended to our list of DataFrames, which is concatenated into a single DataFrame once the iterations are done, and finally exported.

# Find all the grid-items first.
sections = soup.find_all('div', {'class': 'grid-item'}, recursive=True)

# We will append our formatted data to this list, then
# provide it to the DataFrame on creation
df_items = []

# Format and add the data from each grid-item to the DataFrame.
for section in sections:
    title = section.find('a', {'class': 'grid-item-link'})
    imgs = section.find_all('img')
    price = section.find('div', {'class': 'product-price'})

    data = {
        'Field_01': [img['data-src'] for img in imgs],
        'Field_02': [title['aria-label']],
        'Field_03': [baseurl + title['href']],
        'Field_04': [''.join(price.text.split())],
    }

    # DataFrames require all arrays to be the same length.
    # This automatically fills in any missing cells.
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()

    # Append the DataFrame to our list of DataFrames.
    df_items.append(df)

# Concatenate all dataframes.
result = pd.concat(df_items)

# Export
result.to_csv('data.csv', index=False)