LOOPING while scraping HTML


This is the code I have so far:

from bs4 import BeautifulSoup
from lxml import etree
import urllib.request as urllib2

# Initialize parser
parser = etree.HTMLParser()

# First page
url = "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html"

webpage = urllib2.urlopen(url, timeout=10)
html = webpage.read().decode(encoding="utf-8")

soup = BeautifulSoup(html, "html.parser")

price = soup.find("td", {"id": "price"}).text
print(price)

likes = soup.find("td", {"id": "likes"}).text
print(likes)

dislikes = soup.find("td", {"id": "dislikes"}).text
print(dislikes)

followers = soup.find("td", {"id": "followers"}).text
print(followers)

This code parses the data from that one page. I have three years' worth of pages with different dates, and I need to extract the same data from each of them. How can I loop over the pages, and how can I store the parsed data in a DataFrame? The page name is the same; only the date changes.

CodePudding user response:

If you're writing any amount of Python web-spider code, do yourself a favor and learn how to use https://scrapy.org/

It supports everything you want (XPath, cssselect, and more). You build a pipeline in which you can easily write spiders that ingest pages and store the results as items, and you can write custom handlers for those items.

You can keep writing task-specific scrapers like this one, but once you learn Scrapy you'll be hooked.
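To give a rough idea, here is a minimal sketch of a Scrapy spider for the pages in the question (the spider name and file name are made up; only the td ids and urls come from the question, so treat it as a starting point rather than a drop-in solution):

import scrapy

class PricesSpider(scrapy.Spider):
    # hypothetical name; run with: scrapy runspider prices_spider.py -o output.csv
    name = "prices"
    start_urls = [
        "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
        "https://admn5015-340805.uc.r.appspot.com/2019-01-02.html",
    ]

    def parse(self, response):
        # CSS selectors pick out the same td ids used in the question
        yield {
            "date": response.css("td#date::text").get(),
            "price": response.css("td#price::text").get(),
            "likes": response.css("td#likes::text").get(),
            "dislikes": response.css("td#dislikes::text").get(),
            "followers": response.css("td#followers::text").get(),
        }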

CodePudding user response:

If you want to loop, then you need a for-loop or a while-loop.


First you could put the code in a function which gets the url as a parameter:

def parse(url):
    print('url:', url)
    
    response = urllib.request.urlopen(url, timeout=100)  # on my computer it needs longer `timeout`
    html = response.read().decode(encoding="utf-8")

    soup = BeautifulSoup(html, "html.parser")

    date = soup.find("td", {"id": "date"}).text
    print(date)

    price = soup.find("td", {"id": "price"}).text
    print(price)

    likes = soup.find("td", {"id": "likes"}).text
    print(likes)

    dislikes = soup.find("td", {"id": "dislikes"}).text
    print(dislikes)

    followers = soup.find("td", {"id": "followers"}).text
    print(followers)

    print('---')
    
    row = [url, date, price, likes, dislikes, followers]
    
    return row

And later you can use this function with a list of urls and a for-loop:

# - before loop -

all_urls = [
    "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-02.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-03.html",
]

all_rows = []

# - loop -

for url in all_urls:
    row = parse(url)
    all_rows.append(row)

And later you can convert all the rows to a DataFrame:

# - after loop -   

df = pd.DataFrame(all_rows, columns=['url', 'date', 'price', 'likes', 'dislikes', 'followers'])

print(df)

Full code with other changes:

The main problem is that the page declares <meta charset="utf-8"> but the file actually uses latin1 characters, so it needs encoding="latin1".
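As an aside, a minimal sketch of how you could confirm the mismatch and fall back automatically (the url is any one of the pages above; everything else is the standard library):

import urllib.request

url = "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html"
raw = urllib.request.urlopen(url, timeout=100).read()

try:
    html = raw.decode("utf-8")
except UnicodeDecodeError:
    # the bytes are not valid utf-8, so fall back to latin1
    html = raw.decode("latin1")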

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

# --- functions ---   # PEP8: all functions before main code

def parse(url):
    print('url:', url)
    
    response = urllib.request.urlopen(url, timeout=100)  # page needs longer `timeout`
    html = response.read().decode(encoding="latin1")

    soup = BeautifulSoup(html, "html.parser")

    date = soup.find("td", {"id": "date"})
    date = date.text if date else ""
    print('date:', date)

    price = soup.find("td", {"id": "price"})
    price = price.text if price else ""
    print('price:', price)

    likes = soup.find("td", {"id": "likes"})
    likes = likes.text if likes else ""
    print('likes:', likes)

    dislikes = soup.find("td", {"id": "dislikes"})
    dislikes = dislikes.text if dislikes else ""
    print('dislikes:', dislikes)

    followers = soup.find("td", {"id": "followers"})
    followers = followers.text if followers else ""
    print('followers:', followers)

    print('---')
    
    row = [url, date, price, likes, dislikes, followers]
    
    return row

# --- main ---

# - before loop -

all_urls = [
    "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-02.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-03.html",
]

all_rows = []

# - loop -

for url in all_urls:
    row = parse(url)
    all_rows.append(row)
   
# - after loop -   

df = pd.DataFrame(all_rows, columns=['url', 'date', 'price', 'likes', 'dislikes', 'followers'])

print(df)

df.to_csv('output.csv')

PEP 8 -- Style Guide for Python Code: https://peps.python.org/pep-0008/


EDIT:

If all your urls contain dates, then you can use datetime.date to create the start date (and end date) and datetime.timedelta(days=1) to create the step. Then you can use a while-loop:

# - before loop -

import datetime  # needed for date and timedelta

all_rows = []

start = datetime.date(2019, 1, 1)
end = datetime.date.today()
step = datetime.timedelta(days=1)

# - loop -

while start <= end:
    url = start.strftime("https://admn5015-340805.uc.r.appspot.com/%Y-%m-%d.html")
    row = parse(url)
    all_rows.append(row)
    start += step  # advance to the next day

It would also be good to add try/except to catch problems such as a missing page or a network error.
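For example, a small sketch of what that could look like, reusing start, end, step and parse() from the loop above (assuming the failures you care about are the usual urllib errors):

import urllib.error

while start <= end:
    url = start.strftime("https://admn5015-340805.uc.r.appspot.com/%Y-%m-%d.html")
    try:
        row = parse(url)
    except (urllib.error.HTTPError, urllib.error.URLError) as ex:
        # skip pages that are missing or temporarily unreachable
        print('problem with', url, '->', ex)
    else:
        all_rows.append(row)
    start += step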
