This is the code so far:
from bs4 import BeautifulSoup
from lxml import etree
import urllib.request as urllib2
# Initialize parser
parser = etree.HTMLParser()
# First page
url = "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html"
webpage = urllib2.urlopen(url, timeout=10)
html = webpage.read().decode(encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")
price = soup.find("td", {"id": "price"}).text
print(price)
likes = soup.find("td", {"id": "likes"}).text
print(likes)
dislikes = soup.find("td", {"id": "dislikes"}).text
print(dislikes)
followers = soup.find("td", {"id": "followers"}).text
print(followers)
This code parses data from this particular webpage. Now I have 3 years of webpages with different dates from which I need to extract the same data. How can I loop over them, and how can I store the data in a dataframe while parsing? The web page name is the same, just the date changes.
CodePudding user response:
If you're writing any amount of Python web spider - do yourself a favor and just learn how to use https://scrapy.org/
It supports everything you want: XPath and CSS selectors, plus pipelines, so you can easily write spiders that ingest pages and store the results as items with custom handlers.
You can write task-specific scrapers like yours, but once you learn Scrapy you'll be hooked.
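For reference, a minimal Scrapy sketch of the same extraction (the spider name and item keys are my own illustration, assuming the page keeps the same td ids as in the question):
import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    start_urls = [
        "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
    ]

    def parse(self, response):
        # CSS selectors pick out the same cells the BeautifulSoup code reads
        yield {
            "date": response.css("td#date::text").get(),
            "price": response.css("td#price::text").get(),
            "likes": response.css("td#likes::text").get(),
            "dislikes": response.css("td#dislikes::text").get(),
            "followers": response.css("td#followers::text").get(),
        }
You could run it with something like scrapy runspider spider.py -o rows.csv (see the Scrapy docs for details).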
CodePudding user response:
If you want to loop then you need a for-loop or a while-loop.
First you could put the code in a function which gets url as a parameter:
def parse(url):
    print('url:', url)

    response = urllib.request.urlopen(url, timeout=100)  # on my computer it needs longer `timeout`
    html = response.read().decode(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")

    date = soup.find("td", {"id": "date"}).text
    print(date)
    price = soup.find("td", {"id": "price"}).text
    print(price)
    likes = soup.find("td", {"id": "likes"}).text
    print(likes)
    dislikes = soup.find("td", {"id": "dislikes"}).text
    print(dislikes)
    followers = soup.find("td", {"id": "followers"}).text
    print(followers)
    print('---')

    row = [url, date, price, likes, dislikes, followers]
    return row
And later you can use this function with a list of urls and a for-loop:
# - before loop -
all_urls = [
    "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-02.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-03.html",
]
all_rows = []

# - loop -
for url in all_urls:
    row = parse(url)
    all_rows.append(row)
And later you can convert all rows to a DataFrame:
# - after loop -
df = pd.DataFrame(all_rows, columns=['url', 'date', 'price', 'likes', 'dislikes', 'followers'])
print(df)
Full code with other changes:
The main problem is that this page has <meta charcode="utf8"> but the file actually uses latin1 characters, so it needs encoding="latin1".
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
# --- functions --- # PEP8: all functions before main code
def parse(url):
    print('url:', url)

    response = urllib.request.urlopen(url, timeout=100)  # page needs longer `timeout`
    html = response.read().decode(encoding="latin1")
    soup = BeautifulSoup(html, "html.parser")

    date = soup.find("td", {"id": "date"})
    date = date.text if date else ""
    print('date:', date)

    price = soup.find("td", {"id": "price"})
    price = price.text if price else ""
    print('price:', price)

    likes = soup.find("td", {"id": "likes"})
    likes = likes.text if likes else ""
    print('likes:', likes)

    dislikes = soup.find("td", {"id": "dislikes"})
    dislikes = dislikes.text if dislikes else ""
    print('dislikes:', dislikes)

    followers = soup.find("td", {"id": "followers"})
    followers = followers.text if followers else ""
    print('followers:', followers)

    print('---')

    row = [url, date, price, likes, dislikes, followers]
    return row
# --- main ---
# - before loop -
all_urls = [
    "https://admn5015-340805.uc.r.appspot.com/2019-01-01.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-02.html",
    "https://admn5015-340805.uc.r.appspot.com/2019-01-03.html",
]
all_rows = []

# - loop -
for url in all_urls:
    row = parse(url)
    all_rows.append(row)
# - after loop -
df = pd.DataFrame(all_rows, columns=['url', 'date', 'price', 'likes', 'dislikes', 'followers'])
print(df)
df.to_csv('output.csv')
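As a small aside (my own addition, not part of the original answer): instead of hard-coding latin1, the decode could fall back only when utf-8 fails, e.g.:
raw = response.read()
try:
    html = raw.decode("utf-8")
except UnicodeDecodeError:
    # the page declares utf-8 but actually uses latin1, so fall back
    html = raw.decode("latin1")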
PEP 8 -- Style Guide for Python Code
EDIT:
If all your urls have dates then you could use datetime.date to create a start date (and an end date) and datetime.timedelta(days=1) to create a step. And you can use a while-loop:
# - before loop -
import datetime

all_rows = []

start = datetime.date(2019, 1, 1)
end = datetime.date.today()
step = datetime.timedelta(days=1)

# - loop -
while start <= end:
    url = start.strftime("https://admn5015-340805.uc.r.appspot.com/%Y-%m-%d.html")

    row = parse(url)
    all_rows.append(row)

    start += step
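A possible alternative (my own suggestion, not from the original answer): pandas can generate the same daily date range, which avoids the manual while-loop bookkeeping:
import pandas as pd

dates = pd.date_range("2019-01-01", pd.Timestamp.today(), freq="D")
all_urls = [d.strftime("https://admn5015-340805.uc.r.appspot.com/%Y-%m-%d.html") for d in dates]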
It could be good to add try/except to catch some problems - but I skip this part.
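A hedged sketch of what that try/except could look like inside the while-loop (illustrative only, the answer skips it):
while start <= end:
    url = start.strftime("https://admn5015-340805.uc.r.appspot.com/%Y-%m-%d.html")
    try:
        row = parse(url)
        all_rows.append(row)
    except Exception as e:  # e.g. urllib.error.HTTPError or a missing <td>
        print('failed:', url, e)
    start += step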