I'm trying to scrape jpg images for each product; the product URLs are saved in a CSV file. The image links are available in JSON data, so I try to access the JSON key values. When I run the code, it returns the entire key/value structure instead of just the image URL links, and it only scrapes the last product URL instead of all of the URLs saved in the CSV.
{'name': {'b': {'src': {'xs': 'https://ctl.s6img.com/society6/img/xVx1vleu7iLcR79ZkRZKqQiSzZE/w_125/artwork/~artwork/s6-0041/a/18613683_5971445', 'lg': 'https://ctl.s6img.com/society6/img/W-ESMqUtC_oOEUjx-1E_SyIdueI/w_550/artwork/~artwork/s6-0041/a/18613683_5971445', 'xl': 'https://ctl.s6img.com/society6/img/z90VlaYwd8cxCqbrZ1ttAxINpaY/w_700/artwork/~artwork/s6-0041/a/18613683_5971445', 'xxl': None}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}, 'c': {'src': {'xs': 'https://ctl.s6img.com/society6/img/KQJbb4jG0gBHcqQiOCivLUbKMxI/w_125/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'lg': 'https://ctl.s6img.com/society6/img/ztGrxSpA7FC1LfzM3UldiQkEi7g/w_550/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xl': 'https://ctl.s6img.com/society6/img/PHjp9jDic2NGUrpq8k0aaxsYZr4/w_700/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xxl': 'https://ctl.s6img.com/society6/img/m-1HhSM5CIGl6DY9ukCVxSmVDIw/w_1500/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg'}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}, 'd': {'src': {'xs': 'https://ctl.s6img.com/society6/img/G9TikRnVvy1w0kwKCAmgWsWy42Q/w_125/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'lg': 
'https://ctl.s6img.com/society6/img/uVOYOxbHmhrNhmGQAi6QeydrFdY/w_550/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xl': 'https://ctl.s6img.com/society6/img/-WIIUx9oB6jQKJdkSkq2ofhjLzc/w_700/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg', 'xxl': 'https://ctl.s6img.com/society6/img/HlSFppIm7Wk6aVxO17fI4b5s0ts/w_1500/cutting-board/rectangle/front/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg'}, 'type': 'image', 'alt': "I'M NOT ALWAYS A BITCH (Red) Cutting Board", 'meta': None}}}
This is the JSON data. I only want to scrape the jpg image links. Below is my code:
import json
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

contents = []
with open('test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

newlist = []
for url in contents:
    try:
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, 'html.parser')
        scripts = soup.find_all('script')[7].text.strip()[24:]
        data = json.loads(scripts)
        link = data['product']['response']['product']['data']['attributes']['media_map']
    except:
        link = 'no data'
    detail = {
        'name': link
    }
    print(detail)
    newlist.append(detail)

df = pd.DataFrame(detail)
df.to_csv('s1.csv')
I'm trying to scrape all of the jpg image links. I have a CSV file containing each product URL, so I want to open that CSV file and loop over each URL.
CodePudding user response:
A few things:

- df = pd.DataFrame(detail)
should be df = pd.DataFrame(newlist)
- Your loop indentation is off. In fact, why are you looping over the urls twice? You read the urls from test.csv (you could just use pandas for that anyway), put each url into the
contents
list, and then loop through that list again.
Try this:
import json
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

contents = []
with open('test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        try:
            page = urlopen(url[0]).read()
            soup = BeautifulSoup(page, 'html.parser')
            scripts = soup.find_all('script')[7].text.strip()[24:]
            data = json.loads(scripts)
            link = data['product']['response']['product']['data']['attributes']['media_map']
        except:
            link = 'no data'
        detail = {
            'name': link
        }
        print(detail)
        contents.append(detail)

df = pd.DataFrame(contents)
df.to_csv('s1.csv')
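Note that the media_map value is still the whole nested dict, which is why you see all of the key/value pairs in your output. To keep only the jpg links, one option (a sketch, assuming the structure shown in your question; collect_jpg_links is a hypothetical helper name) is to walk the dict recursively and keep any string value that ends in .jpg:

```python
def collect_jpg_links(node):
    """Recursively walk nested dicts/lists, collecting string values ending in .jpg."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(collect_jpg_links(value))
    elif isinstance(node, list):
        for item in node:
            links.extend(collect_jpg_links(item))
    elif isinstance(node, str) and node.endswith('.jpg'):
        links.append(node)
    return links

# Example with the same shape as the media_map above (URLs shortened for illustration):
media_map = {
    'name': {
        'b': {'src': {'xs': 'https://ctl.s6img.com/society6/img/.../18613683_5971445',
                      'xxl': None},
              'type': 'image', 'alt': 'Cutting Board', 'meta': None},
        'c': {'src': {'xs': 'https://ctl.s6img.com/society6/img/.../im-not-always-a-bitch-red-cutting-board.jpg'},
              'type': 'image', 'alt': 'Cutting Board', 'meta': None},
    }
}
print(collect_jpg_links(media_map))
```

In the loop above you could then store collect_jpg_links(link) in detail instead of the raw media_map dict, so the CSV contains only the jpg URLs.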