I am using BeautifulSoup to scrape the data. There are multiple URLs, and I have to save the data I scrape from these URLs into the same CSV file. When I scrape the URLs separately and save to the same CSV file, only the data from the last URL I scraped ends up in the file. Below is the code I use to scrape the data.
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

headers = {"User-Agent": "Mozilla/5.0"}

images = []
pages = np.arange(1, 2, 1)
for page in pages:
    url = "https://www.bkmkitap.com/sanat"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    book_div = soup.find_all("div", class_="col col-12 drop-down hover lightBg")
    sleep(randint(2, 10))
    for bookSection in book_div:
        img_url = bookSection.find("img", class_="lazy stImage").get('data-src')
        images.append(img_url)

books = pd.DataFrame({"Image": images})
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
CodePudding user response:
Your question isn't very clear. When you run this, I assume a CSV gets created with all the image URLs, and you want to rerun the same script and have the other image URLs appended to the same CSV? If that is the case, then you only need to change the to_csv call to:
books.to_csv("bkm_art.csv", mode='a', index=False, header=False, encoding='utf-8-sig')
Adding mode='a' makes pandas append to the file instead of overwriting it (see the pandas to_csv documentation).
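One caveat with appending: with header=False the column name is never written, but appending with header=True would duplicate the header on every run. A small sketch of a common workaround, writing the header only when the file does not exist yet (the os.path.exists check and the placeholder row are my own additions, not part of the original answer):
import os
import pandas as pd

books = pd.DataFrame({"Image": ["https://example.com/cover.jpg"]})  # placeholder row

csv_path = "bkm_art.csv"
# Write the header only if the file does not exist yet, then keep appending.
books.to_csv(csv_path, mode='a', index=False,
             header=not os.path.exists(csv_path), encoding='utf-8-sig')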
CodePudding user response:
import numpy as np
import pandas as pd
pages = np.arange(1, 2, 1)
for page in pages:
    print(page)
Try it; you will find you just get 1, so only a single page is ever requested.
Maybe you can use
pages = range(1, 3)
instead. The stop value is exclusive in both range and np.arange, which is why a stop of 2 gives you only page 1.
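A quick check of the difference, using the same numpy import as above:
import numpy as np

print(list(np.arange(1, 2, 1)))  # [1] - stop is exclusive, so only page 1
print(list(np.arange(1, 3)))     # [1, 2] - pages 1 and 2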
CodePudding user response:
You can use the requests module of Python to fetch the pages and scrape the data, and after that you can use pandas to convert it into a CSV file.
https://www.tutorialspoint.com/requests/requests_web_scraping_using_requests.html
pandas.DataFrame.to_csv() can be used.
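A minimal sketch of that flow, reusing the URL and the img selector from the question (the User-Agent header is an assumption, since the original headers dict isn't shown):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.bkmkitap.com/sanat"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.content, "html.parser")

# Collect every cover image URL that carries a data-src attribute.
rows = [{"Image": img.get("data-src")}
        for img in soup.find_all("img", class_="lazy stImage")]
pd.DataFrame(rows).to_csv("bkm_art.csv", index=False, encoding="utf-8-sig")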
CodePudding user response:
The main issue in your example is that you never request the second page, so you won't get those results. Iterate over all the pages and then create your CSV.
The second one, appending data to an existing file, is figured out by @M B.
Note: Try to avoid selecting your elements by classes, because they are more dynamic than ids or the HTML structure.
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}

data = []
for page in range(1, 3):
    url = f"https://www.bkmkitap.com/sanat?pg={page}"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    # Select product containers via a stable id pattern instead of CSS classes.
    for bookSection in soup.select('[id*="product-detail"]'):
        data.append({
            'image': bookSection.find("img", class_="lazy stImage").get('data-src')
        })

books = pd.DataFrame(data)
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')
Output
image
0 https://cdn.bkmkitap.com/sanat-dunyamiz-190-ey...
1 https://cdn.bkmkitap.com/sanat-dunyamiz-189-te...
2 https://cdn.bkmkitap.com/tiyatro-gazetesi-sayi...
3 https://cdn.bkmkitap.com/mavi-gok-kultur-sanat...
4 https://cdn.bkmkitap.com/sanat-dunyamiz-iki-ay...
... ...
112 https://cdn.bkmkitap.com/hayal-perdesi-iki-ayl...
113 https://cdn.bkmkitap.com/cins-aylik-kultur-der...
114 https://cdn.bkmkitap.com/masa-dergisi-sayi-48-...
115 https://cdn.bkmkitap.com/istanbul-sanat-dergis...
116 https://cdn.bkmkitap.com/masa-dergisi-sayi-49-...
117 rows × 1 columns