How can I scrape data from multiple urls and save these data in the same csv file?


I am using BeautifulSoup to scrape data. There are multiple URLs, and I have to save the data I scrape from them in the same CSV file. When I run the scrape for each URL separately and save to the same CSV file, only the data from the last URL I scraped ends up in the file. Below is the piece of code I scrape the data with.

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

headers = {"User-Agent": "Mozilla/5.0"}  # defined elsewhere in my script

images = []
pages = np.arange(1, 2, 1)
for page in pages:
    url = "https://www.bkmkitap.com/sanat"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")
    book_div = soup.find_all("div", class_="col col-12 drop-down hover lightBg")
    sleep(randint(2, 10))
    for bookSection in book_div:
        img_url = bookSection.find("img", class_="lazy stImage").get('data-src')
        images.append(img_url)

books = pd.DataFrame({"Image": images})
books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')

CodePudding user response:

Your question isn't very clear. When you run this, I assume a CSV gets created with all the image URLs, and you want to rerun the same script and have the new image URLs appended to the same CSV? If that is the case, then you only need to change the to_csv call to:

books.to_csv("bkm_art.csv", mode='a', index=False, header=False, encoding='utf-8-sig')

Adding mode='a' appends to the file instead of overwriting it (see the pandas DataFrame.to_csv documentation).
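
One caveat with append mode: on the very first run the file does not exist yet, so you still want the header row then. A minimal sketch, assuming the same books DataFrame and filename as above, that writes the header only when the file is new:

import os
import pandas as pd

books = pd.DataFrame({"Image": ["https://example.com/a.jpg"]})  # stands in for the scraped DataFrame

out_file = "bkm_art.csv"
# write the header only if the CSV does not exist yet; afterwards just append rows
write_header = not os.path.exists(out_file)
books.to_csv(out_file, mode='a', index=False, header=write_header, encoding='utf-8-sig')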

CodePudding user response:

import numpy as np

pages = np.arange(1, 2, 1)
for page in pages:
    print(page)  # prints only: 1

Try it; you will find you only get 1, because np.arange(1, 2, 1) produces a single value, so your loop runs just once.

Maybe you can use the built-in range instead (note that range(1, 2, 1) also yields only 1, so raise the stop value to cover more pages):

pages = range(1, 2, 1)
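
A quick sketch of how the stop value controls which page numbers you get; raising it is what actually gives you more than one:

for page in range(1, 2):  # stop=2: yields only 1
    print(page)

for page in range(1, 3):  # stop=3: yields 1 and 2
    print(page)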

CodePudding user response:

You can use the requests module of Python to fetch the pages and scrape the data, and after that use pandas to write it into a CSV file.

https://www.tutorialspoint.com/requests/requests_web_scraping_using_requests.html

pandas.DataFrame.to_csv() can be used.
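
A minimal end-to-end sketch of that idea; the URL list and the plain img selector here are placeholders, not the real site structure:

import requests
import pandas as pd
from bs4 import BeautifulSoup

rows = []
# hypothetical list of URLs; replace with the real pages you want to scrape
for url in ["https://example.com/page1", "https://example.com/page2"]:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for img in soup.find_all("img"):  # placeholder selector
        rows.append({"image": img.get("src")})

# one DataFrame collected across all URLs, written once, gives a single CSV
pd.DataFrame(rows).to_csv("images.csv", index=False, encoding="utf-8-sig")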

CodePudding user response:

The main issue in your example is that you never request the second page, so those results are missing. Iterate over all the pages first, and then create your CSV.

The second point, appending data to an existing file, is already figured out by @M B.

Note: Try to avoid selecting elements by class, because classes are more dynamic than ids or the HTML structure.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # any browser-like User-Agent

data = []

for page in range(1, 3):
    url = f"https://www.bkmkitap.com/sanat?pg={page}"
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.content, "html.parser")

    # select by id fragment instead of the more volatile class names
    for bookSection in soup.select('[id*="product-detail"]'):
        data.append({
            'image': bookSection.find("img", class_="lazy stImage").get('data-src')
        })

books = pd.DataFrame(data)

books.to_csv("bkm_art.csv", index=False, header=True, encoding='utf-8-sig')

Output

    image
0   https://cdn.bkmkitap.com/sanat-dunyamiz-190-ey...
1   https://cdn.bkmkitap.com/sanat-dunyamiz-189-te...
2   https://cdn.bkmkitap.com/tiyatro-gazetesi-sayi...
3   https://cdn.bkmkitap.com/mavi-gok-kultur-sanat...
4   https://cdn.bkmkitap.com/sanat-dunyamiz-iki-ay...
... ...
112 https://cdn.bkmkitap.com/hayal-perdesi-iki-ayl...
113 https://cdn.bkmkitap.com/cins-aylik-kultur-der...
114 https://cdn.bkmkitap.com/masa-dergisi-sayi-48-...
115 https://cdn.bkmkitap.com/istanbul-sanat-dergis...
116 https://cdn.bkmkitap.com/masa-dergisi-sayi-49-...
117 rows × 1 columns