How to append new data without deleting the old ones? (Python, Pandas, BeautifulSoup)


What I'm trying to do is have this script append the new data to the file each time it runs, without deleting the old rows.

When I run it this way, it deletes the old file entirely and rewrites it from scratch.

Also, if you have any other ideas where I can improve the code, I'd appreciate it if you could share them.

Thank you in advance for your help.

from bs4 import BeautifulSoup
import requests
import pandas as pd
from datetime import date
import time

url = 'https://www.teknosa.com/laptop-notebook-c-116004?s=:relevance:seller:teknosa&page={page}'
headers = {'User-Agent': 'Mozilla/5.0'}
data = []
# scrape the first 5 result pages
for page in range(1, 6):
    req = requests.get(url.format(page=page), headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    jobs = soup.find_all('div', class_='prd')
    # timestamp shared by all rows collected on this page
    t = time.localtime()
    current_time = time.strftime("%H:%M:%S", t)

    for job in jobs:
        data.append({
            'Tarih': date.today(),
            'Saat': current_time,
            'Ürün Açıklaması': job.find('a', class_='prd-link')['title'],
            'Account Kod': job.find('button', class_='prd-favorite btn-add-favorites')['data-product-id'],
            'Fiyat': job.find('span', class_='prc prc-last').text.strip(),
            'URL': "https://www.teknosa.com" + job.find('a', class_='prd-link')['href'],
        })

def append_df_to_excel(df, excel_path):
    df_excel = pd.read_excel(excel_path)
    result = pd.concat([df_excel, df], ignore_index=True)
    result.to_excel(excel_path, index=False)

df = pd.DataFrame(data)
append_df_to_excel(df, r"test.xlsx")
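
One thing to watch: append_df_to_excel assumes test.xlsx already exists, so the very first run fails with FileNotFoundError before anything is written. A minimal guarded sketch of the same function (the os.path.exists check is my addition, not part of the original code):

import os
import pandas as pd

def append_df_to_excel(df, excel_path):
    # On the first run there is nothing to read, so just write df;
    # afterwards, read the old rows and stack the new ones underneath.
    if os.path.exists(excel_path):
        old = pd.read_excel(excel_path)
        df = pd.concat([old, df], ignore_index=True)
    df.to_excel(excel_path, index=False)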

Edit: Hello again, everyone,

I found some code that partially solves my problem and wanted to share it, but now I am facing another problem.

Every time I run the file, it corrupts the time format of the rows written on previous runs, as in the example below.

[screenshot: Error Example showing the corrupted time format]

I could not figure out whether the problem comes from the code or from Excel itself; I'd appreciate any help.
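
A likely cause of this drift (hard to confirm without the workbook): each run re-reads the sheet, pandas re-interprets the 'Saat' strings as time values, and the next to_excel serializes them in a different format. A hedged workaround, using the column names from the code above, is to pin both columns to plain strings on every read and write:

import os
import pandas as pd

def append_df_to_excel(df, excel_path):
    # Keep the date/time columns as plain text so repeated Excel
    # round trips cannot re-interpret and reformat them.
    df = df.astype({'Tarih': str, 'Saat': str})
    if os.path.exists(excel_path):
        old = pd.read_excel(excel_path, dtype={'Tarih': str, 'Saat': str})
        df = pd.concat([old, df], ignore_index=True)
    df.to_excel(excel_path, index=False)

Appending to a CSV instead, e.g. df.to_csv('test.csv', mode='a', index=False, header=not os.path.exists('test.csv')), sidesteps Excel's type guessing entirely.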

CodePudding user response:

The code below shows the actual, correct way to scrape this site; it should cut both your code and your runtime by roughly a factor of five.

from bs4 import BeautifulSoup
import requests
import pandas as pd
from datetime import date
import time

url = 'https://www.teknosa.com/laptop-notebook-c-116004?s=:relevance:seller:teknosa&page={page}'
headers = {'User-Agent': 'Mozilla/5.0'}
data = []
for page in range(1, 6):
    req = requests.get(url.format(page=page), headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    jobs = soup.find_all('div', class_='prd')
    t = time.localtime()
    current_time = time.strftime("%H:%M:%S", t)

    for job in jobs:
        data.append({
            'Tarih': date.today(),
            'Saat': current_time,
            'xy': job.find('a', class_='prd-link')['title'],
            'Account Kod': job.find('button', class_='prd-favorite btn-add-favorites')['data-product-id'],
            'Fiyat': job.find('span', class_='prc prc-last').text.strip(),
            'URL': "https://www.teknosa.com" + job.find('a', class_='prd-link')['href'],
        })

df = pd.DataFrame(data)  # .to_excel('test.xlsx') to save instead of printing
print(df)
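
Note that the write is commented out above, so this version only prints the frame and does not append to the workbook either. If you are on pandas 1.4+ with openpyxl installed, ExcelWriter's append mode can add the new rows under the existing ones without re-reading the whole sheet; a sketch, assuming the default sheet name 'Sheet1' and that the file already exists:

import pandas as pd

def append_below(df, path, sheet='Sheet1'):
    # mode='a' opens the existing workbook; if_sheet_exists='overlay'
    # (pandas >= 1.4) writes into the existing sheet instead of raising.
    with pd.ExcelWriter(path, engine='openpyxl', mode='a',
                        if_sheet_exists='overlay') as writer:
        start = writer.sheets[sheet].max_row  # 1-based last row == 0-based first free row
        df.to_excel(writer, sheet_name=sheet, index=False,
                    header=False, startrow=start)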