Python scraping: storing data into a CSV file only saves a single line of data

Time:10-27

The CSV file only stores a single row of data, and if I loop over a range when writing, it writes the same line again and again until the range is exhausted.

I can't fix this bug; it has already cost me two days.


import csv
import requests
from bs4 import BeautifulSoup

for page in range(0, 10):
    url = "https://cryptonews.net/?page={page}".format(page=page)

    # column names for the CSV file
    header = ['Title', 'Tag', 'UTC', 'Web_Address']

    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.find_all("main")

    for lis in lists:
        title = lis.find('a', class_="title").text
        tag = lis.find('span', class_="etc-mark").text
        datetime = lis.find('span', class_="datetime").text
        address = lis.find('div', class_="middle-xs").text
        img = lis.find('span', class_="src")

        data = [title, tag, datetime, address, img]

counter = range(100)

with open('crypto.csv', 'a', newline='') as crypto:
    FileWriter = csv.writer(crypto)
    FileWriter.writerow(header)

    for x in counter:
        FileWriter.writerow(data)  # writer.writerows(data)




 

CodePudding user response:

You aren't storing the data; as stated, it's overwritten on each pass through the lists. Secondly, I'd opt to use pandas here to create a DataFrame, then just write that to file.

Also, you collect 5 items to write but only have 4 column names.

import pandas as pd
import requests
from bs4 import BeautifulSoup


data = []
for page in range(0, 10):
    print(page)
    url = "https://cryptonews.net/?page={page}".format(page=page)

    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.find_all("main")

    for lis in lists:
        title = lis.find('a', class_="title").text
        tag = lis.find('span', class_="etc-mark").text
        datetime = lis.find('span', class_="datetime").text
        address = lis.find('div', class_="middle-xs").text
        img = lis.find('span', class_="src")

        data.append([title, tag, datetime, address, img])

header = ['Title', 'Tag', 'UTC', 'Web_Address', 'Image']
df = pd.DataFrame(data, columns=header)
df.to_csv('crypto.csv', index=False)

Also, I'm not sure what you want as the output (as you don't say). Is this more accurate?

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

data = []
for page in range(0, 10):
    print(page)
    url = "https://cryptonews.net/?page={page}".format(page=page)

    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.find_all("div", {'class': re.compile('^row news-item.*')})

    for lis in lists:
        title = lis['data-title']
        tag = lis.find('span', class_="etc-mark").text
        datetime = lis.find('span', class_=re.compile("^datetime")).text.strip()
        address = lis['data-domain']
        img = lis['data-image']

        data.append([title, tag, datetime, address, img])

header = ['Title', 'Tag', 'UTC', 'Web_Address', 'Image']
df = pd.DataFrame(data, columns=header)
df.to_csv('crypto.csv', index=False)

Output:

print(df)
                                                 Title  ...                                              Image
0    ETH Breaches $1,500 Level As Ethereum Adds Ove...  ...  https://cnews24.ru/uploads/e29/e29a5677e448f6e...
1    India Seeing Spike in Drug Smuggling Using Cry...  ...  https://cnews24.ru/uploads/65b/65b50302f65e12c...
2    Optimism (OP) Price Prediction: 87% Rally Is J...  ...  https://cnews24.ru/uploads/5e1/5e1189bbb2c1e2b...
3        Mysterious Whale Adds 3.94 Trillion Shiba Inu  ...  https://cnews24.ru/uploads/54a/54af6726248c29a...
4    Are the big fundraising efforts of blockchain ...  ...  https://cnews24.ru/uploads/5af/5afb066d81be4a6...
..                                                 ...  ...                                                ...
195  Terra Classic (LUNC) Chief Community Officer S...  ...  https://cnews24.ru/uploads/a53/a53fd4206ab5f95...
196  Reddit NFT Collection: How to Sell Your Avatar...  ...  https://cnews24.ru/uploads/ab6/ab6718f707c3428...
197  In Topsy Turvy Market Logic, Positive U.S. GDP...  ...  https://cnews24.ru/uploads/264/264ab9327f4774a...
198  XRP Wallets Spikes Above 4.34M, Gaining 29,883...  ...  https://cnews24.ru/uploads/2e5/2e56d092b7c253b...
199                     Are crypto trading bots legit?  ...  https://cnews24.ru/uploads/ccb/ccb73d9d9b79280...

[200 rows x 5 columns]

CodePudding user response:

First, you are setting data = [title, tag, datetime, address, img] on every loop iteration but not saving it anywhere. The value of data is replaced on each iteration with the data from the next row, so the entire dataset is never kept.

Then, you are passing the same thing ("data") to FileWriter.writerow() on every loop iteration, without ever changing the value of "data". You need to write the specific row for each loop iteration.

Fix both these issues and your code should work.
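
The two fixes above can be sketched without pandas, staying with the csv module the question already uses. The stand-in rows below are placeholders for whatever the scraping loop actually parses; the point is that each row is appended to a list, and writerows() then writes them all in one call:

```python
import csv

# Collect every parsed row in a list instead of overwriting one variable.
rows = []
for item in [("Title A", "tag", "12:00", "example.com"),
             ("Title B", "tag", "13:00", "example.com")]:  # stand-ins for scraped items
    rows.append(list(item))

header = ['Title', 'Tag', 'UTC', 'Web_Address']
with open('crypto.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)   # header written once, outside the loop
    writer.writerows(rows)    # one call writes every collected row
```

With this shape there is no need for a separate counter loop: the file ends up with exactly one line per scraped item, plus the header.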
