Home > database >  Reformatting loop to save every iteration of data
Reformatting loop to save every iteration of data

Time:06-28

Sorry for asking such a basic question, but I am new to coding and still have a difficult time manipulating loops/working with unfamiliar libraries.

I have a code that's supposed to save parsed data in an excel file. But, it is only saving the last iteration of the loop. Here's what my current code looks like

auto = pd.read_stata(r"D:\StataCopies\workersurlsSample.dta")
df = pd.DataFrame(auto)

filename = r'C:\Users\nglew\Desktop\writingpractice2.xlsx'

i = 0
while i<3:
    class QuotesSpider(scrapy.Spider):
        name = "spider"
        start_urls = [
            df.iloc[i,3]
            ]
        def parse(self, response):
            trs = response.css('span.wikibase-title')
            if trs:
                items = []
                for quote in trs:
                    item = {
                        quote.css('span.wikibase-title-id::text').get().replace('(','').replace(')',''),
                    }
                    items.append(item)

                    df2 = pd.DataFrame(items)
                    with pd.ExcelWriter(filename, mode='a', engine="openpyxl", if_sheet_exists='overlay') as writer:
                        df2.to_excel(writer, sheet_name='Sheet1', startrow=i   1)
    i  = 1

The code grabs URL's from a database I have in Stata, then parses data from each of the URL's.

The result should be a list of codes that have the format 'Q000000'. But I am only getting a singular code for when i=2 as opposed to 3 separate codes. Heres what my result in excel looks like

Excel Output

How can I reformat the loop so that not only the last iteration saves, but every iteration from i=0 to i=2?

CodePudding user response:

Not the most efficient solution, but you could check whether the file exists, and if so read it, and append the new lines afterwards.

Additionally, your df2 = pd.DataFrame(items) call should have one less indentation. Otherwise, you'll be creating a new dataframe and saving it for every iteration of your for/loop.


# Extra import you'll need to check whether the file exists.
# You could also use Path from pathlib, but `os.path` should do the trick
import os


auto = pd.read_stata(r"D:\StataCopies\workersurlsSample.dta")
df = pd.DataFrame(auto)

filename = r'C:\Users\nglew\Desktop\writingpractice2.xlsx'
sheet_name='Sheet1'

i = 0
while i < 3:

    class QuotesSpider(scrapy.Spider):
        name = "spider"
        start_urls = [df.iloc[i,3]]

        def parse(self, response):

            trs = response.css('span.wikibase-title')
            if trs:
                items = [
                    item = {
                        quote.css('span.wikibase-title-id::text').get().replace('(','').replace(')',''),
                    } for quote in trs
                ]
                df2 = pd.DataFrame(items)
                if os.path.exists(filename):
                    df_temp = pd.read_excel(filename, sheet_name=sheet_name)
                    df2 = pd.concat([df_temp, df2], axis=0)
                df2.to_excel(filename, sheet_name=sheet_name, index=False)
    i  = 1

I would also advise you against creating a new class inside a while/loop. Why don't you define the class outside the while/loop, create an instance of it, and simply call its methods from within the while statement instead?

  • Related