Sorry for asking such a basic question, but I am new to coding and still have a difficult time manipulating loops/working with unfamiliar libraries.
I have a code that's supposed to save parsed data in an excel file. But, it is only saving the last iteration of the loop. Here's what my current code looks like
auto = pd.read_stata(r"D:\StataCopies\workersurlsSample.dta")
df = pd.DataFrame(auto)
filename = r'C:\Users\nglew\Desktop\writingpractice2.xlsx'
i = 0
while i<3:
class QuotesSpider(scrapy.Spider):
name = "spider"
start_urls = [
df.iloc[i,3]
]
def parse(self, response):
trs = response.css('span.wikibase-title')
if trs:
items = []
for quote in trs:
item = {
quote.css('span.wikibase-title-id::text').get().replace('(','').replace(')',''),
}
items.append(item)
df2 = pd.DataFrame(items)
with pd.ExcelWriter(filename, mode='a', engine="openpyxl", if_sheet_exists='overlay') as writer:
df2.to_excel(writer, sheet_name='Sheet1', startrow=i 1)
i = 1
The code grabs URL's from a database I have in Stata, then parses data from each of the URL's.
The result should be a list of codes that have the format 'Q000000'. But I am only getting a singular code for when i=2 as opposed to 3 separate codes. Heres what my result in excel looks like
How can I reformat the loop so that not only the last iteration saves, but every iteration from i=0 to i=2?
CodePudding user response:
Not the most efficient solution, but you could check whether the file exists, and if so read it, and append the new lines afterwards.
Additionally, your df2 = pd.DataFrame(items)
call should have one less indentation. Otherwise, you'll be creating a new dataframe and saving it for every iteration of your for/loop
.
# Extra import you'll need to check whether the file exists.
# You could also use Path from pathlib, but `os.path` should do the trick
import os
auto = pd.read_stata(r"D:\StataCopies\workersurlsSample.dta")
df = pd.DataFrame(auto)
filename = r'C:\Users\nglew\Desktop\writingpractice2.xlsx'
sheet_name='Sheet1'
i = 0
while i < 3:
class QuotesSpider(scrapy.Spider):
name = "spider"
start_urls = [df.iloc[i,3]]
def parse(self, response):
trs = response.css('span.wikibase-title')
if trs:
items = [
item = {
quote.css('span.wikibase-title-id::text').get().replace('(','').replace(')',''),
} for quote in trs
]
df2 = pd.DataFrame(items)
if os.path.exists(filename):
df_temp = pd.read_excel(filename, sheet_name=sheet_name)
df2 = pd.concat([df_temp, df2], axis=0)
df2.to_excel(filename, sheet_name=sheet_name, index=False)
i = 1
I would also advise you against creating a new class
inside a while/loop
. Why don't you define the class
outside the while/loop
, create an instance of it, and simply call its methods from within the while
statement instead?