I want to retrieve data from the below website, I want to get all the pages table information and put them in the excel form. It goes through all pages but each time it erase the excel sheet and renew them for writing. at the end I just have the last page not the total pages table. would you help me?
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Chrome('C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
for J in range (20):
commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
Canada_Result=[]
for i in range(25):
temporary_data= {'Commodity': commodities[i].text,'Country': Countries[i].text,'quantity': quantities[i].text, 'weight': weights[i].text }
Canada_Result.append(temporary_data)
df_data = pd.DataFrame(Canada_Result)
df_data
df_data.to_excel('Canada_scrapping_result.xlsx', index=False)
# click on the Next button
driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
time.sleep(1)
CodePudding user response:
I would suggest before entering running the code, create an excel with the desired name and column names, after that do not create a DataFrame, but instead read the excel.
Your problem is being caused because everytime the loop repeats, its deleting the previous data and replacing the already existing excel.
so instead of
df = pd.DataFrame(Canada_Result)
I would recommend just reading the excel with
df = pd.read_excel('Canada_scrapping_result.xlsx')
CodePudding user response:
You have to initialize the Canada_Result
list before entering the outer for
loop, then just append new data, and when the outer loop is ended convert the list to a dataframe and export it to a file.
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Chrome('C:\Webdriver\chromedriver.exe')
driver.get('https://www150.statcan.gc.ca/n1/pub/71-607-x/2021004/imp-eng.htm?r1=(1)&r2=0&r3=0&r4=12&r5=0&r7=0&r8=2022-01-01&r9=2022-01-01')
time.sleep(2)
Canada_Result = []
for J in range (20):
commodities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[2]/a')
Countries = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[4]')
quantities = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[7]')
weights = driver.find_elements_by_xpath('.//*[@id="report_table"]/tbody/tr["i"]/td[8]/abbr')
for i in range(25):
temporary_data = {'Commodity': commodities[i].text,'Country': Countries[i].text,'quantity': quantities[i].text, 'weight': weights[i].text }
Canada_Result.append(temporary_data)
# click on the Next button
driver.find_element_by_xpath('//*[@id="report_results_next"]').click()
time.sleep(1)
df_data = pd.DataFrame(Canada_Result)
df_data.to_excel('Canada_scrapping_result.xlsx', index=False)