I am using Selenium to extract data from the HTML body of a webpage and am writing the data to a .csv file using pandas.
The data is extracted and written to the file, however I would like to manipulate the formatting of the data to write to specified columns, after reading many threads and docs I am not able to understand how to do this.
The current CSV file output is as follows, all data in one row or one column
0,
B09KBFH6HM,
dropdownAvailable,
90,
1,
B09KBNJ4F1,
dropdownAvailable,
100,
2,
B09KBPFPCL,
dropdownAvailable,
110
or if I use the [count] count =1 method it will be one row
0,B09KBFH6HM,dropdownAvailable,90,1,B09KBNJ4F1,dropdownAvailable,100,2,B09KBPFPCL,dropdownAvailable,110
I would like the output to be formatted as follows,
/col1 /col2 /col3 /col4
0, B09KBFH6HM, dropdownAvailable, 90,
1, B09KBNJ4F1, dropdownAvailable, 100,
2, B09KBPFPCL, dropdownAvailable, 110
I have tried using columns= options but get errors in the terminal and don't understand what feature I should be using to achieve this in the docs under the append details
A simplified version is as follows
from selenium import webdriver
import pandas as pd
price = []
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.amazon.co.jp/dp/B09KBK3T1S/")
select_box = driver.find_element_by_name("dropdown_selected_size_name")
options = [x for x in select_box.find_elements_by_tag_name("option")]
for element in options:
price.append(element.get_attribute("value"))
price.append(element.get_attribute("class"))
price.append(element.get_attribute("data-a-html-content"))
output = pd.DataFrame(price)
output.to_csv("Data.csv", encoding='utf-8-sig')
driver.close()
Do I need to parse each item separately and append? I would like each of the .get_attribute values to be written to a new column.
Is there any advice you can offer for a solution to this as I am not very proficient at pandas, thank you for your helps
CodePudding user response:
Adding all your items to the price
list is going to cause them all to be in one column. Instead, store separate lists for each column, in a dict, like this (name them whatever you want):
data = {
'values': [],
'classes': [],
'data_a_html_contents': [],
}
...
for element in options:
values.append(element.get_attribute("value"))
classes.append(element.get_attribute("class"))
data_a_html_contents.append(element.get_attribute("data-a-html-content"))
...
output = pd.DataFrame(data)
output.to_csv("Data.csv", encoding='utf-8-sig')
CodePudding user response:
Approach similar to @user17242583, but a little shorter:
data = [[e.get_attribute("value"), e.get_attribute("class"), e.get_attribute("data-a-html-content")] for e in options]
df = pd.DataFrame(data, columns=['ASIN', 'dropdownAvailable', 'size']) # third column maybe is the product size
df.to_csv("Data.csv", encoding='utf-8-sig')