I am a beginner. I have a static list of column titles, and I'd like to take its values one at a time and assign each as the column title for one iteration of a for loop. For example, the first iteration should use the first value in col_titles as its column title, the second iteration should use the second value, and so on. Here's what I have so far:
data = []
col_titles = ['30024', '30033', '30038']
urls = [
    'https://www.example.com/page1',
    'https://www.example.com/page2',
    'https://www.example.com/page3'
]
counter = 1
for url in urls:
    driver.get(url)
    h2s = driver.find_elements(By.TAG_NAME, 'h2')
    try:
        for h2 in h2s:
            if counter <= 5:
                data.append(h2.get_attribute("innerText"))
                counter = counter + 1
    except (ElementNotVisibleException, NoSuchElementException):
        data.append("None")
driver.close()
print(data)
Currently, the output is a single list containing the values from every iteration, like so (each h2 stands for a unique h2 title from one of the URLs):
[h2, h2, h2, h2, h2, h2, h2, h2, None, None, h2, h2, h2, h2, None]
This is fine, as all I've done is append each iteration to the "data" array.
This is where I get stuck.
I think I should be creating a DataFrame within the for loop to grab a column title from the "col_titles" array, assigning it as a column title following (or preceding) each iteration of the for loop, but I don't know how to do this properly. What I'm hoping to achieve is an output like the following:
30024  30033  30038
h2     h2     h2
h2     h2     h2
h2     h2     h2
h2     None   h2
h2     None   None
Any insight is very appreciated!
CodePudding user response:
First create a dictionary. For each URL, use the matching entry of col_titles as the key and the list collected in that iteration as the value. Then pass the dictionary to pd.DataFrame. The code would look something like this:
import pandas as pd

col_titles = ['30024', '30033', '30038']
urls = [
    'https://www.example.com/page1',
    'https://www.example.com/page2',
    'https://www.example.com/page3'
]
ctr = 0
my_dict = {}
for url in urls:
    driver.get(url)
    h2s = driver.find_elements(By.TAG_NAME, 'h2')
    data = []
    counter = 1  # reset for every URL so each column gets up to 5 titles
    try:
        for h2 in h2s:
            if counter <= 5:
                data.append(h2.get_attribute("innerText"))
                counter = counter + 1
    except (ElementNotVisibleException, NoSuchElementException):
        data.append("None")
    # pad short columns: pd.DataFrame needs lists of equal length
    data = data + ["None"] * (5 - len(data))
    my_dict[col_titles[ctr]] = data  # use ctr first, then increment it
    ctr = ctr + 1
driver.close()  # close the browser only after all URLs are done
df = pd.DataFrame(my_dict)
print(df)
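If you would rather not keep a manual index like ctr, the titles and the per-URL lists can be paired directly. This is only a minimal sketch: `scraped` is hypothetical stand-in data I made up in place of what the browser would return, so nothing here touches Selenium.

```python
col_titles = ['30024', '30033', '30038']

# Hypothetical stand-in for the lists collected from each URL.
scraped = [
    ['a1', 'a2', 'a3'],
    ['b1', 'b2', 'b3'],
    ['c1', 'c2', 'c3'],
]

# zip pairs the first title with the first list, the second title
# with the second list, and so on; dict turns the pairs into a mapping.
my_dict = dict(zip(col_titles, scraped))

print(my_dict['30033'])  # ['b1', 'b2', 'b3']
```

The resulting dictionary has the same shape as my_dict above, so it can be handed to pd.DataFrame in the same way.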
CodePudding user response:
Use collections.defaultdict and the zip function.
Since the result is passed to a pandas DataFrame as columns and values, a dictionary-like data structure is more convenient in your case.
Instead of data = [], initialize:
from collections import defaultdict
data = defaultdict(list)
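To see why defaultdict(list) helps, here is a small standalone sketch. The titles being appended are made-up placeholders, not real scraped values.

```python
from collections import defaultdict

data = defaultdict(list)

# No need to create data['30024'] first: a missing key
# automatically starts out as an empty list.
data['30024'].append('first title')
data['30024'].append('second title')
data['30033'].append('another title')

print(dict(data))
# {'30024': ['first title', 'second title'], '30033': ['another title']}
```

With a plain dict, the first append would raise a KeyError unless the key had been set to an empty list beforehand.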
Then iterate over your urls and accumulate the values for each column separately:
for col, url in zip(col_titles, urls):
    driver.get(url)
    h2s = driver.find_elements(By.TAG_NAME, 'h2')
    counter = 1  # reset per URL so every column collects up to 5 titles
    try:
        for h2 in h2s:
            if counter <= 5:
                data[col].append(h2.get_attribute("innerText"))
                counter = counter + 1
    except (ElementNotVisibleException, NoSuchElementException):
        data[col].append("None")
driver.close()  # close the browser once, after the last URL
Finally, building the dataframe with pd.DataFrame(data) gives a structure similar to this, provided every column holds the same number of values:
   30024  30033  30038
0     h2     h2     h2
1     h2     h2     h2
2     h2     h2     h2
3     h2     h2     h2
4   None   None   None
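One caveat: when pd.DataFrame is given a dictionary of lists, the lists must all have the same length, and pages with fewer h2 elements will produce shorter lists. A minimal sketch of one way to even them out before building the dataframe follows. The helper name pad_columns and the sample values are my own assumptions for illustration, not part of pandas or Selenium.

```python
def pad_columns(columns, fill="None"):
    """Return a copy of columns with every list padded to the longest length."""
    longest = max((len(values) for values in columns.values()), default=0)
    return {
        title: values + [fill] * (longest - len(values))
        for title, values in columns.items()
    }

# Hypothetical ragged result: the second page had only three h2 elements.
ragged = {
    '30024': ['a', 'b', 'c', 'd', 'e'],
    '30033': ['f', 'g', 'h'],
    '30038': ['i', 'j', 'k', 'l'],
}

padded = pad_columns(ragged)
print(padded['30033'])  # ['f', 'g', 'h', 'None', 'None']
```

The padded dictionary can then be passed to pd.DataFrame. The fill value is the string "None" here to match the code above; a real None would work as well if you prefer missing values.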