I am trying to write university names, department names and ratings to a file from https://www.whatuni.com/university-course-reviews/?pageno=14. It goes well until I reach a post without a department name; then I get this error:
file.write(user_name[k].text + ";" + uni_names[k].text + ";" + department[k].text + ";" + date_posted[k].text
IndexError: list index out of range
Here is the code I use. I believe I need to write a null or a blank when the department doesn't exist. I tried if not / else, but it didn't work for me. I would appreciate any help. Thank you.
for i in range(20):
    try:
        driver.refresh()
        uni_names = driver.find_elements_by_xpath('//div[@]/h2/a')
        department_names = driver.find_elements_by_xpath('//div[@]/h3/a')
        user_name = driver.find_elements_by_xpath('//div[@]')
        date_posted = driver.find_elements_by_xpath('//div[@]')
        uni_rev = driver.find_elements_by_xpath('(//div[@]/div[@]/p)')
        uni_rating = driver.find_elements_by_xpath('(//div[@]/div[@]/span[starts-with(@class,"ml5")])')
        job_prospects = driver.find_elements_by_xpath('//span[text()="Job Prospects"]/following-sibling::span')
        course_and_lecturers = driver.find_elements_by_xpath('//span[text()="Course and Lecturers"]/following-sibling::span')
        if not course_and_lecturers:
            lecturers = "None"
        else:
            lecturers = course_and_lecturers
        uni_facilities = driver.find_elements_by_xpath('//span[text()= "Facilities" or "Uni Facilities"]/following-sibling::span')
        if not uni_facilities:
            facilities = "None"
        else:
            facilities = uni_facilities
        student_support = driver.find_elements_by_xpath('//span[text()="Student Support"]/following-sibling::span')
        if not student_support:
            support = "None"
        else:
            support = student_support
        with open('uni_scraping.csv', 'a') as file:
            for k in range(len(uni_names)):
                if not department_names:
                    department = "None"
                else:
                    department = department_names
                file.write(user_name[k].text + ";" + uni_names[k].text + ";" + department[k].text + ";" + date_posted[k].text
                           + ";" + uni_rating[k].get_attribute("class") + ";" + job_prospects[k].get_attribute("class")
                           + ";" + lecturers[k].get_attribute("class") + ";" + facilities[k].get_attribute("class")
                           + ";" + support[k].get_attribute("class") + ";" + uni_rev[k].text + "\n")
        next_page = driver.find_element_by_class_name('mr0')
        next_page.click()
        file.close()
    except exceptions.StaleElementReferenceException as e:
        print('e')
        pass
driver.close()
CodePudding user response:
You had the right instinct when you tried if not department_names, but that check only succeeds when the list is empty. In your case, the issue is that the list is too short. Because some universities have no department, department_names will be a shorter list than uni_names.
As a result, in your loop for k in range(len(uni_names)):, department[k].text will not always be the department of the uni with the same index, and at some point k will exceed the last index of your department list. That's why department[k] raises the IndexError.
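To make the mismatch concrete, here is a minimal sketch with plain Python lists standing in for the Selenium results. The sample values and the helper name pair_by_index are made up for illustration; only the list names come from the question.

```python
# Made-up sample data: three reviews, and the second one has no department,
# so the page-wide department list ends up with only two entries.
uni_names = ["Uni A", "Uni B", "Uni C"]
department_names = ["Physics", "History"]


def pair_by_index(unis, departments):
    """Pair each university with the department at the same index (hypothetical helper)."""
    return [(unis[k], departments[k]) for k in range(len(unis))]


try:
    pair_by_index(uni_names, department_names)
except IndexError as error:
    # k reaches 2, but the department list only has indices 0 and 1.
    print("IndexError:", error)

# Even before the crash the data is misaligned: "History" really belongs
# to Uni C, yet index 1 pairs it with Uni B.
print(list(zip(uni_names, department_names)))
```

So checking whether the whole list is empty cannot help; the check has to happen per review.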
I don't know the most efficient way around this, but I think you could grab larger elements that contain the full details of each review (the whole rlst_wrap, for example), then search inside each one for its details (with a regexp, for example). That way you would know when there is no department, and avoid the issue.
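A rough sketch of that idea, run on made-up HTML fragments rather than the live page. The markup, the patterns and the parse_post helper are all assumptions for illustration; the real page structure may differ, and a proper HTML parser would be more robust than regular expressions.

```python
import re

# Made-up fragments, one per review container (not the real page markup).
sample_posts = [
    "<div><h2><a>Uni A</a></h2><h3><a>Physics</a></h3></div>",
    "<div><h2><a>Uni B</a></h2></div>",  # this review has no department
]

# Hypothetical patterns matching the fragments above.
H2_PATTERN = re.compile(r"<h2><a>(.*?)</a></h2>")
H3_PATTERN = re.compile(r"<h3><a>(.*?)</a></h3>")


def parse_post(html, missing="None"):
    """Extract the fields of one review, using a default when a field is absent."""
    uni = H2_PATTERN.search(html)
    dept = H3_PATTERN.search(html)
    return {
        "uni_name": uni.group(1) if uni else missing,
        "department": dept.group(1) if dept else missing,
    }


rows = [parse_post(html) for html in sample_posts]
print(rows)
```

Because each review is parsed on its own, a missing department only affects that one row and the other fields stay aligned.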
CodePudding user response:
Thank you Vimizen for the answer. I did what you suggested and it worked for me. I wrote something like this.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.whatuni.com/university-course-reviews/?pageno=14")
posts = []
driver.refresh()
# One container element per review
post_elements = driver.find_elements_by_xpath('//div[@]')
for post_element in post_elements:
    # Search inside this review only, so the fields cannot get misaligned
    uni_name = post_element.find_element_by_tag_name('h2')
    try:
        department = post_element.find_element_by_tag_name('h3').text
    except NoSuchElementException:
        # This review has no department; store a placeholder instead
        department = "aaaaaaaa"
    user_name = post_element.find_element_by_class_name('rev_name')
    postdict = {
        "uni_name": uni_name.text,
        "department": department,
        "user_name": user_name.text
    }
    posts.append(postdict)
print(posts)
driver.close()
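To get back to the original goal of writing a file, one possible sketch uses the standard csv module with ";" as the delimiter. The sample rows and the write_posts helper are made up for illustration; the keys match the postdict above.

```python
import csv
import io

# Made-up rows in the same shape as the posts list built by the scraper.
posts = [
    {"uni_name": "Uni A", "department": "Physics", "user_name": "alice"},
    {"uni_name": "Uni B", "department": "None", "user_name": "bob"},
]


def write_posts(rows, file_obj):
    """Write the rows as semicolon-separated values with a header line (hypothetical helper)."""
    writer = csv.DictWriter(
        file_obj,
        fieldnames=["user_name", "uni_name", "department"],
        delimiter=";",
    )
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_posts(posts, buffer)
print(buffer.getvalue())
```

For a real file, the same function should work with open('uni_scraping.csv', 'a', newline='') in place of the buffer; the csv writer also quotes any field that itself contains a semicolon, which plain string concatenation does not.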
Best