I am trying to figure out how to add an image id to a list and skip that image in the next search. This is my code so far, I have tried a lot of things... The bot should add the image it just copied to the 'used' blacklist and not copy it again next time.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

search = True
used = []
driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
# accept the cookie consent dialog
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)

while True:
    while search:
        post = driver.find_element(By.CSS_SELECTOR, value='.post-container a img')
        if post.id in used:
            search = True
        else:
            search = False
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
    time.sleep(20)
The problem: it adds the used image to the list, but it still finds and copies the same image again...
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
https://img-9gag-fun.9cache.com/photo/aWgNx36_460s.jpg
Fat acceptance activist on the news was so fat they had to put her in landscape mode
......
['4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b', '4d17ee3f-213a-4d54-a18f-425f8f4dea4b']
4d17ee3f-213a-4d54-a18f-425f8f4dea4b
EDIT: code:
while True:
    driver.switch_to.window(gag_tab)
    posts = driver.find_elements(By.CSS_SELECTOR, value='.post-container a img')
    for post in posts:
        post_url = post.get_attribute('src')
        post_title = post.get_attribute('alt')
        # paste the url and title into another site
        time.sleep(20)
error:
Traceback (most recent call last):
File "main.py", line 86, in <module>
post_url = post.get_attribute('src')
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=101.0.4951.67)
CodePudding user response:
First of all: you forgot to set search = True after printing the last post, so the inner loop would always be skipped and the first post printed again. But even then you are not done, since driver.find_element() always returns the first element matching your arguments, so the code would get stuck in an endless loop: the first post is already in the used list, so search would be set to True endlessly.

Try driver.find_elements() instead. This creates a list with all the posts, so you can just loop through the list and print each post like this:
posts = driver.find_elements(by=By.CSS_SELECTOR, value='.post-container a img')
for post in posts:
    post_url = post.get_attribute('src')
    post_title = post.get_attribute('alt')
    used.append(post.id)
    print(post_url)
    print(post_title)
    print('......')
    print(used)
    print(post.id)
time.sleep(2)
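To combine this with the original goal of skipping already-copied posts, keep the used blacklist and simply skip any post whose id is already in it. The skip logic itself is plain Python and can be sketched independently of Selenium; the (post_id, post_url) tuples below are hypothetical stand-ins for post.id and post.get_attribute('src'):

```python
def skip_seen(posts, used):
    """Yield only the URLs of posts not seen before,
    recording each new id in the `used` blacklist."""
    for post_id, post_url in posts:
        if post_id in used:
            continue  # already copied this post, skip it
        used.add(post_id)
        yield post_url

# Hypothetical batches standing in for two successive find_elements() results
batch1 = [('a1', 'http://img/1.jpg'), ('a2', 'http://img/2.jpg')]
batch2 = [('a1', 'http://img/1.jpg'), ('a3', 'http://img/3.jpg')]

used = set()
print(list(skip_seen(batch1, used)))  # both posts are new
print(list(skip_seen(batch2, used)))  # only the unseen post remains
```

Using a set for used makes the membership test O(1). With Selenium you may also prefer keying on the src URL rather than post.id, since post.id is an internal WebDriver handle and is not guaranteed to stay stable across page reloads.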
CodePudding user response:
The selector for "post" has no context within which to be relative (its value starts with a period, i.e. a class selector). Since there is no description of the structure of the actual webpage, it is hard to determine the correct code you need.
I found these two YouTube clips to be instructive:
- How I use SELENIUM to AUTOMATE the Web with PYTHON. Pt1: https://www.youtube.com/watch?v=pUUhvJvs-R4
- How to SCRAPE DYNAMIC websites with Selenium: https://www.youtube.com/watch?v=lTypMlVBFM4