selenium Instagram scraper duplication

Time:12-31

I am trying to scrape Instagram by hashtag (in this case, dog) using Selenium:

  1. scroll to load images
  2. get links of posts for loaded images

But I realized that most of the links are repeated (see the last 3 lines of output). I don't know what the problem is. I have even tried several libraries for Instagram scraping, but all of them either give errors or don't search by hashtag.
I am scraping Instagram to get image data for my deep learning classifier model. I would also like to know if there are better methods for Instagram scraping.

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains as AC

driver = webdriver.Edge("msedgedriver.exe")
driver.get("https://www.instagram.com")

tag = "dog"
numberOfScrolls = 70

### Login Section ###

time.sleep(3)
username_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[1]/div/label/input')
username_field.send_keys("myusername")

password_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[2]/div/label/input')
password_field.send_keys("mypassword")
time.sleep(1)

driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]').click()
time.sleep(5)

### Scraping Section ###

link = "https://www.instagram.com/explore/tags/" + tag
driver.get(link)
time.sleep(5)
Links = []
for i in range(numberOfScrolls):
    AC(driver).send_keys(Keys.END).perform()  # scrolls to the bottom of the page
    time.sleep(1)
    for x in range(1, 8):
        try:
            row = driver.find_element_by_xpath(
                '//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
            row = row.find_elements_by_tag_name("a")
            for element in row:
                if element.get_attribute("href") is not None:
                    print(element.get_attribute("href"))
                    Links.append(element.get_attribute("href"))
        except:
            continue

print(len(Links))
Links = list(set(Links))
print(len(Links))
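A side note on the final deduplication step: `list(set(Links))` removes duplicates but scrambles the order in which the links were discovered. If order matters for building the dataset, a small sketch of an order-preserving alternative (using made-up example links):

```python
links = [
    "https://www.instagram.com/p/AAA/",
    "https://www.instagram.com/p/BBB/",
    "https://www.instagram.com/p/AAA/",  # duplicate
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates while preserving first-seen order
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

This prints the two distinct links in their original order, whereas `list(set(links))` may return them in any order.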

CodePudding user response:

I found what my mistake was:

row = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')

Specifically, in `str(i)` it should be `x` instead of `i`; that's why most of the links were repeated.
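The duplication can be reproduced without Selenium. Because the inner loop uses the outer index `i`, every inner iteration builds the same row XPath, so the same row's links are collected 7 times per scroll. A minimal sketch (the XPath template is simplified from the question):

```python
def build_xpaths(index_name):
    """Build the row XPaths the scraper would query, using either
    the outer index i (buggy) or the inner index x (fixed)."""
    xpaths = []
    for i in range(3):            # scroll iterations (outer loop)
        for x in range(1, 8):     # row positions on screen (inner loop)
            idx = i if index_name == "i" else x
            # simplified version of the row XPath from the question
            xpaths.append('//*[@id="react-root"]//div[' + str(idx) + ']')
    return xpaths

buggy = build_xpaths("i")   # same row selected 7 times per scroll
fixed = build_xpaths("x")   # 7 distinct rows selected per scroll
print(len(set(buggy)), len(set(fixed)))  # → 3 7
```

With `i`, only 3 distinct XPaths are ever queried (one per scroll); with `x`, all 7 row positions are queried, which is why switching the index removes the duplicates.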
