I am trying to scrape Instagram by hashtag (in this case "dog") using Selenium:
- scroll to load images
- get the links of the posts for the loaded images
But I realized that most of the links are repeated (see the last 3 lines of the code), and I don't know what the problem is. I also tried several Instagram scraping libraries, but all of them either raise errors or don't support searching by hashtag.
I am scraping Instagram to collect image data for my deep learning classifier model.
I would also like to know whether there are better methods for scraping Instagram.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains as AC
driver = webdriver.Edge("msedgedriver.exe")
driver.get("https://www.instagram.com")
tag = "dog"
numberOfScrolls = 70
### Login Section ###
time.sleep(3)
username_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[1]/div/label/input')
username_field.send_keys("myusername")
password_field = driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[2]/div/label/input')
password_field.send_keys("mypassword")
time.sleep(1)
driver.find_element_by_xpath('//*[@id="loginForm"]/div/div[3]').click()
time.sleep(5)
### Scraping Section ###
link = "https://www.instagram.com/explore/tags/" + tag
driver.get(link)
time.sleep(5)
Links = []
for i in range(numberOfScrolls):
    AC(driver).send_keys(Keys.END).perform()  # scrolls to the bottom of the page
    time.sleep(1)
    for x in range(1, 8):
        try:
            row = driver.find_element_by_xpath(
                '//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
            row = row.find_elements_by_tag_name("a")
            for element in row:
                if element.get_attribute("href") is not None:
                    print(element.get_attribute("href"))
                    Links.append(element.get_attribute("href"))
        except:
            continue
print(len(Links))
Links = list(set(Links))
print(len(Links))
CodePudding user response:
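I found my mistake. It was in this line:
row = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div/div[' + str(i) + ']')
specifically in the str(i) part.
It should be x instead of i. i is the scroll counter of the outer loop, so every pass of the inner loop read the same row, which is why most of the links were repeated.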
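
For illustration, here is a minimal sketch of the corrected inner loop. To make it runnable without a browser, the Selenium lookups are replaced by a hypothetical callable, hrefs_for_xpath, and the page is faked with a small dictionary of example.com URLs. The names row_xpath, collect_links and fake_page are assumptions made for this sketch and are not part of the original script.

```python
ROW_XPATH = '//*[@id="react-root"]/section/main/article/div[2]/div/div[{}]'


def row_xpath(row_index):
    """Build the XPath for one row of the post grid (1-based index)."""
    return ROW_XPATH.format(row_index)


def collect_links(hrefs_for_xpath, rows=range(1, 8), seen=None):
    """Collect unique post links from the given grid rows.

    hrefs_for_xpath is a hypothetical hook: a callable that takes an XPath
    string and returns the href values found under it. With Selenium it
    would wrap the element lookups used in the question.
    """
    seen = {} if seen is None else seen  # a dict keeps insertion order
    for x in rows:
        # Index the row with the inner loop variable x, not the scroll counter.
        for href in hrefs_for_xpath(row_xpath(x)):
            if href is not None:
                seen.setdefault(href, True)
    return list(seen)


# Stand-in for the browser: each row XPath maps to the hrefs inside that row.
fake_page = {
    row_xpath(1): ["https://example.com/p/a/", "https://example.com/p/b/"],
    row_xpath(2): ["https://example.com/p/b/", "https://example.com/p/c/", None],
}

links = collect_links(lambda xp: fake_page.get(xp, []))
print(links)
```

Passing the same seen dictionary to collect_links after every scroll would keep the links unique across scrolls while preserving the order in which they were first found, which list(set(Links)) does not guarantee.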