With a bit of Python knowledge, I tried to scrape the posts on a company's LinkedIn page.
The code below, which I took from this website, first finds all posts on the company's LinkedIn page and then extracts their contents. The issue is that I know (I've counted) that there are more posts than the findAll function returns, regardless of which parser I use: lxml, html5lib or html.parser. In one case it returns 43 out of 67 posts, in another 10 out of 14. Typically it finds about 3 or 4 posts, then skips 4 or 5, then finds a few again, and so on.
How can I find out why this is happening?
#!/usr/bin/env python
# coding: utf-8
# Import
import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
# Get credentials to log in to LinkedIn
username = input('Enter your linkedin username: ')
password = input('Enter your linkedin password: ')
company_name = input('Name of the company: ')
# Access Webdriver
s = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=s)
browser.maximize_window()
# Define page to open
page = "https://www.linkedin.com/company/{}/posts/?feedView=all".format(company_name)
# Open login page
browser.get('https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')
# Enter login info (find_element_by_id was removed in Selenium 4.3, so use find_element with By.ID)
elementID = browser.find_element(By.ID, 'username')
elementID.send_keys(username)
elementID = browser.find_element(By.ID, 'password')
elementID.send_keys(password)
elementID.submit()
# Go to webpage (page already ends in the posts path, so nothing is appended)
browser.get(page)
# Define scrolling time
SCROLL_PAUSE_TIME = 1.5
# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")
# Scroll all the way to the bottom of the page
while True:
    # Scroll down to bottom
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Get content of page
content = browser.page_source.encode('utf-8').strip()
# Create soup
linkedin_soup = bs(content, "html5lib")
linkedin_soup.prettify()
# Find entities that contain posts
containers = linkedin_soup.findAll("div",{"class":"occludable-update ember-view"})
CodePudding user response:
The issue is that when you scroll straight down to the bottom, the page skips rendering some of the posts. There is likely a better way to do this, but basically I scroll 1/4 of the way, then 1/2, then all the way down (hoping to catch all the posts). Try this adjustment:
# Scroll all the way to the bottom of the page
while True:
    # Scroll down in stages: a quarter, a half, then the full page height
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/4);")
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
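To check whether skipped rendering really is the cause, it may help to log how many post containers exist in the DOM after each scroll step. Below is a rough sketch of that idea, not a verified fix. The function name count_containers_per_step and its parameters are made up for illustration, and the selector div.occludable-update is only assumed from the class name used in the question. It works with any object that has Selenium's execute_script method.

```python
import time


def count_containers_per_step(browser, step=1000, pause=1.5, sleep=time.sleep,
                              selector="div.occludable-update"):
    """Scroll down in fixed steps and record (position, container count) pairs.

    `browser` is assumed to be a Selenium WebDriver (anything that offers
    execute_script works). `selector` is an assumption based on the class
    name in the question and may need adjusting.
    """
    counts = []
    position = 0
    while True:
        # The page can grow while scrolling, so re-read its height every time
        max_height = browser.execute_script("return document.body.scrollHeight")
        if position > max_height:
            break
        # Scroll to the next position and give the page time to render
        browser.execute_script("window.scrollTo(0, arguments[0]);", position)
        sleep(pause)
        # Count the containers currently present in the DOM
        found = browser.execute_script(
            "return document.querySelectorAll(arguments[0]).length;", selector)
        counts.append((position, found))
        position += step
    return counts
```

Calling count_containers_per_step(browser) after logging in and opening the posts page returns a list of pairs. If the count stalls or jumps between steps, that hints at which scroll positions the page fails to render.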
CodePudding user response:
So @chitown88 got me on the right track. This is the final code I have now, which gets me the result I need:
# Define scrolling height and time
SCROLL_PAUSE_TIME = 1.5  # [sec]
SCROLL_HEIGHT = 1000  # [px]
# Pause to be sure page is loaded
time.sleep(SCROLL_PAUSE_TIME)
# Scroll all the way to the bottom of the page
new_height = SCROLL_HEIGHT
while True:
    # Get maximal scroll height
    max_height = browser.execute_script("return document.body.scrollHeight")
    # Check whether maximal scroll height has been exceeded
    if new_height > max_height:
        break
    # Scroll to position
    browser.execute_script("window.scrollTo(0, {});".format(new_height))
    time.sleep(SCROLL_PAUSE_TIME)
    # Get current scroll position
    #current_height = browser.execute_script("return window.pageYOffset")
    # Increase scroll position
    new_height = new_height + SCROLL_HEIGHT
# Make sure to reach last position
browser.execute_script("window.scrollTo(0, {});".format(max_height))
I left the commented-out current_height line in, since I'm not sure whether I'll need it again. This code still needs some more verification, but it may be useful to keep.
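If posts only stay rendered while they are near the viewport, another option might be to parse the page after every scroll step and merge the results, rather than parsing once at the end. The sketch below is untested against LinkedIn itself. The function scroll_and_collect and the extract callback are hypothetical names; extract is whatever function you write to turn page_source into a dict of posts keyed by something stable (which attribute to use as the key is left open here).

```python
import time


def scroll_and_collect(browser, extract, step=1000, pause=1.5, sleep=time.sleep):
    """Scroll in fixed steps and merge the posts seen at every step.

    `browser` is assumed to be a Selenium WebDriver (it needs execute_script
    and page_source). `extract` is a caller-supplied function that takes the
    page source and returns a dict mapping a stable post key to its content.
    """
    collected = {}
    position = step
    while True:
        # Scroll to the next position and give the page time to render
        browser.execute_script("window.scrollTo(0, arguments[0]);", position)
        sleep(pause)
        # Keep the first version seen of every post
        for key, value in extract(browser.page_source).items():
            collected.setdefault(key, value)
        # Stop once the current position has reached the bottom of the page
        max_height = browser.execute_script("return document.body.scrollHeight")
        if position >= max_height:
            break
        position += step
    return collected
```

With BeautifulSoup, extract could wrap the findAll call from the question and build the dict from the containers it returns.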