I am trying to scrape tweets under a hashtag using Python and Selenium, and I use the following code to scroll down:
driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
The problem is that Selenium only scrapes the visible tweets (just 3 of them), then scrolls to the end of the page, loads more tweets, and scrapes 3 new ones, missing a lot of tweets in between.
Is there a way to show all tweets, then scroll down and show all new tweets, or at least some new ones (I have a mechanism to filter already-scraped tweets)?
Note: I'm running my script on a GCP VM, so I can't rotate the screen.
I think I could make the script keep pressing the down-arrow key; that way I could display tweets one by one, scrape them, and keep loading more. But I suspect this would slow the scraper down a lot.
CodePudding user response:
Scroll down the page by a fixed number of pixels at a time, so the page gets time to load the data. Try the code below:
from time import sleep

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollBy(0, 800);")  # increase or decrease the scroll step ('800') as needed
    sleep(1)  # give the page time to load new tweets
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
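To avoid missing tweets between scroll steps, you can scrape inside the loop above and deduplicate as you go (the question mentions such a filter already exists). Below is a minimal sketch of the deduplication helper; `extract_tweets(driver)` in the commented usage is a hypothetical function that returns the tweet texts currently in the DOM, not a Selenium API.

```python
def filter_new(tweets, seen):
    """Return only tweets not seen before, and remember them in `seen`."""
    fresh = [t for t in tweets if t not in seen]
    seen.update(fresh)
    return fresh

# Usage inside the incremental-scroll loop (requires a live driver;
# `extract_tweets` is a placeholder for your own DOM-scraping code):
#
# seen = set()
# collected = []
# last_height = driver.execute_script("return document.body.scrollHeight")
# while True:
#     collected += filter_new(extract_tweets(driver), seen)
#     driver.execute_script("window.scrollBy(0, 800);")
#     sleep(1)
#     new_height = driver.execute_script("return document.body.scrollHeight")
#     if new_height == last_height:
#         break
#     last_height = new_height
```

Scraping on every iteration keeps the step size small enough that no batch of tweets is skipped, while the `seen` set prevents duplicates.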
CodePudding user response:
To scroll down the page in Selenium, we can scroll to a given element's coordinates:

driver.execute_script(
    "window.scrollTo(" + str(data.location["x"]) + ", " + str(data.location["y"]) + ")")

Here data is the tweet element that we retrieved; element.location in Selenium is a dict with "x" and "y" keys.
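Building the JavaScript string by concatenation is easy to get wrong, so a small helper can make it safer to construct. This is a sketch; `scroll_to_js` is a hypothetical helper, not part of Selenium:

```python
def scroll_to_js(location):
    """Build a window.scrollTo(...) call from an element's .location dict."""
    return "window.scrollTo(%d, %d)" % (location["x"], location["y"])

# Usage with a live driver and a tweet WebElement:
# driver.execute_script(scroll_to_js(tweet.location))
#
# Alternatively, Selenium can scroll an element into view directly:
# driver.execute_script("arguments[0].scrollIntoView(true);", tweet)
```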