What is the fastest / most lightweight way of getting HTML after JavaScript has executed?

Time:07-29

The problem is that the YouTube API for searching is very limiting, so I've resorted to scraping the search results page. So far I've tried using Selenium to load the page and get the HTML, but it has quite a bit of delay when starting up.

Without JavaScript, the YouTube search results page is not generated properly, so I can't just send a GET request to the URL.

Is there any other way to get the rendered search results page?

My code right now:

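    from urllib.parse import quote_plus

    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.by import By

    # Method of the scraper class (class definition omitted)
    def search(self, query):
        try:
            # URL-encode the query so spaces and special characters are safe
            url = 'https://www.youtube.com/results?search_query={}'.format(quote_plus(str(query)))
            self.driver.get(url)

            # Wait until at least one result title is visible
            self.wait.until(self.visible((By.ID, "video-title")))
            elements = self.driver.find_elements(By.XPATH, '//*[@id="video-title"]')
            return [[element.text, element.get_attribute('href')] for element in elements]
        except WebDriverException:
            # Timeouts and other driver errors yield an empty result list
            return []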

This is part of a class that reuses the same Selenium instance until the program shuts down.

CodePudding user response:

The fastest way with Selenium is to use the "eager" page load strategy and wait for the selector.

But in my experience you can probably get around 2x faster by switching to Playwright (async).
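
For the Selenium route, a minimal sketch of the "eager" strategy might look like the following. The helper names (`build_search_url`, `make_eager_driver`, `search`) and the 10-second timeout are assumptions made for illustration, not part of the original code, and Chrome is assumed as the browser. Selenium is imported inside the functions only so that the sketch can be loaded without it installed.

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """Return the YouTube results URL for a URL-encoded query."""
    return "https://www.youtube.com/results?search_query=" + quote_plus(str(query))


def make_eager_driver():
    """Create a Chrome driver whose get() returns at DOMContentLoaded."""
    # Imported here so this sketch can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # "eager": do not wait for images, stylesheets and subframes to finish.
    options.page_load_strategy = "eager"
    return webdriver.Chrome(options=options)


def search(driver, query, timeout=10):
    """Return [title, href] pairs from the rendered results page."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(build_search_url(query))
    # Block only until the first result title is visible.
    WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.ID, "video-title"))
    )
    elements = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
    return [[el.text, el.get_attribute("href")] for el in elements]
```

The driver would be created once with `make_eager_driver()` and passed to `search` for every query, so the startup cost is still paid only once.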

CodePudding user response:

If you curl https://www.youtube.com/results?search_query=test, you will see that the result data you are looking for is embedded in the page as the JavaScript variable ytInitialData. I would recommend simply fetching this HTML and parsing that variable. That way you don't need a browser automation tool such as Selenium, which is particularly slow and isn't required here.

Note: I am developing an open-source alternative to the YouTube Data API v3 using this method. I have an endpoint similar to what you are looking for by the way.
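
A rough sketch of the ytInitialData approach, using only the standard library, could look like this. It assumes the page contains an assignment of the form `ytInitialData = {...};` and that each result sits under a `videoRenderer` key with `videoId` and `title.runs` fields. That layout is undocumented and may change, so treat the key names and the helper names (`extract_yt_initial_data`, `find_video_renderers`, `parse_results`, `search`) as assumptions.

```python
import json
import re
import urllib.request
from urllib.parse import quote_plus


def extract_yt_initial_data(html):
    """Return the ytInitialData object embedded in the page, or None."""
    match = re.search(r"ytInitialData\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores what follows it
        # (the trailing ';' and the rest of the script).
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return data


def find_video_renderers(node):
    """Yield every dict stored under a 'videoRenderer' key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoRenderer" and isinstance(value, dict):
                yield value
            else:
                yield from find_video_renderers(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_video_renderers(item)


def parse_results(html):
    """Return [title, url] pairs found in the page's ytInitialData."""
    results = []
    for video in find_video_renderers(extract_yt_initial_data(html)):
        runs = video.get("title", {}).get("runs", [])
        title = "".join(run.get("text", "") for run in runs)
        video_id = video.get("videoId")
        if video_id:
            results.append([title, "https://www.youtube.com/watch?v=" + video_id])
    return results


def search(query, timeout=10):
    """Fetch the results page with a plain GET request and parse it."""
    url = "https://www.youtube.com/results?search_query=" + quote_plus(str(query))
    # Assumption: a browser-like User-Agent is sent with the request.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_results(html)
```

`search("test")` returns the same `[title, href]` shape as the Selenium version, but without starting a browser.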
