URL.openStream() gives back non-HTML code when it is read by a BufferedReader

Time:12-10

I know the title is a little confusing, but basically I'm trying to scrape the HTML of a YouTube video page to get its view count, and I'm using URL.openStream() to do so. When I do that, it gives me back this code, which I've pasted into a notepad so that it's easier to see. When I search for it in the website's HTML with inspect element, it's nowhere to be found, and when I search what I read for the view count with String.contains(), that doesn't give any result either. How can I get the HTML while still using InputStream and BufferedReader, if that's possible at all?

Here's my code

package Project;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
import java.io.File;

public class WebScraper
{

  public static void main(String[] args) throws IOException
    {
      
      URL website = new URL("https://www.youtube.com/watch?v=9h5JC-GLR6g");
      BufferedReader in = new BufferedReader(new InputStreamReader(website.openStream()));
      
      int c = 0;
      ArrayList<String> code = new ArrayList<String>();

      String s;
      // Read each line exactly once; calling readLine() in both the loop
      // condition and the body would silently discard every other line.
      while((s = in.readLine()) != null)
      {
        code.add(s);
        c++;
      }
      in.close();
      
      int i = 0;
      while(i < code.size())
      {
          if(code.get(i) != null && code.get(i).contains("896K views"))
              System.out.println(i + " " + code.get(i));

          System.out.println(code.get(i));
          i++;
      }
      //System.out.println(code);
      System.out.println("Lines of code: " + c);
  
  }

}

CodePudding user response:

That is HTML. The <script> tag, specifically.

A browser will download a bunch of HTML and then 'render' it. If <script> tags are part of the HTML (and right now, in 2022? script tags are... they are everywhere), they will be executed. And that JavaScript can and generally will make some calls of its own, get some JSON back or whatnot, and then create all sorts of HTML elements on the fly and inject them into the page.

When you use 'inspect element', you see the state of the page (the 'DOM') as it is after all that javascript runs.
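
To see that difference for yourself, something like the following minimal sketch can help. It uses a made-up HTML string as a stand-in for a server response (the markup, the class name RawHtmlCheck and the method countScriptTags are assumptions for illustration, not YouTube's actual output). It shows that the raw text can contain several <script> tags while the text you see in the browser is simply not there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RawHtmlCheck {
    // Counts opening <script> tags in the markup exactly as the server sent it.
    static int countScriptTags(String html) {
        Matcher m = Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a server response: the visible text is
        // produced later by JavaScript, so it never appears in the raw markup.
        String raw = "<html><body><div id=\"info\"></div>"
                + "<script>render(loadData());</script>"
                + "<script src=\"app.js\"></script></body></html>";

        System.out.println("script tags: " + countScriptTags(raw));
        System.out.println("contains rendered text: " + raw.contains("896K views"));
    }
}
```

Running the same two checks on the text read by the BufferedReader in the question would probably show the same pattern: plenty of script tags, and no match for the rendered view count.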

The only way to get from 'the HTML + JS + CSS that the server sent to me' to 'the DOM as inspect element shows it' is to run all that JavaScript.

Which is incredibly complicated. You pretty much need a browser to do this.

Hence, generally, what you want just does not work. This is exactly why services like youtube have APIs. Because trying to 'read' the pages intended for human eyeballs is borderline impossible, and even if you manage it (there are hacks, very complicated ones), if youtube restyles a few things - poof, there goes your app.

The hack, which you really shouldn't use, is to actually fire up a browser, control that browser from your Java app, and ask it to stream the DOM that you can see with 'inspect element', after running all that JavaScript, back to your Java process. This exists, but it is intended for testing your own client-side code: Selenium. Because it really runs a browser, it is incredibly inefficient: if you want to parse 1000 YouTube links simultaneously, you'd better have one heck of a beefy box, because you effectively end up running 1000 browsers concurrently.

So, given that this is way too complicated, there's really only one thing you can do:

Search the web for 'youtube api'. If there isn't one, you're done. What you want is not possible unless you are willing to hack it, spend every week updating it as youtube changes what it looks like, fight legal battles, and in general spend weeks developing it all, not to mention become an expert because none of this is easy at all. If there is an API, that's good news: Read all about it, and use that to figure this stuff out instead of trying to HTML-parse youtube.com itself.
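
As a rough idea of what the API route could look like, here is a minimal sketch. It assumes the YouTube Data API v3 `videos` endpoint queried with `part=statistics`, that the JSON response carries the count as a string field named `viewCount`, and that you have created an API key. The class name ViewCountSketch, the helper names and the YOUTUBE_API_KEY environment variable are made up for this example. Check the official API documentation before relying on any of this.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ViewCountSketch {
    // Builds a videos.list request URL that asks only for the statistics part.
    static String buildUrl(String videoId, String apiKey) {
        return "https://www.googleapis.com/youtube/v3/videos?part=statistics"
                + "&id=" + URLEncoder.encode(videoId, StandardCharsets.UTF_8)
                + "&key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
    }

    // Pulls "viewCount": "<digits>" out of the JSON body. A regex keeps this
    // sketch dependency-free; a real program should use a proper JSON parser.
    static OptionalLong extractViewCount(String json) {
        Matcher m = Pattern.compile("\"viewCount\"\\s*:\\s*\"(\\d+)\"").matcher(json);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    // Performs a plain GET and returns the response body as text (Java 11+).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical variable name; store your own API key however you prefer.
        String apiKey = System.getenv("YOUTUBE_API_KEY");
        if (apiKey == null) {
            System.out.println("Set YOUTUBE_API_KEY to run the live request.");
            return;
        }
        String json = fetch(buildUrl("9h5JC-GLR6g", apiKey));
        OptionalLong views = extractViewCount(json);
        System.out.println(views.isPresent()
                ? "Views: " + views.getAsLong()
                : "No viewCount in response");
    }
}
```

The point of the design is that the program reads a structured field instead of searching page markup, so a redesign of the watch page should not break it.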

CodePudding user response:

This is because many sites (YouTube included) serve pages that are rendered dynamically in the browser, so the raw response does not contain the HTML that you are looking for. You will need to use Selenium or some other tool for your use case.
