How to Fetch a Webpage Through a TCP Socket Using an HTTP Request in Java

Time:12-28

This is my college assignment: fetch a web page from any web server by URL, using a TCP socket and an HTTP "GET" request. I am not getting an HTTP/1.0 200 OK response from any server.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.InetAddress;
import java.net.Socket;
import java.net.URL;
import java.util.Scanner;
import java.net.*;
public class DCCN042 {

    public static void main(String[] args) {
        Scanner inpt = new Scanner(System.in);
        System.out.print("Enter URL: ");
        String url = inpt.next();
        TCPConnect(url);
    }

    public static void TCPConnect(String url) {
        try {
            String hostname = new URL(url).getHost();
            System.out.println("Loading contents of Server: " + hostname);
            InetAddress ia = InetAddress.getByName(hostname);
            String ip = ia.getHostAddress();
            System.out.println(ip + " is IP Adress for  " + hostname);
            String path = new URL(url).getPath();
            System.out.println("Requested Path on the server: " + path);
            Socket socket = new Socket(ip, 80);
            // Create input and output streams to read from and write to the server
            PrintStream out = new PrintStream(socket.getOutputStream());
            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            // Follow the HTTP protocol of GET <path> HTTP/1.0 followed by an empty line
            // Compare string contents with equals(); != only compares object references
            if (!hostname.equals(url)) {
                //Request Line
                out.println("GET " + path + " HTTP/1.1");
                out.println("Host: " + hostname);
                //Header Lines
                out.println("User-Agent: Java/13.0.2");
                out.println("Accept-Language: en-us");
                out.println("Accept: */*");
                out.println("Connection: keep-alive");
                out.println("Accept-Encoding: gzip, deflate, br");
                // Blank Line
                out.println();
            } else {
                //Request Line
                out.println("GET / HTTP/1.0");
                out.println("Host: " + hostname);
                //Header Lines
                out.println("User-Agent: Java/13.0.2");
                out.println("Accept-Language: en-us");
                out.println("Accept: */*");
                out.println("Connection: keep-alive");
                out.println("Accept-Encoding: gzip, deflate, br");
                // Blank Line
                out.println();
            }
            // Read data from the server until we finish reading the document
            String line = in.readLine();
            while (line != null) {
                System.out.println(line);
                line = in.readLine();
            }
            // Close our streams
            in.close();
            out.close();
            socket.close();
        } catch (Exception e) {
            System.out.println("Invalid URl");
            e.printStackTrace();
        }
    }
}

I create a TCP socket, passing it the IP address returned by the InetAddress method getHostAddress() and port 80 for the web server. I use getHost() and getPath() to separate the hostname and path from the URL, and use that same path and hostname in the HTTP GET request. The response from the server:

Enter URL: https://stackoverflow.com/questions/33015868/java-simple-http-get-request-using-tcp-sockets
    Loading contents of Server: stackoverflow.com
    151.101.65.69 is IP Adress for  stackoverflow.com
    Requested Path on the server: /questions/33015868/java-simple-http-get-request-using-tcp-sockets
    HTTP/1.1 301 Moved Permanently
    cache-control: no-cache, no-store, must-revalidate
    location: https://stackoverflow.com/questions/33015868/java-simple-http-get-request-using-tcp-sockets
    x-request-guid: 5f2af765-40c2-49ca-b9a1-daa321373682
    feature-policy: microphone 'none'; speaker 'none'
    content-security-policy: upgrade-insecure-requests; frame-ancestors 'self' https://stackexchange.com
    Accept-Ranges: bytes
    Transfer-Encoding: chunked
    Date: Mon, 27 Dec 2021 15:00:17 GMT
    Via: 1.1 varnish
    Connection: keep-alive
    X-Served-By: cache-qpg1263-QPG
    X-Cache: MISS
    X-Cache-Hits: 0
    X-Timer: S1640617217.166650,VS0,VE338
    Vary: Fastly-SSL
    X-DNS-Prefetch-Control: off
    Set-Cookie: prov=149aa0ef-a3a6-8001-17c1-128d6d4b7273; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
    
    0

My requirement is to get the HTML source code of this web page, with an HTTP/1.0 200 OK response.

Answer:

This happens because you are using a plain Socket with a hardcoded port 80. Whether the URL you enter starts with http or https, the request always goes out over the insecure, unencrypted HTTP protocol.
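As a rough illustration of avoiding the hardcoded port, the scheme, host, port and path can all be derived from the URL the user typed. The sketch below is only one possible approach: the class UrlTarget and its parse method are hypothetical names that do not appear in the question, and the 80/443 defaults are the standard ports for http and https.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not part of the question's code): holds the pieces
// of a URL that a raw socket client needs.
class UrlTarget {
    final String scheme;
    final String host;
    final int port;
    final String path;

    UrlTarget(String scheme, String host, int port, String path) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
        this.path = path;
    }

    static UrlTarget parse(String url) {
        URI uri = URI.create(url.trim());
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected an absolute http(s) URL: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);

        // getPort() returns -1 when the URL does not name a port explicitly,
        // so fall back to the default port of the scheme.
        int port = uri.getPort();
        if (port == -1) {
            if (scheme.equals("https")) {
                port = 443;
            } else if (scheme.equals("http")) {
                port = 80;
            } else {
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
            }
        }

        // An empty path must be sent as "/" in the request line.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return new UrlTarget(scheme, uri.getHost(), port, path);
    }

    public static void main(String[] args) {
        UrlTarget t = parse("https://stackoverflow.com/questions/33015868/java-simple-http-get-request-using-tcp-sockets");
        System.out.println(t.scheme + " " + t.host + " " + t.port + " " + t.path);
    }
}
```

With something like this in place, an https URL maps to port 443 and an http URL to port 80, instead of every request going to port 80.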

In this situation, the server is telling you that it will not serve this page over insecure HTTP and that you must use HTTPS instead. It therefore responds with 301 Moved Permanently (which just means "use this URL, not the original one"), with the Location header pointing to the correct URL, the https one.

So the Location in the 301 response looks like the same URL you entered, but it is not the URL you actually requested: your code always sends the request over http on port 80, and the server is redirecting you to the https version.
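To make the redirect visible in code rather than by reading the raw output, the status line and the Location header can be picked out of the response. The following is a minimal sketch under the assumption that the response head has already been read into a status line and a list of header lines; RedirectCheck, statusCode, headerValue and isRedirect are hypothetical names, and the sample lines are copied from the output in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the question's code).
class RedirectCheck {

    // Extracts the numeric code from a status line such as
    // "HTTP/1.1 301 Moved Permanently".
    static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            throw new IllegalArgumentException("Not an HTTP status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }

    // Header names are case-insensitive, so "location" and "Location" both match.
    // Returns null when the header is absent.
    static String headerValue(List<String> headerLines, String name) {
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase(name)) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    static boolean isRedirect(int code) {
        return code == 301 || code == 302 || code == 303 || code == 307 || code == 308;
    }

    public static void main(String[] args) {
        String statusLine = "HTTP/1.1 301 Moved Permanently";
        List<String> headers = Arrays.asList(
                "cache-control: no-cache, no-store, must-revalidate",
                "location: https://stackoverflow.com/questions/33015868/java-simple-http-get-request-using-tcp-sockets");

        int code = statusCode(statusLine);
        if (isRedirect(code)) {
            System.out.println("Redirected to: " + headerValue(headers, "Location"));
        }
    }
}
```

A client could use a check like this to notice that the target starts with https and retry the request over TLS.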

To make your code work with https, instead of a plain Socket use this:

import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;
SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
SSLSocket socket = (SSLSocket) factory.createSocket(ia, 443);

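Note that I am passing ia (the InetAddress resolved from the hostname) rather than the ip string. HTTPS depends on the domain name: the certificate is issued for the domain, not for the IP address. If you connect using only the bare IP, the TLS handshake or the certificate check can fail; a mismatch of this kind typically surfaces as an SSLHandshakeException, not a CertificateExpiredException (which is about expired certificates).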

Now, whether to use Socket or SSLSocket is something you will have to decide programmatically, depending on the user input.
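One way to make that decision is sketched below. It is an illustration under stated assumptions, not a definitive implementation: RawHttpFetch, openSocket, buildRequest and fetch are hypothetical names, the query string of the URL is ignored for brevity, and the request sends "Connection: close" and no Accept-Encoding header (unlike the question's code) so that the server closes the connection when it is done and the body is not compressed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical sketch (names are not from the question or the answer).
class RawHttpFetch {

    // Plain Socket for http, TLS socket for https.
    static Socket openSocket(String scheme, String host, int port) throws IOException {
        if ("https".equalsIgnoreCase(scheme)) {
            // Pass the hostname, not the IP string, so the TLS layer
            // knows which domain is being requested.
            return SSLSocketFactory.getDefault().createSocket(host, port);
        }
        return new Socket(host, port);
    }

    // HTTP lines end with CRLF; println() would use the platform line separator.
    static String buildRequest(String host, String path) {
        String target = (path == null || path.isEmpty()) ? "/" : path;
        return "GET " + target + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Accept: */*\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    static void fetch(String url) throws IOException {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            throw new IllegalArgumentException("Expected an absolute http(s) URL: " + url);
        }
        boolean https = "https".equalsIgnoreCase(scheme);
        int port = uri.getPort() != -1 ? uri.getPort() : (https ? 443 : 80);

        try (Socket socket = openSocket(scheme, host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, uri.getRawPath()).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            String line;
            // With "Connection: close" the server closes the socket after the
            // response, so readLine() eventually returns null.
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Usage: java RawHttpFetch <url>");
            return;
        }
        fetch(args[0]);
    }
}
```

Because the request uses HTTP/1.1, the server may still send the body with Transfer-Encoding: chunked, in which case chunk-size lines will appear between pieces of the HTML; decoding them is left out of this sketch.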
