Python Socket only returns Response header instead of HTML

Time:09-07

I want to extract links from a website's JavaScript file. I'm trying to fetch the file with sockets, but I always get only the response headers and not the actual JS/HTML. Here's what I'm using:

import socket
import ssl

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cont = ssl.create_default_context()
sock.connect(('blog.clova.line.me', 443))
sock = cont.wrap_socket(sock, server_hostname = 'blog.clova.line.me')
sock.sendall('GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n'.encode())
resp = sock.recv(2048)
print(resp.decode('utf-8'))

It returns only the response headers:

HTTP/1.1 200 OK
Date: Tue, 06 Sep 2022 12:02:38 GMT
Content-Type: application/javascript
Transfer-Encoding: chunked
Connection: keep-alive
CF-Ray: 74670e8b9b594c2f-SIN
Age: 3444278
...

I have tried the following:

  1. Setting a Content-Type: text/plain; charset=utf-8 header
  2. Changing the request line to GET https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1

I have searched for related questions, and it seems that other people are able to receive the HTML data after the response headers, whereas I only receive the headers and no body. The same URL works fine with requests:

resp = requests.get('https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js')
print(resp.text)

How can I achieve a similar result using a socket? Honestly, I don't like using third-party modules, which is why I'm not using requests.

CodePudding user response:

The response is just truncated: sock.recv(2048) reads at most 2048 bytes in a single call and returns whatever has arrived so far, which here is only the headers. If you keep calling recv, you will see the body after the headers.

Anyway, I wouldn't recommend doing this with such a low-level library.
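As a minimal sketch of "reading more", the loop below keeps calling recv until it returns an empty bytes object, which signals that the peer closed the connection. The helper name read_until_closed and the bufsize default are illustrative, not part of the question's code. It assumes the request also carries a Connection: close header; with keep-alive the server leaves the connection open and the loop would block until a timeout.

```python
import socket


def read_until_closed(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)  # returns at most bufsize bytes, often fewer
        if not chunk:  # b"" means the peer closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

In the question's code this would replace the single sock.recv(2048) call, with "Connection: close\r\n" added to the request headers.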

"Honestly, I don't like using third-party modules, which is why I'm not using requests."

If your point is to stick to the Python standard library, you can use urllib.request, which provides more abstraction than socket:

import urllib.request  # "import urllib" alone does not load the request submodule

resp = urllib.request.urlopen('…')
print(resp.read())

CodePudding user response:

From the Python Socket Programming HOWTO:

Now we come to the major stumbling block of sockets - send and recv operate on the network buffers. They do not necessarily handle all the bytes you hand them (or expect from them), because their major focus is handling the network buffers. In general, they return when the associated network buffers have been filled (send) or emptied (recv). They then tell you how many bytes they handled. It is your responsibility to call them again until your message has been completely dealt with.

I've rewritten your code and added a receive_all function that keeps reading until the whole response has arrived. (Of course, it's a naive implementation.)

import socket
import ssl

request_text = (
    "GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js "
    "HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n"
)

host_name = "blog.clova.line.me"


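def receive_all(sock):
    # Naive: assumes a chunked body, which ends with the zero-size chunk.
    terminator = b"0\r\n\r\n"
    data = b""
    while not data.endswith(terminator):
        chunk = sock.recv(2048)
        if not chunk:  # connection closed before the terminator arrived
            break
        data += chunk
    return data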



cont = ssl.create_default_context()
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    with cont.wrap_socket(sock, server_hostname=host_name) as ssock:
        ssock.connect((host_name, 443))
        ssock.sendall(request_text.encode())

        resp = receive_all(ssock)
        print(resp.decode("utf-8"))
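Because the server answers with Transfer-Encoding: chunked, the bytes collected above still contain the hexadecimal chunk-size lines mixed into the body. The following is a rough sketch of stripping them; split_response and decode_chunked are hypothetical helper names, and the code assumes a complete, well-formed response with no trailer headers.

```python
def split_response(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw HTTP response into header bytes and body bytes."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body


def decode_chunked(body: bytes) -> bytes:
    """Join the data of each chunk in a chunked HTTP/1.1 body."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; drop optional extensions after ';'.
        size = int(body[pos:line_end].split(b";", 1)[0], 16)
        if size == 0:  # the zero-size chunk marks the end of the body
            break
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(decoded)
```

With the code above, this could be used as head, body = split_response(resp) followed by print(decode_chunked(body).decode("utf-8")).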