Find out hyperlinks in HTML response using C without external libraries

Time:11-26

I need to find all the hyperlinks of the form <a href="xxxxxx"> in an HTML response without using external libraries. To do this, I use strtok(message, " \n<>") to split the response into tokens. If the previous token is a and the current token starts with href, I have found a link. However, the code does not work and appears to hit a memory error.

VS Code keeps showing messages like this:

Unable to open 'strlen.S': Unable to read file '/build/glibc-S7Ft5T/glibc-2.23/sysdeps/x86_64/strlen.S' (Error: Unable to resolve nonexistent file '/build/glibc-S7Ft5T/glibc-2.23/sysdeps/x86_64/strlen.S').

And in the terminal, it prints out:

[1] Done "/usr/bin/gdb" --interpreter=mi --tty=${DbgTerm} 0<"/tmp/Microsoft-MIEngine-In-fjtfcdmp.4qw" 1>"/tmp/Microsoft-MIEngine-Out-hlygokwk.gwa"

Here is my code:

// Omit the sending and receiving part
//
if(recv(tcpSocket, response_buf, sizeof(response_buf), 0) < 0){
        printf("Error with recv()\n");
    }
else {
        printf("Successfully receive response\n");
        // printf("Message received:\n %s\n", response_buf);
}
printf("===============================\n");
char * const dupstr = strdup(response_buf);
char* token = NULL;
token =  strtok(dupstr, " \n<>");

char last_token[2000];
char token_head[1000];
char a[] = "a";
char href[] = "href=";
while (token != 0) {
    strcpy(last_token, token);
    token = strtok(NULL, " \n<>");
    if(strcmp(last_token, a) == 0){
        size_t len = strlen(token);
        if(len>5){
            strncpy(token_head, token, 5);
            if(strcmp(token_head, href)== 0){
                printf("%s\n", token);
            }
        }
    }
}

If you need the whole source code for testing, here it is:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <errno.h>
#include <netdb.h>
#include <arpa/inet.h>
#define PORT 80

int main(){

    // char hostname[100];
    char ip[20];
    struct hostent* h;
    struct in_addr **addr_list;

    struct sockaddr_in serverAddr;
    char request[2000];
    char response_buf[10000];

    char hostname[] = "URL"; //enter hostname

    /* Create socket */
    int tcpSocket = socket(AF_INET, SOCK_STREAM, 0);
    if(tcpSocket < 0) {
        printf("ERROR opening socket\n");
        exit(-1);
    } else {
        printf("Socket opened successfully.\n");
    }

    /* use gethostbybname() to perform DNS lookup */
    if ((h = gethostbyname(hostname)) == NULL) {
        fprintf(stderr, "gethostbybname(): error");
        exit(-1);
    }
    printf("Get IP success!\n");
    /* Connect to server */
    addr_list = (struct in_addr **) h->h_addr_list; 
    printf("IP: %s\n", inet_ntoa(*addr_list[0]));
    serverAddr.sin_addr.s_addr = inet_addr(inet_ntoa(*addr_list[0]));
    serverAddr.sin_family = AF_INET;
    serverAddr.sin_port = htons(PORT);

    if(connect(tcpSocket, (struct sockaddr *) &serverAddr, sizeof(serverAddr)) < 0) {
        printf("Can't connect\n");
        exit(-1);
    }
    else{
        printf("Connected successfully\n");
    }

    /* Create request message */
    bzero(request, 2000);
    sprintf(request, "GET / HTTP/1.1\r\nHost: %s\r\n\r\n", hostname);
    printf("\nOur request: %s", request);

    if(send(tcpSocket, request, strlen(request), 0) < 0) {
        printf("Error with send()");
    }
    else {
        printf("Successfully sent html fetch request\n");
    }

    if(recv(tcpSocket, response_buf, sizeof(response_buf), 0) < 0){
        printf("Error with recv()\n");
    }
    else {
        printf("Successfully receive response\n");
        // printf("Message received:\n %s\n", response_buf);
    }
    printf("===============================\n");
    char * const dupstr = strdup(response_buf);
    char* token = NULL;
    token =  strtok(dupstr, " \n<>");

    char last_token[2000];
    char token_head[1000];
    char a[] = "a";
    char href[] = "href=";
    while (token != 0) {
        strcpy(last_token, token);
        token = strtok(NULL, " \n<>");
        if(strcmp(last_token, a) == 0){
            size_t len = strlen(token);
            if(len>5){
                strncpy(token_head, token, 5);
                if(strcmp(token_head, href)== 0){
                    printf("%s\n", token);
                }
            }

        }
    }
    return 0;
}

CodePudding user response:

You are single-stepping in the debugger and trying to step into a library function.

In your case, you are stepping into strlen(). The source files for such functions are usually not part of an installation, which is why the debugger reports that it cannot find strlen.S.

If you are debugging your program with single stepping, use "next" to step over function calls of library functions.

CodePudding user response:

The main problem I can see is that your code does not properly check the return value of strtok. You only check token at the beginning of the loop, but then you call strtok again inside the loop body. If that call returns NULL, the rest of the body still runs and passes NULL to strlen before the loop condition is checked again.

Move the strtok call to the end of the loop, like this:

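while (token != NULL) {
    /* last_token holds the previous token; it must be initialised before the loop */
    if (strcmp(last_token, a) == 0) {
        /* strncmp compares only the first 5 characters, so no temporary
           copy is needed (strncpy would leave token_head unterminated) */
        if (strlen(token) > 5 && strncmp(token, href, 5) == 0) {
            printf("%s\n", token);
        }
    }
    strcpy(last_token, token);
    token = strtok(NULL, " \n<>");
}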

...so that the check actually does what it is meant to do.

Alternatively, you can write the loop as a for loop:

for (token = strtok(dupstr, " \n<>"); token != NULL; token = strtok(NULL, " \n<>")) {
    if (strcmp(last_token, a) == 0) {
        /* strncmp avoids the unterminated copy that strncpy would produce */
        if (strlen(token) > 5 && strncmp(token, href, 5) == 0) {
            printf("%s\n", token);
        }
    }
    /* remember this token for the next iteration */
    strcpy(last_token, token);
}

Either way, you need to initialise last_token (for example, char last_token[2000] = "";), because until the first call to strcpy its contents are indeterminate and the first strcmp would read them.

You also never free dupstr. This isn't a problem at the moment, since the memory is reclaimed when the program exits, but it is a good habit to add the corresponding call to free whenever you allocate memory.
