I'm writing a web scraper and I'm at the point where I need to parse the incoming data. Everything was going fine until I had to find all instances of a substring in a string. I got something working, but it doesn't give me the full string I want (which is a complete <p></p> tag).
done = 0;
while (done == 0) {
    if ((findSpan = strstr(serverResp, "<p")) != NULL) {
        printf("%s\n", findSpan);
        if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
            strcpy(serverResp, findSpanEnd);
            strcpy(findSpanEnd + 4, "");
            printf("after end tag formatting %s\n", findSpan);
        }
    } else {
        done = 1;
    }
}
The "after end tag formatting" line should print something along the lines of <p>insert text here</p>, but instead I get something like this:
<p>This should be printed</p>
<h3>ignore</h3>
<p>and so should this</p>
</body>
</html>
after end tag formatting <p>This should be printed</p>
<h3>ignore</h3>
<p>and so should this</p>
</body>
</html>
after end tag formatting dy>
</html>
The site's code looks like this:
<!DOCTYPE html>
<html>
<head></head>
<body>
<h1>ignore this</h1>
<p>This should be printed</p>
<h3>ignore</h3>
<p>and so should this</p>
</body>
</html>
CodePudding user response:
if ((findSpanEnd = strstr(findSpan, "</p>")) != NULL) {
strcpy(serverResp, findSpanEnd);
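This makes no sense. strstr finds "</p>" as requested, but you can't pass the result to strcpy like that. strstr doesn't allocate a new string at all; it only returns a pointer to the match's location within the old one. So strcpy(serverResp, findSpanEnd) copies everything from "</p>" onward over the beginning of the same buffer. That doesn't extract the tag, and if the source and destination regions overlap, the behavior is undefined.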
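To make the "location within the old string" point concrete, here is a minimal sketch. The helpers match_offset and copy_range are hypothetical names invented for this illustration, not standard functions: the first shows that the result of strstr is just an address inside the buffer you searched, and the second shows one way to get an independent copy of part of that buffer when you actually need one.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: offset of the first match of needle inside haystack,
 * or -1 if there is none. The subtraction is only meaningful because strstr
 * returns a pointer into haystack itself rather than a new string. */
ptrdiff_t match_offset(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? hit - haystack : -1;
}

/* Hypothetical helper: returns a freshly malloc'd, NUL-terminated copy of the
 * len bytes starting at start, or NULL if allocation fails.
 * The caller is responsible for calling free() on the result. */
char *copy_range(const char *start, size_t len)
{
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, start, len);   /* copy exactly len bytes */
    out[len] = '\0';           /* terminate the copy ourselves */
    return out;
}
```

With these, copy_range(findSpan, findSpanEnd + 4 - findSpan) would give a separate string holding one whole tag, assuming both pointers came from strstr on the same buffer, and serverResp would be left untouched.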
A routine to print the contents of every <p> tag would look like this (note that this assumes no nested <p> tags):
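for (char *ptr = serverResp; (ptr = strstr(ptr, "<p")) != NULL;)
{
    char *finger = strchr(ptr, '>');    /* end of the opening tag */
    if (!finger) break;
    ++finger;                           /* step past '>' to the tag's content */
    ptr = strstr(finger, "</p>");
    if (!ptr) {
        /* no closing tag: print the rest and stop, so that
           strstr is never called with a NULL pointer */
        fwrite(finger, 1, strlen(finger), stdout);
        fputs("\r\n", stdout);
        break;
    }
    fwrite(finger, 1, ptr - finger, stdout);
    fputs("\r\n", stdout);
}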
The technique: the call to strstr in the for loop locates the next <p> tag, strchr finds the end of it, then another strstr finds the closing </p>. Because the returned pointers point into the originating string, the text between them is not NUL-terminated, so we use fwrite with an explicit length instead of printf to produce output.
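The loop above prints only the text between the tags, while the question asks for the whole <p></p> element. Below is a rough sketch of one way to get that, under the same no-nesting assumption. The names next_paragraph and print_paragraphs are hypothetical, made up for this example. It also checks the character after "<p" so that tags such as <pre> are skipped, which is an addition of this sketch rather than something the loop above does.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the next <p ...>...</p> element at or after from.
 * On success it stores the element's start and length (tags included) and
 * returns 1; it returns 0 when no complete element remains.
 * Assumes <p> tags are not nested. */
int next_paragraph(const char *from, const char **start, size_t *len)
{
    const char *open = from;
    while ((open = strstr(open, "<p")) != NULL) {
        unsigned char after = (unsigned char)open[2];
        /* Accept "<p>" and "<p attr=...>", but skip tags such as <pre>. */
        if (after == '>' || isspace(after))
            break;
        open += 2;
    }
    if (open == NULL)
        return 0;

    const char *close = strstr(open, "</p>");
    if (close == NULL)
        return 0;

    *start = open;
    *len = (size_t)(close - open) + strlen("</p>");
    return 1;
}

/* Prints every <p> element in html to out, one per line,
 * and returns how many elements were found. */
int print_paragraphs(const char *html, FILE *out)
{
    const char *start;
    size_t len;
    int count = 0;

    while (next_paragraph(html, &start, &len)) {
        /* "%.*s" prints at most len bytes, so no NUL terminator is needed. */
        fprintf(out, "%.*s\n", (int)len, start);
        html = start + len;    /* resume the search after this element */
        count++;
    }
    return count;
}
```

Calling print_paragraphs(serverResp, stdout) on the page from the question should print <p>This should be printed</p> and <p>and so should this</p> on separate lines, without modifying serverResp. The "%.*s" format is an alternative to fwrite for printing a slice that has no terminator of its own.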