My program loads news articles from the web, so I end up with an array of HTML documents representing these articles. I need to parse them and show only the relevant content on the screen. That includes converting all HTML escape sequences into readable symbols, so I need a function similar to unescape in JavaScript.
I know there are libraries in C to parse HTML.
But is there some easy way to convert HTML escape sequences like &amp; or &#33; to just & and !?
CodePudding user response:
Just wrote and tested a version that does this (crudely). Didn't take long.
You'll want something like this:
    typedef struct {
        int gotLen; // save myriad calls to strlen()
        char *got;
        char *want;
    } trx_t;

    trx_t lut[] = {
        { 5, "&amp;", "&" },
        { 5, "&#33;", "!" },
        { 8, "&dagger;", "*" },
    };
    const int nLut = sizeof lut/sizeof lut[0];
And then a loop with two pointers that copies characters within the same buf, sniffing for the '&' that triggers a search of the replacement table. If found, copy the replacement string to the destination and advance the source pointer to skip past the HTML token. If not found, then the LUT may need additional tokens.
Here's a beginning...
    void replace( char *buf ) {
        char *pd = buf, *ps = buf;

        while( *ps )
            if( *ps != '&' )
                *pd++ = *ps++; // ordinary character: copy and advance both pointers
            else {
                // EDIT: Credit @Craig Estey
                if( ps[1] == '#' ) {
                    if( ps[2] == 'x' || ps[2] == 'X' ) {
                        /* decode hex value and save as char(s) */
                    } else {
                        /* decode decimal value and save as char(s) */
                    }
                    /* advance pointers and continue */
                }
                for( int i = 0; i < nLut; i++ ) {
                    /* not giving it all away */
                }
                /* handle "found" and "not found" in LUT */
            }
        *pd = '\0';
    }
This was the test program
    #include <stdio.h>

    int main() {
        char str[] = "The fox &amp; hound&dagger; went for a walk&#33; &amp; chat.";

        puts( str );
        replace( str );
        puts( str );
        return 0;
    }
and this was the output
    The fox &amp; hound&dagger; went for a walk&#33; &amp; chat.
    The fox & hound* went for a walk! & chat.
The "project" is to write the interesting bit of the code. It's not difficult.
Caveat: copying within the same buffer only works when each replacement is shorter than or equal in length to the token it replaces. Otherwise you need two buffers.
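For readers who want to see the pieces put together, here is one possible way to fill in the gaps. It is only a sketch, not the answerer's actual code: the names `entity_t`, `entities`, `decode_numeric` and `html_unescape` are made up for illustration, the table holds just a handful of entities, and numeric references are accepted only for code points 1 to 127 so that every replacement is a single byte and stays shorter than its token.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: entity text, its length, and its replacement. */
typedef struct {
    size_t gotLen;      /* length of the entity text */
    const char *got;    /* entity as it appears in the input */
    const char *want;   /* replacement; never longer than got */
} entity_t;

static const entity_t entities[] = {
    { 5, "&amp;",    "&"  },
    { 4, "&lt;",     "<"  },
    { 4, "&gt;",     ">"  },
    { 6, "&quot;",   "\"" },
    { 8, "&dagger;", "*"  },   /* ASCII stand-in, as in the answer above */
};
static const size_t nEntities = sizeof entities / sizeof entities[0];

/* Decode "&#NNN;" or "&#xHH;" starting at ps. On success store the
   character in *out and return the number of input bytes consumed.
   Return 0 if the reference is malformed or outside 1..127. */
static size_t decode_numeric(const char *ps, char *out)
{
    const char *p = ps + 2;                 /* skip "&#" */
    unsigned base = 10;
    if (*p == 'x' || *p == 'X') { base = 16; p++; }

    unsigned long value = 0;
    const char *first = p;
    for (; isxdigit((unsigned char)*p); p++) {
        unsigned d = isdigit((unsigned char)*p)
                   ? (unsigned)(*p - '0')
                   : (unsigned)(tolower((unsigned char)*p) - 'a' + 10);
        if (d >= base) return 0;            /* hex letter in a decimal reference */
        value = value * base + d;
        if (value > 127) return 0;          /* beyond ASCII: not handled here */
    }
    if (p == first || *p != ';' || value == 0) return 0;
    *out = (char)value;
    return (size_t)(p - ps) + 1;            /* include the ';' */
}

/* Decode entities in place. Unknown or malformed entities are copied as-is. */
void html_unescape(char *buf)
{
    char *pd = buf;
    const char *ps = buf;

    while (*ps) {
        if (*ps == '&' && ps[1] == '#') {
            char ch;
            size_t used = decode_numeric(ps, &ch);
            if (used) { *pd++ = ch; ps += used; continue; }
        } else if (*ps == '&') {
            size_t i;
            for (i = 0; i < nEntities; i++)
                if (strncmp(ps, entities[i].got, entities[i].gotLen) == 0)
                    break;
            if (i < nEntities) {
                size_t n = strlen(entities[i].want);
                memcpy(pd, entities[i].want, n);
                pd += n;
                ps += entities[i].gotLen;
                continue;
            }
        }
        *pd++ = *ps++;                      /* ordinary char or unrecognised '&' */
    }
    *pd = '\0';
}
```

For example, calling `html_unescape` on a writable buffer holding `a &lt;b&gt;` leaves `a <b>` in it. Because the write pointer never overtakes the read pointer, no second buffer is needed as long as the table obeys the caveat above.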
CodePudding user response:
This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:
What's the easiest way to escape HTML in Python?
How do you call Python code from C code?
But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:
    parseFile()
        while not EOF
            ch = readNextCharacter()
            if ch == '\'
                readNextCharacter()
            elseif ch == '&'
                readEscapeSequence()
            else
                output += ch

    readEscapeSequence()
        seq = ""
        ch = readNextCharacter()
        while ch != ';'
            seq += ch
            ch = readNextCharacter()
        replace = lookupEscape(seq)
        output += replace
Note that this is only pseudocode to get you started.
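As a rough illustration, the pseudocode could be turned into C along the following lines. This is a sketch under assumptions, not a full parser: `lookup_escape` and `parse_text` are hypothetical names, the table covers only five named entities, the backslash branch is left out, and input is read from a string rather than a file. It writes into a separate output buffer, so replacements may be of any length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup: maps an entity name (without '&' and ';') to its
   replacement text, or returns NULL if the name is not in the table. */
static const char *lookup_escape(const char *name)
{
    static const struct { const char *name, *text; } table[] = {
        { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
        { "quot", "\"" }, { "apos", "'" },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(name, table[i].name) == 0)
            return table[i].text;
    return NULL;
}

/* Decode src into dst (capacity dstSize bytes, terminator included).
   Returns 0 on success, or -1 if dst is too small; in that case dst holds
   the terminated part decoded so far (nothing is written if dstSize is 0). */
int parse_text(const char *src, char *dst, size_t dstSize)
{
    size_t out = 0;
    if (dstSize == 0) return -1;

    while (*src) {
        const char *text = src;     /* default: copy one character as-is */
        size_t len = 1, consumed = 1;
        char name[16];

        if (*src == '&') {
            /* Bounded scan for the closing ';' of the entity name. */
            size_t n = 1;
            while (src[n] && src[n] != ';' && n < sizeof name)
                n++;
            if (src[n] == ';' && n > 1) {
                memcpy(name, src + 1, n - 1);
                name[n - 1] = '\0';
                const char *rep = lookup_escape(name);
                if (rep) {          /* known entity: emit its replacement */
                    text = rep;
                    len = strlen(rep);
                    consumed = n + 1;
                }
            }
        }
        if (out + len >= dstSize) { /* no room for text plus terminator */
            dst[out] = '\0';
            return -1;
        }
        memcpy(dst + out, text, len);
        out += len;
        src += consumed;
    }
    dst[out] = '\0';
    return 0;
}
```

Unknown names and a stray `&` are copied through unchanged here, which is one reasonable choice among several; a stricter parser might report them instead.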