How to count number of characters and words in an HTML file or HTML string?

Time:05-16

I have this string input from an HTML file:

<h1> Hello world </h1> 

I want to count the number of words and characters in this file, not including the HTML elements.

For example:

Input 

<h1>Hello</h1>\n<h1>Hello</h1>

Output

Characters : 10
Word : 2

I believe there has to be a step where we parse the HTML content first, but I don't know which package supports that.

CodePudding user response:

You can count them with regular expressions: strip the tags first, then match the words and characters in what is left.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        input := []byte("<h1>Hello</h1>\n<h1>Hello</h1>")

        // match simple tags such as <h1> or </h1>, and literal backslash escapes such as \n
        tags := regexp.MustCompile(`</?[A-Za-z0-9]+>|\\[A-Za-z]`)
        // replace each match with a space so neighbouring words stay separated
        input = tags.ReplaceAll(input, []byte(" "))

        // a word is a run of letters or digits; find all of them and count
        words := regexp.MustCompile(`[A-Za-z0-9]+`)
        fmt.Println("total words: ", len(words.FindAll(input, -1)))

        // a character is a single letter or digit; find all of them and count
        chars := regexp.MustCompile(`[A-Za-z0-9]`)
        fmt.Println("total characters: ", len(chars.FindAll(input, -1)))
    }

output:

total words:  2
total characters:  10