I have this string input from an HTML file:
<h1> Hello world </h1>
I want to count the number of words and characters in this file (not including the HTML elements).
For example:
Input
<h1>Hello</h1>\n<h1>Hello</h1>
Output
Characters : 10
Word : 2
I believe there needs to be a step that parses the HTML content first, but I don't know which package supports that.
CodePudding user response:
You can find them with regular expressions from the standard regexp package:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	input := []byte("<h1>Hello</h1>\n<h1>Hello</h1>")
	// replace tags and literal backslash escapes (such as a two-character \n) with a space
	tags := regexp.MustCompile(`</?[A-Za-z0-9]+>|\\[A-Za-z]`)
	input = tags.ReplaceAll(input, []byte(" "))
	// find all runs of letters and digits (words) and count them
	words := regexp.MustCompile(`[A-Za-z0-9]+`)
	fmt.Println("total words:", len(words.FindAll(input, -1)))
	// find all single letters and digits (characters) and count them
	chars := regexp.MustCompile(`[A-Za-z0-9]`)
	fmt.Println("total characters:", len(chars.FindAll(input, -1)))
}
output:
total words: 2
total characters: 10
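Note that the tag pattern only matches simple tags such as <h1> and </h1>; a tag with attributes, for example <h1 class="title">, would not be removed. As a rough alternative, here is a minimal sketch that skips everything between '<' and '>' instead of using a regular expression. The helper name countText is hypothetical, and the sketch assumes there are no comments, scripts, entities, or '>' characters inside attribute values. For real-world HTML, a proper parser such as the golang.org/x/net/html package is probably the safer choice.

```go
package main

import (
	"fmt"
	"unicode"
)

// countText is a hypothetical helper: it skips everything between '<' and '>'
// and counts letters and digits in the remaining text.
func countText(s string) (words, chars int) {
	inTag := false  // true while inside <...>
	inWord := false // true while inside a run of letters/digits
	for _, r := range s {
		switch {
		case r == '<':
			inTag, inWord = true, false // a tag also ends the current word
		case r == '>' && inTag:
			inTag = false
		case inTag:
			// ignore tag names and attributes
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			chars++
			if !inWord {
				words++
				inWord = true
			}
		default:
			inWord = false // whitespace or punctuation ends the word
		}
	}
	return words, chars
}

func main() {
	words, chars := countText("<h1 class=\"title\">Hello</h1>\n<h1>Hello</h1>")
	fmt.Println("total words:", words)
	fmt.Println("total characters:", chars)
}
```

Like the regular expression version, this treats a tag as a word boundary, so "Hel<b>lo</b>" would count as two words.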