How to get only human-visible text from page C# with HTMLAgilityPack?

Time:11-23

I am trying to get only the visible text from a page, split it, and return an array of the words on the page. My code:

public async Task<string[]> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    string text = "";
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);
    IEnumerable<HtmlNode> nodes = page.DocumentNode.Descendants().Where(n =>
        n.NodeType == HtmlNodeType.Text &&
        n.ParentNode.Name != "script" &&
        n.ParentNode.Name != "style");
    foreach (HtmlNode node in nodes)
    {
        text += node.InnerText;
    }
    Regex regex = new Regex(@"\W");
    text = text.ToLower();
    text = regex.Replace(text, " ");
    string[] result = text.Split(' ');

    return result;
}


My code does not work well because the result contains merged words. I think the problem is in how I extract the text from the nodes, but I have no idea how to fix it.

CodePudding user response:

Return a List instead of an array. That lets you simplify the code in several places:

public async Task<List<string>> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);
    var results = page.DocumentNode.Descendants().
        Where(n =>
            n.NodeType == HtmlNodeType.Text &&
            n.ParentNode.Name != "script" &&
            n.ParentNode.Name != "style"
        ).Select(n => n.InnerText.ToLower());

    return results.ToList();
}
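
Note that the method above returns the lowercased text of each node, not individual words. A minimal sketch of the splitting step follows, assuming the merged words come from joining node text without a separator. WordSplitter and ToWords are hypothetical names, not part of HtmlAgilityPack; the sketch uses only the standard library so it can be tried on its own, and it takes the InnerText values as plain strings:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Matches one or more consecutive non-word characters.
    private static readonly Regex NonWord = new Regex(@"\W+");

    // Hypothetical helper: takes the text of each node and returns lowercase words.
    // Each fragment is split separately, so words from adjacent nodes never merge.
    public static List<string> ToWords(IEnumerable<string> fragments)
    {
        return fragments
            .Select(f => WebUtility.HtmlDecode(f))                  // decode entities such as &nbsp;
            .SelectMany(f => NonWord.Split(f.ToLowerInvariant()))   // split on runs of non-word characters
            .Where(w => w.Length > 0)                               // drop the empty pieces Split produces
            .ToList();
    }
}
```

With this in place, the query could end in something like WordSplitter.ToWords(results) instead of results.ToList(). One trade-off of splitting per node: a word broken up by inline markup, such as <b>H</b>ello, comes out as two pieces.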

Before the advent of async/await, I would instead have advocated returning IEnumerable, but the async story here isn't as nice yet (i.e., the code below won't work as expected), and IAsyncEnumerable still has some rough edges:

public async Task<IEnumerable<string>> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);

    return page.DocumentNode.Descendants().
        Where(n =>
            n.NodeType == HtmlNodeType.Text &&
            n.ParentNode.Name != "script" &&
            n.ParentNode.Name != "style"
        ).Select(n => n.InnerText.ToLower());
}

But I still think it's worth keeping an eye on this. There's another big level of performance to be gained when the async world also learns how to avoid needing to load the whole set in memory.
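
For reference, here is a rough sketch of what a streaming version can look like with IAsyncEnumerable (C# 8 or later). It reads from a plain TextReader rather than HtmlAgilityPack, because LoadHtml needs the whole document up front; StreamingWords and ReadWordsAsync are hypothetical names used only for illustration:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class StreamingWords
{
    // Matches one or more consecutive non-word characters.
    private static readonly Regex NonWord = new Regex(@"\W+");

    // Hypothetical helper: yields lowercase words one line at a time,
    // so only the current line is held in memory.
    public static async IAsyncEnumerable<string> ReadWordsAsync(TextReader reader)
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            foreach (string word in NonWord.Split(line.ToLowerInvariant()))
            {
                if (word.Length > 0)
                    yield return word;
            }
        }
    }
}
```

A caller would consume it with await foreach (var word in StreamingWords.ReadWordsAsync(reader)) and can stop early without reading the rest of the input.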
