I'm trying to get only the visible text from a page, split it, and return an array of the words on the page. My code:
public async Task<string[]> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    string text = "";
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);
    IEnumerable<HtmlNode> nodes = page.DocumentNode.Descendants().Where(n =>
        n.NodeType == HtmlNodeType.Text &&
        n.ParentNode.Name != "script" &&
        n.ParentNode.Name != "style");
    foreach (HtmlNode node in nodes)
    {
        text = node.InnerText;
    }
    Regex regex = new Regex(@"\W");
    text = text.ToLower();
    text = regex.Replace(text, " ");
    string[] result = text.Split(' ');
    return result;
}
My code doesn't work well: the result contains merged words. I think the problem is how I extract the text from the nodes, but I have no idea how to fix it.
CodePudding user response:
Return a List instead of an array. That lets you simplify the code in several places:
public async Task<List<string>> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);
    // Keep only text nodes that are not inside <script> or <style>,
    // and lower-case the text of each one.
    var results = page.DocumentNode.Descendants().
        Where(n =>
            n.NodeType == HtmlNodeType.Text &&
            n.ParentNode.Name != "script" &&
            n.ParentNode.Name != "style"
        ).Select(n => n.InnerText.ToLower());
    // One list entry per text node, so text from different nodes is never fused.
    return results.ToList();
}
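If the merged words come from gluing the text of neighbouring nodes together with nothing in between, which seems likely, then joining the fragments with a space before splitting should avoid it. The following is only a minimal sketch under that assumption: `WordExtractor` and `ToWords` are hypothetical names, and the `fragments` argument stands in for the `InnerText` values selected above. It uses only the base class library, so it can be tried without HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class WordExtractor
{
    // One or more non-word characters act as a single separator.
    private static readonly Regex NonWord = new Regex(@"\W+");

    public static string[] ToWords(IEnumerable<string> fragments)
    {
        // Join with a space so the last word of one node and the
        // first word of the next node cannot run together.
        string text = string.Join(" ", fragments).ToLowerInvariant();

        // Splitting on runs of separators can still leave empty entries
        // at the start or end, so filter them out.
        return NonWord.Split(text)
                      .Where(w => w.Length > 0)
                      .ToArray();
    }
}
```

For example, `WordExtractor.ToWords(new[] { "Hello", "World!", "  foo-bar " })` returns `hello`, `world`, `foo`, `bar`. Inside `GetText`, the same helper could be applied to `results` before returning.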
Before the advent of async/await, I would have advocated returning IEnumerable instead, but the async story here isn't as nice yet (i.e., the code below won't work as expected: the whole page is still downloaded and parsed before the first item comes back), and IAsyncEnumerable has some rough edges:
public async Task<IEnumerable<string>> GetText(string link)
{
    string htmlSource = await httpClient.GetStringAsync(link);
    page = new HtmlDocument();
    page.LoadHtml(htmlSource);
    return page.DocumentNode.Descendants().
        Where(n =>
            n.NodeType == HtmlNodeType.Text &&
            n.ParentNode.Name != "script" &&
            n.ParentNode.Name != "style"
        ).Select(n => n.InnerText.ToLower());
}
Still, I think it's worth keeping an eye on this. There's another big performance gain to be had once the async world also learns to avoid loading the whole set into memory.
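To give a rough idea of what that streaming style looks like, here is a small sketch using IAsyncEnumerable (C# 8 or later). It is deliberately not an HTML parser: it assumes plain text arriving line by line from a `TextReader`, and `StreamingWords` and `ReadWordsAsync` are hypothetical names chosen for illustration. The point is only that each word is handed to the caller as soon as its line has been read, without first building the full list.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class StreamingWords
{
    private static readonly Regex NonWord = new Regex(@"\W+");

    public static async IAsyncEnumerable<string> ReadWordsAsync(TextReader reader)
    {
        string? line;
        // Read one line at a time; only the current line is held in memory.
        while ((line = await reader.ReadLineAsync()) != null)
        {
            foreach (string word in NonWord.Split(line.ToLowerInvariant()))
            {
                // Skip the empty entries that splitting can produce.
                if (word.Length > 0)
                    yield return word;
            }
        }
    }
}
```

A caller would consume it with `await foreach (string word in StreamingWords.ReadWordsAsync(reader)) { ... }`. Doing the same for real HTML would need a parser that can work incrementally, which the code in this thread does not do.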