I'm trying to parse the following page.
url="https://learnjapanesedaily.com/most-common-japanese-words.html/6/#1000-most-common-japanese-words-week-6";
Code:
private void Getwords(string html)
{
    int new1 = 0; // number of words added by this call
    Regex r1 = new Regex(@"<p>\d{1,3}.\s(.*?)\s:\s(.*?)<\/p>", RegexOptions.Multiline | RegexOptions.Singleline);
    MatchCollection matches = r1.Matches(html);
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        // groups[1] is the word, groups[2] is its meaning
        if (!wdic.ContainsKey(groups[1].Value))
        {
            wdic.Add(groups[1].Value, groups[2].Value);
            new1++;
            listBox1.Items.Add(groups[1].Value);
        }
    }
    Console.WriteLine("New words added = " + new1.ToString());
}
When I run it, I get 61 matches instead of 60, because the last match spans a really long stretch of text. I only want each match to cover 1 to 3 lines at most. Is that possible?
CodePudding user response:
The problem is that .*? can also match " : " and, because of RegexOptions.Singleline, newlines. It can even match </p>, so a single match can stretch across a long multiline input. In theory, an input with many " : " occurrences (like " : : : : : : : : : : ") could also cause a lot of backtracking while the regex tries to match something without success.
The solution is to be more precise about what the first .*? should match: you could, for instance, require that it match neither a colon nor a line break.
Unrelated to the problem: you seem to want to match a literal dot after the initial number, in which case you should precede it with a backslash.
Here is a correction:
@"<p>\d{1,3}\.\s([^\n\r:]*?)\s:\s([^\n\r]*?)<\/p>"
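As a minimal sketch of how that corrected pattern might be used, here is a standalone version that leaves out the asker's wdic and listBox1 and returns a dictionary instead. WordParser and Parse are hypothetical names chosen for illustration. RegexOptions.Singleline is dropped because the character classes already state what each group may match.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordParser
{
    // Corrected pattern: escaped dot after the number, no colon or line
    // break in the word group, no line break in the meaning group.
    private static readonly Regex EntryPattern = new Regex(
        @"<p>\d{1,3}\.\s([^\n\r:]*?)\s:\s([^\n\r]*?)<\/p>");

    // Returns word -> meaning for every entry found in the HTML.
    public static Dictionary<string, string> Parse(string html)
    {
        var words = new Dictionary<string, string>();
        foreach (Match match in EntryPattern.Matches(html))
        {
            string word = match.Groups[1].Value;
            // Keep the first meaning seen for a word, as in the question.
            if (!words.ContainsKey(word))
            {
                words.Add(word, match.Groups[2].Value);
            }
        }
        return words;
    }
}
```

With this pattern, a paragraph whose text runs over a line break is skipped rather than swallowed into one long match, because neither group can cross a line break.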
On a more general note: it is a bad idea to parse HTML with regex. Parse the HTML using an HTML parser.
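A rough sketch of the parser-based approach, assuming the third-party HtmlAgilityPack NuGet package is installed and that each entry sits in its own <p> element as the regex in the question implies. ParagraphWordParser and Parse are hypothetical names. The parser finds the paragraphs, and a small anchored regex only has to split the text of one paragraph at a time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: third-party package, not part of .NET

public static class ParagraphWordParser
{
    // Applied to the text of a single paragraph: "12. word : meaning".
    // Without RegexOptions.Singleline, '.' does not match '\n', so an
    // entry cannot span several lines.
    private static readonly Regex EntryPattern =
        new Regex(@"^\d{1,3}\.\s+(.+?)\s:\s(.+)$");

    public static Dictionary<string, string> Parse(string html)
    {
        var words = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return words;
        }

        foreach (HtmlNode p in paragraphs)
        {
            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();
            Match match = EntryPattern.Match(text);
            if (match.Success && !words.ContainsKey(match.Groups[1].Value))
            {
                words.Add(match.Groups[1].Value, match.Groups[2].Value);
            }
        }
        return words;
    }
}
```

Because the parser decides where each paragraph starts and ends, the regex never sees more than one <p> at a time, which rules out the runaway last match described in the question.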