How do I avoid this catastrophic regex backtracking issue?

I'm trying to parse the following page:

url="https://learnjapanesedaily.com/most-common-japanese-words.html/6/#1000-most-common-japanese-words-week-6";

Code:

private void Getwords(string html)
{
    int new1 = 0;
    Regex r1 = new Regex(@"<p>\d{1,3}.\s(.*?)\s:\s(.*?)<\/p>", RegexOptions.Multiline | RegexOptions.Singleline);
    MatchCollection matches = r1.Matches(html);
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        if (!wdic.ContainsKey(groups[1].Value))
        {
            wdic.Add(groups[1].Value, groups[2].Value);
            new1++;
            listBox1.Items.Add(groups[1].Value);
        }
    }
    Console.WriteLine("New words added = " + new1.ToString());
}

When I run it, I get 61 matches instead of 60, because the last match matches a really long line. I only want each match to check and parse 1 to 3 lines at most. Is that possible?

Answer:

The problem is that .*? can also match " : " and, because of RegexOptions.Singleline, newlines. If the input contains many such occurrences (in theory, something like " : : : : : : : : : : "), or if the pattern has to run across a long multiline input before it finds a </p>, the engine keeps trying different ways to match without success, and that can cause a lot of backtracking.

The solution is to be more precise about what the first .*? should match: for instance, you could require that it match neither a colon nor a line break.

This is not the cause of the problem, but you seem to want to match a literal dot after the initial number, in which case you should precede it with a backslash.

Here is a correction:

@"<p>\d{1,3}\.\s([^\n\r:]*?)\s:\s([^\n\r]*?)<\/p>"
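To see the corrected pattern in context, here is a minimal sketch. The class name WordParser, the sample input, and the one-second match timeout are assumptions added for illustration; they are not part of the original code. The timeout is just an extra safeguard so that a pathological input raises an exception instead of hanging.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordParser
{
    // Corrected pattern: the first group cannot contain a colon or a line break,
    // the second group cannot contain a line break, and the dot is escaped.
    // The timeout is an added safeguard (an assumption, not in the original).
    private static readonly Regex EntryPattern = new Regex(
        @"<p>\d{1,3}\.\s([^\n\r:]*?)\s:\s([^\n\r]*?)</p>",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    public static Dictionary<string, string> Parse(string html)
    {
        var words = new Dictionary<string, string>();
        try
        {
            // Matches are evaluated lazily, so a timeout surfaces inside the loop.
            foreach (Match match in EntryPattern.Matches(html))
            {
                string word = match.Groups[1].Value;
                if (!words.ContainsKey(word))
                {
                    words.Add(word, match.Groups[2].Value);
                }
            }
        }
        catch (RegexMatchTimeoutException)
        {
            // Matching took longer than the timeout; return what was collected so far.
        }
        return words;
    }
}
```

For a made-up input such as "<p>1. atama : head</p>\n<p>intro : not an entry\nsecond line</p>\n<p>2. inu : dog</p>", Parse returns two entries (atama and inu). The middle paragraph is skipped because it does not start with a number.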

On a more general note: it is a bad idea to parse HTML with regex. Parse the HTML using an HTML parser.
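As a rough illustration of the parser-based approach, the sketch below assumes the third-party HtmlAgilityPack NuGet package, which the original post does not mention. The class name WordPageParser and the anchored per-paragraph pattern are likewise hypothetical. The idea is to let the parser find each <p> element and then inspect only that paragraph's text, so no match can run across paragraphs.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package; an assumption, not used in the original

public static class WordPageParser
{
    // Anchored pattern applied to the text of a single paragraph at a time.
    private static readonly Regex EntryText = new Regex(
        @"^\d{1,3}\.\s+([^:]+?)\s+:\s+(.+)$",
        RegexOptions.Singleline);

    public static Dictionary<string, string> Parse(string html)
    {
        var words = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return words;
        }

        foreach (HtmlNode p in paragraphs)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();
            Match m = EntryText.Match(text);
            if (m.Success && !words.ContainsKey(m.Groups[1].Value))
            {
                words.Add(m.Groups[1].Value, m.Groups[2].Value);
            }
        }
        return words;
    }
}
```

Because each paragraph is checked on its own, an unrelated long paragraph simply fails the anchored pattern instead of being pulled into a neighbouring match.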