Regex doesn't read multiple lines correctly in C#

Time:02-05

I ran into a few problems trying to capture information from a text file with a regex in C#.

Here is an example of the code and the string:

string pattern = @"([\w\d]+) Hand #([\d]+): Tournament #([\d]+), ([$€])([.\d]+)\+[$€]([.\d]+).+\s\(([\d]+\/[\d]+)\).+([\d]+\/[\d]+\/[\d]+ )([\d:]+) ET" +
                 @"(\n|\r|\r\n)Table '([\s\d]+)' (.+) Seat #(\d) is the button" +
                 @"((?:(?:\n|\r|\r\n)^Seat ([\d]+): (.+) \(([\d]+) in chips\))*)";

MatchCollection matches = Regex.Matches(_hh, pattern, RegexOptions.Multiline);

PokerStars Hand #232702710836: Tournament #3332581238, $9.22+$0.78 USD Hold'em No Limit - Level IV (40/80) - 2021/12/31 22:34:19 ET
Table '3332581238 1' 9-max Seat #3 is the button
Seat 1: mpolishuk (2018 in chips)
Seat 3: Kevin2049 (1154 in chips)
Seat 4: IPray2Buddha (1030 in chips)
Seat 5: Sakura2892 (1499 in chips)
Seat 7: Lillien66 (2141 in chips)
Seat 9: owlie45 (5658 in chips)
mpolishuk: posts the ante 10
Kevin2049: posts the ante 10
IPray2Buddha: posts the ante 10
Sakura2892: posts the ante 10
Lillien66: posts the ante 10
owlie45: posts the ante 10
IPray2Buddha: posts small blind 40
Sakura2892: posts big blind 80
*** HOLE CARDS ***
Dealt to IPray2Buddha [6c 6h]
Lillien66: folds
owlie45: folds
mpolishuk: folds
Kevin2049: folds
IPray2Buddha: calls 40
Sakura2892: raises 1409 to 1489 and is all-in
IPray2Buddha: calls 940 and is all-in
Uncalled bet (469) returned to Sakura2892
*** FLOP *** [8d 7c Ts]
*** TURN *** [8d 7c Ts] [6s]
*** RIVER *** [8d 7c Ts 6s] [Ah]
*** SHOW DOWN ***
IPray2Buddha: shows [6c 6h] (three of a kind, Sixes)
Sakura2892: shows [Kd Qh] (high card Ace)
IPray2Buddha collected 2100 from pot
*** SUMMARY ***
Total pot 2100 | Rake 0
Board [8d 7c Ts 6s Ah]
Seat 1: mpolishuk folded before Flop (didn't bet)
Seat 3: Kevin2049 (button) folded before Flop (didn't bet)
Seat 4: IPray2Buddha (small blind) showed [6c 6h] and won (2100) with three of a kind, Sixes
Seat 5: Sakura2892 (big blind) showed [Kd Qh] and lost with high card Ace
Seat 7: Lillien66 folded before Flop (didn't bet)
Seat 9: owlie45 folded before Flop (didn't bet)

  1. The regex doesn't seem to recognize \n, ^ and $, even though RegexOptions.Multiline is enabled.
  2. It reads only the first occurrence of the repeating expression. I tried both "*" and copying the same expression twice without "*"; either way it reads just the first occurrence.

CodePudding user response:

Perhaps there is a misunderstanding of groups and captures and what they contain. The regex appears to be trying to get data from the "Seat ... in chips" lines. The repetition around those lines is a non-capturing group, but the main values inside it are still captured on every iteration (see the output from the captures, shown below). In .NET, a group that sits inside a repeated group keeps all of its matches in Group.Captures, while Group.Value holds only the last one.

The code below uses the regex from the question, with the string input initialised to the multi-line text shown in the question.

MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.Multiline);
        
for (int ii = 0; ii < matches.Count; ii++)
{
    Console.WriteLine("Match[{0}]  // of 0..{1}:", ii, matches.Count - 1);
    DisplayMatchResults(matches[ii]);
}

This gives the lines below (among much more output) for the "Seat ... in chips" lines. Note that the function DisplayMatchResults is taken from this StackOverflow answer.

    match.Groups[15].Captures[0].Value == "1"
    match.Groups[15].Captures[1].Value == "3"
    match.Groups[15].Captures[2].Value == "4"
    match.Groups[15].Captures[3].Value == "5"
    match.Groups[15].Captures[4].Value == "7"
    match.Groups[15].Captures[5].Value == "9"
    match.Groups[16].Captures[0].Value == "mpolishuk"
    match.Groups[16].Captures[1].Value == "Kevin2049"
    match.Groups[16].Captures[2].Value == "IPray2Buddha"
    match.Groups[16].Captures[3].Value == "Sakura2892"
    match.Groups[16].Captures[4].Value == "Lillien66"
    match.Groups[16].Captures[5].Value == "owlie45"
    match.Groups[17].Captures[0].Value == "2018"
    match.Groups[17].Captures[1].Value == "1154"
    match.Groups[17].Captures[2].Value == "1030"
    match.Groups[17].Captures[3].Value == "1499"
    match.Groups[17].Captures[4].Value == "2141"
    match.Groups[17].Captures[5].Value == "5658"
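
As a minimal sketch of reading every repetition rather than only Group.Value, the code below walks Group.Captures for a simplified seat pattern. The class name SeatCaptures, the method Parse, the group names and the shortened pattern are assumptions made for illustration; this is not the pattern from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SeatCaptures
{
    // Hypothetical, simplified pattern: the "Table ..." line followed by
    // zero or more "Seat N: name (chips in chips)" lines.
    private static readonly Regex TablePattern = new Regex(
        @"^Table '[^']+' .*?Seat #\d+ is the button" +
        @"(?:[\r\n]+Seat (?<seat>\d+): (?<player>[^\r\n]+?) \((?<chips>\d+) in chips\))*",
        RegexOptions.Multiline);

    public static List<(int Seat, string Player, int Chips)> Parse(string handHistory)
    {
        var result = new List<(int Seat, string Player, int Chips)>();
        Match match = TablePattern.Match(handHistory);
        if (!match.Success)
        {
            return result;
        }

        // Each iteration of the repeated group adds one entry to every
        // inner group's Captures collection, so the three lists line up.
        CaptureCollection seats = match.Groups["seat"].Captures;
        CaptureCollection players = match.Groups["player"].Captures;
        CaptureCollection chips = match.Groups["chips"].Captures;

        for (int i = 0; i < seats.Count; i++)
        {
            result.Add((int.Parse(seats[i].Value), players[i].Value, int.Parse(chips[i].Value)));
        }

        return result;
    }
}
```

Calling SeatCaptures.Parse(input) on the text from the question should return one entry per "Seat ... in chips" line, whereas match.Groups["seat"].Value alone would hold only the last seat number.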

Note that the regex is overcomplicated. [\w\d] is the same as \w, and [\d] is the same as \d. There is no need to escape /, so replace \/ with /. The date and the time are treated differently, namely ([\d]+\/[\d]+\/[\d]+ ) versus ([\d:]+). Perhaps the date could be simplified to ([\d/]+)? Also, does the space need to be captured as part of the date? When matching line breaks I normally use [\r\n]+, unless the specific pattern of CRs and LFs is important. There are lots of capture groups in the regex; are they all needed? Note that adding or removing capture groups changes the numbers of all subsequent groups.
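
The simplifications above could look roughly like the sketch below for the first line of a hand. It uses named groups, which is one way to avoid depending on group numbers; that choice, the class name HandHeader, the method Parse and all group names (including "fee" for the second amount) are assumptions for illustration, not something the question specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HandHeader
{
    // \w+ and \d+ replace [\w\d]+ and [\d]+, "/" is not escaped, and the
    // date is matched with a single [\d/]+ class, without the trailing space.
    private static readonly Regex HeaderPattern = new Regex(
        @"^(?<site>\w+) Hand #(?<hand>\d+): Tournament #(?<tournament>\d+), " +
        @"(?<currency>[$€])(?<buyIn>[.\d]+)\+[$€](?<fee>[.\d]+) .*?" +
        @"\((?<blinds>\d+/\d+)\) - (?<date>[\d/]+) (?<time>[\d:]+) ET",
        RegexOptions.Multiline);

    // Returns the first header match; check Success before reading groups.
    public static Match Parse(string handHistory)
    {
        return HeaderPattern.Match(handHistory);
    }
}
```

With this shape, a value is read as match.Groups["date"].Value, so adding or removing another group later does not shift anything.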

CodePudding user response:

It seems that the file, as read in C#, contains multiple \r\n sequences at the end of some lines, which was causing the problem. Changing the newline expression to (\r\n)* solved the problem.
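
A related point that may explain the first problem in the question: with RegexOptions.Multiline, $ in .NET matches before \n (or at the end of the string), not before \r, so an anchored pattern can fail on text with Windows line endings. The sketch below shows one possible workaround, normalizing line endings before matching. The class name LineEndings and both method names are hypothetical, and this is an alternative to the (\r\n)* fix, not a description of it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndings
{
    // Turns every \r\n pair and every lone \r into a single \n.
    public static string Normalize(string text)
    {
        return Regex.Replace(text, @"\r\n?", "\n");
    }

    // Counts lines that consist entirely of a "Seat N: name (chips in chips)" entry.
    // The trailing $ only matches right before \n or at the end of the input.
    public static int CountSeatLines(string text)
    {
        return Regex.Matches(
            text,
            @"^Seat \d+: .+ \(\d+ in chips\)$",
            RegexOptions.Multiline).Count;
    }
}
```

On text whose seat lines all end in \r\n, CountSeatLines(text) finds nothing because a \r sits between ")" and the line feed, while CountSeatLines(LineEndings.Normalize(text)) finds every seat line.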
