I am looking for a C# regex solution to match/capture some small but complex chunks of data. I have thousands of unstructured chunks of data in my database (comes from a third-party data store) that look similar to this:
not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and SIDESHIFT{WithSSPassAttachCenterLine} and TILTANGLE{4up_2down} and not AUTOMATSS{true} and not FORKLASGUI{true} and not FORKCAMSYS{true} and OKED{true}
I want to be able to split that up into discrete pieces (regex match/capture) like the following:
not BATTCOMPAR{275}
and FORKCARRIA{ForkSpreader}
and SIDESHIFT{WithSSPassAttachCenterLine}
and TILTANGLE{4up_2down}
and not AUTOMATSS{true}
and not FORKLASGUI{true}
and not FORKCAMSYS{true}
and OKED{true}
The data will always conform to the following rules:
- At the end of each chunk of data there will be a string enclosed by curly braces, like this: {275}
- The "curly brace grouping" will always come at the end of a string beginning with "not", "and", "and not", or nothing. The "nothing" case is the same as "and" and will only occur when it's the first chunk in the string. For example, if my "and OKED{true}" had come at the beginning of the string, the "and" would have been omitted and OKED{true} would have been prefixed by nothing (empty string). But it's the same as an "and".
- After the operator ("and", "not", "and not", or nothing) there will always be a string designator that ends just before the curly brace grouping. Example: BATTCOMPAR
- It appears that the string designator will always touch the curly brace grouping with no space in between, but I'm not 100% sure. The regex should accommodate the scenario in which a space might come between the string designator and the left curly brace.
- Summary #1 of above points: each chunk will have 3 distinct sub-groups: operator (such as "and not"), string designator (such as BATTCOMPAR), and curly brace grouping (such as {ForkSpreader}).
- Summary #2 of above points: each chunk will begin with one of the 3 listed operators, or nothing, and end with a right-curly-brace. It is guaranteed that only 1 left-curly-brace and only 1 right-curly-brace will exist within the entire segment, and they will always be grouped together at the end of the segment. There is no fear of encountering additional/stray curly braces in other parts of the segment.
I have experimented with a few different regex constructions:
Match curly brace groupings:
Regex regex = new Regex(@"{(.*?)}");
return regex.Matches(str);
The above almost works, but it gets only the curly brace groupings and not the operator and string designator that go with them.
Capture chunks based on string prefix, trying to match operator strings:
var capturedWords = new List<string>();
// Match the prefix followed by one or more word characters, not preceded by a word character
string regex = $@"(?<!\w){prefix}\w+";
foreach (Match match in Regex.Matches(haystack, regex))
{
    capturedWords.Add(match.Value);
}
return capturedWords;
The above partially works, but it gets only the operators, and not the entire chunk I need (operator + string designator + curly brace grouping).
Thanks in advance for any help.
CodePudding user response:
This works for me on Regex Tester: /([and\s|or\s|not\s]+)?.*?(\{.*?\})/mg
On DotNet Fiddle, this worked for me:
() - capture group
[and\\s|or\\s|not\\s]+? - intended to match a leading and, or, not, or a combination of them, each followed by whitespace (strictly speaking, the square brackets make this a character class, so it matches any run of the characters a, n, d, o, r, t, | or whitespace rather than the whole words)
.*? - any combination of characters or none, e.g. BATTCOMPAR
\\{.*?\\} - the final part enclosed in curly braces, which contains any combination of characters or none
using System;
using System.Text.RegularExpressions;

string test = "not BATTCOMPAR{275} and FORKCARRIA{ForkSpreader} and SIDESHIFT{WithSSPassAttachCenterLine} and TILTANGLE{4up_2down} and not AUTOMATSS{true} and not FORKLASGUI{true} and not FORKCAMSYS{true} and OKED{true}";
Regex r = new Regex("([and\\s|or\\s|not\\s]+?.*?\\{.*?\\})", RegexOptions.Multiline);
// Or, if you need to account for matches where there are no
// prepended words (i.e. and, not, and not):
//Regex r = new Regex("([and\\s|or\\s|not\\s|]+?.*?\\{.*?\\}|.*?\\{.*?\\})", RegexOptions.Multiline);
MatchCollection matches = r.Matches(test);
foreach (Match m in matches)
{
    // Trim() drops the separating space that the character class picks up before each later chunk
    Console.WriteLine(m.Value.Trim());
}
Prints:
//not BATTCOMPAR{275}
//and FORKCARRIA{ForkSpreader}
//and SIDESHIFT{WithSSPassAttachCenterLine}
//and TILTANGLE{4up_2down}
//and not AUTOMATSS{true}
//and not FORKLASGUI{true}
//and not FORKCAMSYS{true}
//and OKED{true}
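Since the question also asks for the three sub-groups (operator, string designator, curly brace grouping), here is a minimal sketch of a variant that uses real alternation and named groups instead of a character class. The class name ChunkParser, the method Parse, and the group names op, name and value are made up for illustration. It also assumes the designator consists only of word characters (letters, digits, underscore), which holds for the sample data but is not guaranteed by the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ChunkParser
{
    // op:    optional operator ("and not", "and" or "not"); empty when omitted
    // name:  string designator, assumed to be word characters only
    // value: contents of the curly brace grouping, without the braces
    // \s* before \{ tolerates a space between the designator and the brace.
    private static readonly Regex ChunkRegex = new Regex(
        @"(?<op>\b(?:and\s+not|and|not)\b)?\s*(?<name>\w+)\s*\{(?<value>[^{}]*)\}");

    public static List<(string Op, string Name, string Value)> Parse(string input)
    {
        var result = new List<(string Op, string Name, string Value)>();
        foreach (Match m in ChunkRegex.Matches(input))
        {
            // A group that did not participate in the match has an empty Value.
            result.Add((m.Groups["op"].Value, m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return result;
    }
}
```

Calling ChunkParser.Parse(test) on the sample string should give one tuple per chunk, and m.Value (not used above) would hold the whole chunk without a leading space. The longer alternative "and\s+not" is listed first so that it is preferred over a bare "and".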
CodePudding user response:
The simplest solution seems to be using this regex with Split:
new Regex(@"[ ](?=and)");
It simply splits on spaces that are followed by "and":
Regex regex = new Regex(@"[ ](?=and)");
return regex.Split(str);
You can see the result here: NetRegexBuilder
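A slightly more defensive sketch of the same split idea follows. The names ChunkSplitter and SplitChunks are hypothetical. It adds a word boundary after "and" so that a designator that merely starts with those letters is not treated as an operator, and it accepts any run of whitespace as the separator. Like the original, it relies on every chunk after the first beginning with "and", as the question's rules state.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChunkSplitter
{
    public static string[] SplitChunks(string input)
    {
        // Split on whitespace that is immediately followed by the whole word "and".
        // The lookahead consumes nothing, so "and" stays at the start of each chunk.
        return Regex.Split(input.Trim(), @"\s+(?=and\b)");
    }
}
```

Because the lookahead is not a capturing group, Regex.Split returns only the chunks themselves. Splitting does not separate the operator from the designator, so it fits best when only the whole chunks are needed.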