For example, I have the following string.
Hello, world, he is my student.
I want to get the count of distinct characters in the middle of each word (ignoring the first and last character):
ell->2,
orl->3,
tuden->5,
I want to get an int array like {2,3,0,0,0,5}.
Note: I want to do this with a regex in C#.
CodePudding user response:
You could use a Regex to identify the middle sections of words, capture them into a group, and then use a MatchEvaluator to count the distinct chars in the captured text (using LINQ):
var distCounts = new List<int>();
Regex.Replace(
"your input text here",
@"\b\w(\w*?)\w\b",
m => { distCounts.Add(m.Groups[1].Value.Distinct().Count()); return "";}
);
I don't like it as much (for the simple case, e.g. no punctuation) as the LINQ from the comments:
text.Split().Select(w => w[1..^1].Distinct().Count()).ToArray()
but I suppose the Regex does give more control over what is considered "a word".
As to how either works:
- The Regex looks for a word boundary followed by one word char, then captures zero or more word chars into group 1, then demands another word char and a boundary
- Split() splits on whitespace, and [1..^1] takes a slice of the string "from 1 char in from the start to 1 char back from the end"
- Then both approaches treat a string as an IEnumerable<char>, get the distinct chars and count them
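There isn't any protection against 1-long words in the Split version: w[1..^1] throws an ArgumentOutOfRangeException on a 1-char string (and on the empty strings Split() produces for consecutive whitespace). Perhaps use a ternary that returns 0 if the length is less than 2, or a Where that omits such words entirely.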
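A minimal sketch of the ternary guard, assuming a hypothetical helper named SplitCounter.CountInnerDistinct (not from the original code) and a runtime that supports ranges on strings (C# 8 / .NET Core 3.0 or later). It also drops empty entries so that consecutive whitespace cannot produce empty tokens:

```csharp
using System;
using System.Linq;

public static class SplitCounter
{
    // Hypothetical helper: for each whitespace-separated token, count the
    // distinct chars strictly between its first and last char.
    public static int[] CountInnerDistinct(string text) =>
        // An empty separator array makes Split use whitespace as the delimiter.
        text.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries)
            // Tokens shorter than 2 chars have no sliceable middle, so report 0
            // instead of letting w[1..^1] throw.
            .Select(w => w.Length < 2 ? 0 : w[1..^1].Distinct().Count())
            .ToArray();
}
```

Calling SplitCounter.CountInnerDistinct("Hello world he is my student a") should give {2,3,0,0,0,5,0}. Punctuation is still treated as part of a word here, so this only suits the simple case.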
I don't know if it would be a performance improvement to return m.Groups[0].Value from the match evaluator; perhaps if Regex sees the same value come back as it sent, it performs no replace, which could cut down on some string ops. If you're concerned about it, swap out the Replace for a standard call to Matches:
// Group 1 holds the middle of each word; count its distinct chars
var distCounts = Regex.Matches(
"your input text here",
@"\b\w(\w*?)\w\b"
).Cast<Match>()
.Select(m => m.Groups[1].Value.Distinct().Count())
.ToArray();
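For reference, here is the Matches approach wrapped into a self-contained sketch. The class and method names (RegexCounter, CountInnerDistinct) are made up for illustration; the pattern is the one from above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCounter
{
    // Hypothetical helper: for every word of 2+ word chars, count the
    // distinct chars captured between its first and last char.
    public static int[] CountInnerDistinct(string text) =>
        Regex.Matches(text, @"\b\w(\w*?)\w\b")
             .Cast<Match>()
             // Group 1 is the middle of the word; it is empty for 2-char words.
             .Select(m => m.Groups[1].Value.Distinct().Count())
             .ToArray();
}
```

With the sentence from the question, RegexCounter.CountInnerDistinct("Hello, world, he is my student.") should give {2,3,0,0,0,5}. Note that the pattern needs at least two word chars, so a 1-char word such as "a" is skipped rather than reported as 0, and \w also matches digits and underscores.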