Find a regular expression to get substrings with file extensions

Time:03-11

There are several variants of the strings:

  1. `"txt files (*.txt)|*.txt|All files (*.*)|*.*"`
  2. `"Image Files|*.jpg;*.jpeg;*.png;"`
  3. `"Excel Files (*.xls, *.xlsx)|*.xls;*.xlsx|CSV Files (*.csv)|*.csv"`

The substring can end with any character (space, ',', '.', '|', ';') - it doesn't matter.

I tried the following patterns: "[^*].{3,4}(.?);" and "[^*]+.(.?);".

I need a regular expression to get string[] = {.jpg, .jpeg, ...}, preferably without duplicate elements.

CodePudding user response:

Do you really need a regular expression?

First off, if you split by |, every odd-indexed entry in the result (counting from zero) is a list of extensions. You can then split each of those by ; to get the individual patterns, flatten them into a single sequence, and trim the leading * from each element. Finally, take the distinct set and put it into an array.

This can all be accomplished with Split and LINQ:

var extensions = filter.Split('|', StringSplitOptions.RemoveEmptyEntries)
                       .Where((x, i) => i % 2 != 0)
                       .SelectMany(x => x.Split(';', StringSplitOptions.RemoveEmptyEntries))
                       .Select(x => x.TrimStart('*'))
                       .Distinct()
                       .ToArray();

Removing empty entries from the split ensures that if you end with a separator it just gets ignored.
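
As a rough sketch, here is the same chain wrapped in a helper and run over the three sample strings from the question. The name `GetExtensions` is made up for illustration, and the `Split(char, StringSplitOptions)` overload assumes .NET Core 2.0 or later:

```csharp
using System;
using System.Linq;

// Hypothetical helper; the body is the Split/LINQ chain shown above.
static string[] GetExtensions(string filter) =>
    filter.Split('|', StringSplitOptions.RemoveEmptyEntries)
          .Where((x, i) => i % 2 != 0)   // odd zero-based entries hold the patterns
          .SelectMany(x => x.Split(';', StringSplitOptions.RemoveEmptyEntries))
          .Select(x => x.TrimStart('*'))
          .Distinct()
          .ToArray();

var samples = new[]
{
    "txt files (*.txt)|*.txt|All files (*.*)|*.*",
    "Image Files|*.jpg;*.jpeg;*.png;",
    "Excel Files (*.xls, *.xlsx)|*.xls;*.xlsx|CSV Files (*.csv)|*.csv",
};

foreach (var sample in samples)
    Console.WriteLine(string.Join(", ", GetExtensions(sample)));

// Prints:
// .txt, .*
// .jpg, .jpeg, .png
// .xls, .xlsx, .csv
```

Note that the "All files" entry yields `.*`, since it is treated like any other pattern; filter it out afterwards if it is unwanted.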


CodePudding user response:

Simple Split

I would also do it with Split. Something like this should work:

str.Split('*',';','|')
  .Where(s => s.StartsWith(".") && s[1..].All(Char.IsLetterOrDigit))
  .Distinct();

Note: This places no constraint on the length of the extension. If you need one, add a condition to the Where, e.g.:

&& s.Length is >3 and <6

A 3- or 4-character extension is 4 or 5 characters long once the dot is included, which is where "greater than 3 and less than 6" comes from. Note that this uses relational pattern matching, which was added in C# 9. If your C# is older you'll need a conventional length comparison instead.
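
For older compilers, a minimal sketch of the same filter with plain comparisons might look like this. The variable names and the choice of sample string are only for illustration, and `Substring(1)` stands in for the `s[1..]` range syntax:

```csharp
using System;
using System.Linq;

var str = "Excel Files (*.xls, *.xlsx)|*.xls;*.xlsx|CSV Files (*.csv)|*.csv";

var extensions = str.Split('*', ';', '|')
    .Where(s => s.StartsWith(".")
             && s.Length >= 4 && s.Length <= 5            // same as "s.Length is >3 and <6"
             && s.Substring(1).All(char.IsLetterOrDigit)) // rejects ".xls, " and ".csv)"
    .Distinct()
    .ToArray();

Console.WriteLine(string.Join(", ", extensions));
// Prints: .xls, .xlsx, .csv
```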


Regex

But as a Regex learning opportunity: extracting the file extensions from the string is easier with a capturing group:

var r = new Regex(@"\*(?<x>\.\w{3,4})\b");
var arr = r.Matches(str)
  .Cast<Match>()
  .Select(m => m.Groups["x"].Value)
  .Distinct();

The Regex looks for a literal *, then starts capturing characters into a group named x with (?<x>. The captured characters are a literal dot followed by 3 or 4 word characters (letters, digits and underscore). I chose 3 to 4 because your pattern did, but extensions can be shorter or longer, so you may want to tweak that. The final part of the Regex demands a word boundary \b after the 3 or 4 chars, because we don't want partial matches of extensions longer than 4 chars. A word boundary here means the extension has finished: the next char is a non-word char, or the string ends.

To use LINQ on the result we Cast the collection's entries to Match. They are Match objects already, but on .NET Framework a MatchCollection only implements the non-generic IEnumerable, so it isn't LINQ-compatible until Cast turns it into an IEnumerable<Match>. On .NET Core 2.0 and later MatchCollection implements IEnumerable<Match> itself, so the Cast is harmless but no longer required.

The Select retrieves the string value from the capture group, which is the .xxx extension, and Distinct removes duplicates.
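
To see the word boundary at work, here is a small self-contained sketch. The helper name `Extract` and the second input string are invented for illustration; the first input is one of the samples from the question:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var r = new Regex(@"\*(?<x>\.\w{3,4})\b");

// Hypothetical helper that runs the pattern and collects the named group.
string[] Extract(string input) =>
    r.Matches(input)
     .Cast<Match>()
     .Select(m => m.Groups["x"].Value)
     .Distinct()
     .ToArray();

var excel = Extract("Excel Files (*.xls, *.xlsx)|*.xls;*.xlsx|CSV Files (*.csv)|*.csv");
Console.WriteLine(string.Join(", ", excel));
// Prints: .xls, .xlsx, .csv

// Made-up input: ".markdown" is longer than 4 chars, so \b rejects it entirely
// instead of yielding a truncated ".mark".
var pages = Extract("Pages|*.html;*.htm;*.markdown");
Console.WriteLine(string.Join(", ", pages));
// Prints: .html, .htm
```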


Your Regex

As to why your tries didn't work:

[^*]
    .{3,4}
          (.?)
              ;

This matches

  • one char that is anything except an asterisk,
  • followed by 3 or 4 of any char,
  • followed by zero or one of any char, captured into an unnamed group,
  • followed by a semicolon.

It could perhaps be adjusted to work in some cases, but it doesn't describe the pattern of chars you're actually looking for (an asterisk, a dot, then the extension).
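
A quick way to see what that first attempt really captures is to run it against one of the sample strings. This is only a sketch; the variable names are arbitrary:

```csharp
using System;
using System.Text.RegularExpressions;

var input = "Image Files|*.jpg;*.jpeg;*.png;";

// The first pattern from the question, unchanged.
var m = Regex.Match(input, @"[^*].{3,4}(.?);");

Console.WriteLine(m.Value);           // Prints: |*.jpg;
Console.WriteLine(m.Groups[1].Value); // Prints: g
```

The overall match swallows the separator and the asterisk, and the unnamed group holds only the single char just before the semicolon, not the extension.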

[^*]+
     .
      (.?)
          ;

This matches

  • one or more chars that are anything except an asterisk,
  • followed by any char,
  • followed by zero or one of any char, captured into an unnamed group,
  • followed by a semicolon.

I suspect you thought that ^ is an escape that lets you match a literal *. The escape character is actually \.

When ^ is the first char inside [ ] it means "all chars except", so where I suspect you were trying to match a literal asterisk, you ended up matching the exact opposite.

In fact, most chars lose their special meaning when placed inside a character class, so [*] means "match a literal asterisk", just like \* does.
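
The three forms can be compared side by side in a short sketch (the input "a*b" is just a toy string):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "a*b";

var escaped = Regex.Match(input, @"\*").Value;   // literal asterisk via escape
var inClass = Regex.Match(input, @"[*]").Value;  // literal asterisk via character class

// [^*] is the negation: every char that is NOT an asterisk.
var negated = string.Concat(
    Regex.Matches(input, @"[^*]").Cast<Match>().Select(m => m.Value));

Console.WriteLine(escaped);  // Prints: *
Console.WriteLine(inClass);  // Prints: *
Console.WriteLine(negated);  // Prints: ab
```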
