Regular expression to capture windows paths or filenames starting with X:\ **enclosed or not** by q

Time:10-28

I need to build a regular expression to capture one or more windows paths inside a text. It's for a syntax highlighter.

Imagine this text:

Hey, Bob!

I left you the report for tomorrow in D:\Files\Shares\report.pdf along
with the other reports.

There's also this pptx here D:\Files\Internal\source.pptx where you have
the original if you need to change anything.

Cheers!
Alice.

This case is easy to capture with /[a-zA-Z]:\\[^\s]*/mg (see "Capturing sample 1" on regex101).
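As a minimal sketch of that simple pattern in JavaScript (the variable names are only illustrative, and the `m` flag is dropped because the pattern has no `^` or `$` anchors):

```javascript
// Sample text from the question; String.raw keeps the backslashes literal.
const sampleText = String.raw`I left you the report for tomorrow in D:\Files\Shares\report.pdf along
with the other reports.

There's also this pptx here D:\Files\Internal\source.pptx where you have
the original if you need to change anything.`;

// Drive letter, colon, backslash, then everything up to the next whitespace.
const simplePath = /[a-zA-Z]:\\[^\s]*/g;

const simpleMatches = sampleText.match(simplePath);
for (const path of simpleMatches) {
  console.log(path);
}
// Prints:
// D:\Files\Shares\report.pdf
// D:\Files\Internal\source.pptx
```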

Nevertheless, when the path contains spaces, as here:

I left you the report for tomorrow in D:\Shared files\october report.pdf along
with the other reports.

then we run into a problem: what is the path? Is it D:\Shared, or D:\Shared files\october, or D:\Shared files\october report.pdf, or D:\Shared files\october report.pdf along...?
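A short sketch of what the simple pattern actually does with this line, assuming the same JavaScript regex as before (names are illustrative): it stops at the first whitespace and returns only the first candidate.

```javascript
// The line with spaces in the path, taken from the question.
const spacedText = String.raw`I left you the report for tomorrow in D:\Shared files\october report.pdf along`;

// Same simple pattern: it cannot know where a path containing spaces ends.
const naivePath = /[a-zA-Z]:\\[^\s]*/g;

const naiveMatches = spacedText.match(naivePath);
console.log(naiveMatches.length); // 1
console.log(naiveMatches[0]);     // D:\Shared
```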

For a human it's simple to infer. For a computer it's impossible, so I was thinking of requiring users to wrap the filename or path in quotes or brackets to mark where it begins and ends.

Question

How can I write a regex that given this:

Hey, Bob!

I left you the report for tomorrow in "D:\Shared files\october report.pdf" along
with the other reports [Don't forget to add your punctuation]. See
also D:\Multifiles\charlie.docx for more info.

There's also this pptx here [D:\Internal files\source for report.pptx] where you have
the original if you need to change "anything like the boss wants".

Cheers!
Alice.

captures this?

D:\Shared files\october report.pdf
D:\Multifiles\charlie.docx
D:\Internal files\source for report.pptx

but not

Don't forget to add your punctuation
anything like the boss wants

Non-working sample: Solution

PS: Inspired by https://es.javascript.info/regexp-lookahead-lookbehind

CodePudding user response:

If the paths are delimited by "..." or [...], or otherwise consist only of non-whitespace characters, you can use:

\b(?:(?<=\[)[a-z]:\\[^]]*(?=])|(?<=")[a-z]:\\[^"]*(?=")|[a-z]:\\\S+)

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • (?: Non capture group
    • (?<=\[)[a-z]:\\[^]]*(?=]) Between [...] match a-z :\ then match optional chars other than ] using a negated character class
    • | Or
    • (?<=")[a-z]:\\[^"]*(?=") Between "..." match a-z :\ then match optional chars other than " using a negated character class
    • | Or
    • [a-z]:\\\S+ Match a-z :\ and then 1 or more non-whitespace chars
  • ) Close the non capture group

See a regex demo.
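As a hedged sketch, the pattern can be tried in JavaScript against the text from the question. Two details here are assumptions rather than part of the answer as written: the `i` flag is added because the pattern uses lowercase [a-z] while the sample paths start with an uppercase D, and the bracket is escaped as [^\]] because in JavaScript a bare [^] is itself a class that matches any character. The variable names are illustrative.

```javascript
// Text from the question; String.raw keeps the backslashes literal.
const message = String.raw`Hey, Bob!

I left you the report for tomorrow in "D:\Shared files\october report.pdf" along
with the other reports [Don't forget to add your punctuation]. See
also D:\Multifiles\charlie.docx for more info.

There's also this pptx here [D:\Internal files\source for report.pptx] where you have
the original if you need to change "anything like the boss wants".`;

// Three alternatives: a path inside [...], a path inside "...",
// or an unquoted path that runs up to the next whitespace.
// Lookbehind needs an ES2018-capable engine.
const pathPattern =
  /\b(?:(?<=\[)[a-z]:\\[^\]]*(?=\])|(?<=")[a-z]:\\[^"]*(?=")|[a-z]:\\\S+)/gi;

const paths = message.match(pathPattern);
for (const path of paths) {
  console.log(path);
}
// Prints:
// D:\Shared files\october report.pdf
// D:\Multifiles\charlie.docx
// D:\Internal files\source for report.pptx
```

The bracketed and quoted phrases that do not start with a drive letter are skipped because every alternative requires the [a-z]:\ prefix.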
