I need to build a regular expression to capture one or more Windows paths inside a text. It's for a syntax highlighter.
Imagine this text:
Hey, Bob!
I left you the report for tomorrow in D:\Files\Shares\report.pdf along
with the other reports.
There's also this pptx here D:\Files\Internal\source.pptx where you have
the original if you need to change anything.
Cheers!
Alice.
This one is easy to capture with /[a-zA-Z]:\\[^\s]*/mg. See it in regex101 here.
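For reference, here is a minimal JavaScript sketch of that simple pattern applied to the sample text. The variable names are only illustrative, and String.raw is used so the backslashes in the sample do not need doubling.

```javascript
// Sample text from above; String.raw keeps every backslash literal.
const simpleText = String.raw`Hey, Bob!
I left you the report for tomorrow in D:\Files\Shares\report.pdf along
with the other reports.
There's also this pptx here D:\Files\Internal\source.pptx where you have
the original if you need to change anything.`;

// Drive letter, colon, backslash, then everything up to the next whitespace.
const simplePathRegex = /[a-zA-Z]:\\[^\s]*/mg;

// With the g flag, match() returns an array of all matched substrings.
const simpleMatches = simpleText.match(simplePathRegex);

for (const path of simpleMatches) {
  console.log(path);
}
// D:\Files\Shares\report.pdf
// D:\Files\Internal\source.pptx
```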
However, when the path contains spaces, as here:
I left you the report for tomorrow in D:\Shared files\october report.pdf along
with the other reports.
then we run into problems: what is the path?
D:\Shared
or D:\Shared files\october
or D:\Shared files\october report.pdf
or D:\Shared files\october report.pdf along
...
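To make the ambiguity concrete, this small sketch runs the same whitespace-delimited pattern over the text with spaces. It stops at the first space, so it picks the shortest of the candidates listed above. The variable names are illustrative.

```javascript
// The problematic sample: the path itself contains spaces.
const spacedText = String.raw`I left you the report for tomorrow in D:\Shared files\october report.pdf along
with the other reports.`;

// Same pattern as before: it cannot tell a space inside the path
// from the space that ends the path.
const whitespaceStopRegex = /[a-zA-Z]:\\[^\s]*/mg;

const spacedMatches = spacedText.match(whitespaceStopRegex);

console.log(spacedMatches[0]);
// D:\Shared
```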
For a human it's simple to infer. For a computer it's impossible, so I was thinking of forcing users to use quotes or brackets to mark the beginning and end of the filename or path.
Question
How can I write a regex that given this:
Hey, Bob!
I left you the report for tomorrow in "D:\Shared files\october report.pdf" along
with the other reports [Don't forget to add your punctuation]. See
also D:\Multifiles\charlie.docx for more info.
There's also this pptx here [D:\Internal files\source for report.pptx] where you have
the original if you need to change "anything like the boss wants".
Cheers!
Alice.
captures this?
D:\Shared files\october report.pdf
D:\Multifiles\charlie.docx
D:\Internal files\source for report.pptx
but not
Don't forget to add your punctuation
anything like the boss wants
PS: Inspired by https://es.javascript.info/regexp-lookahead-lookbehind
CodePudding user response:
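If you have boundaries like "..." or [...], or otherwise want to match only non-whitespace characters, you can use the following (with the case-insensitive flag, since the pattern uses [a-z]):
\b(?:(?<=\[)[a-z]:\\[^\]]*(?=])|(?<=")[a-z]:\\[^"]*(?=")|[a-z]:\\\S+)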
The pattern matches:
\b
  A word boundary to prevent a partial word match
(?:
  Start of a non-capturing group
(?<=\[)[a-z]:\\[^\]]*(?=])
  Between [...], match a-z:\ and then optional characters other than ], using a negated character class
|
  Or
(?<=")[a-z]:\\[^"]*(?=")
  Between "...", match a-z:\ and then optional characters other than ", using a negated character class
|
  Or
[a-z]:\\\S+
  Match a-z:\ and then 1 or more non-whitespace characters
)
  Close the non-capturing group
See a regex demo.
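As a rough check, here is a JavaScript sketch that runs the pattern against the text from the question. It assumes an engine with lookbehind support (for example a recent Node.js) and uses the g and i flags. The ] inside the negated class is written as \] because in JavaScript [^] on its own is a complete class that matches any character. The variable names are illustrative.

```javascript
// Text from the question; String.raw keeps every backslash literal.
const text = String.raw`Hey, Bob!
I left you the report for tomorrow in "D:\Shared files\october report.pdf" along
with the other reports [Don't forget to add your punctuation]. See
also D:\Multifiles\charlie.docx for more info.
There's also this pptx here [D:\Internal files\source for report.pptx] where you have
the original if you need to change "anything like the boss wants".
Cheers!
Alice.`;

// Three alternatives: a path inside [...], a path inside "...",
// or an undelimited path that runs up to the next whitespace.
const pathRegex = /\b(?:(?<=\[)[a-z]:\\[^\]]*(?=])|(?<=")[a-z]:\\[^"]*(?=")|[a-z]:\\\S+)/gi;

const paths = text.match(pathRegex);

for (const path of paths) {
  console.log(path);
}
// D:\Shared files\october report.pdf
// D:\Multifiles\charlie.docx
// D:\Internal files\source for report.pptx
```

The bracketed and quoted remarks that are not paths are skipped because each alternative requires a drive letter followed by :\ right after the delimiter.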