Regex in Python that can extract paths which could contain dots anywhere except end

Time:08-14

I'm processing raw text that may contain several links. Example links that may appear in the text look like this:

\\1.123.42.5\foo\bar\file_name
\\remote_node\shared_path\520.38-nonprod\util\file_name

Links like these can appear anywhere in the text: at the beginning, in the middle, or at the end. Notice that dots can appear in any chunk of a link, but never as its last character. For example:

"Here is an example text and we want to use regex to extract the links. The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name. This text can be continued with even more sentences ..."

Using re.findall(), I'm hoping to get the list [r"\\1.123.42.5\foo\bar\file_name", r"\\remote_node\shared_path\520.38-nonprod\util\file_name"]

Notice that the dot following the second link is not included, since it is the period ending the sentence. We don't know how many chunks/directories a link consists of (>=2 for sure). We only know that the first chunk allows alphanumeric characters, dots, and underscores. The remaining chunks allow alphanumeric characters, dots, underscores, and hyphens, and a link cannot end with a dot.

The regex I have currently is:

first_chunk= r"\w\."
rest_chunk= r"\w\.\-\\"
pattern = re.compile(r"\\\\[%s]+\\[%s]+" % (first_chunk, rest_chunk))

However, this pattern also includes any trailing dot in the link. After reading about the end-of-line anchor ($), I also tried

first_chunk= r"\w\."
rest_chunk= r"\w\.\-\\"
pattern = re.compile(r"\\\\[%s]+\\[%s]+[^\.]$" % (first_chunk, rest_chunk))

or

pattern = re.compile(r"\\\\[%s]+\\[%s]+[^\.]$" % (first_chunk, rest_chunk), flags=re.MULTILINE)

Neither regex precisely extracts the correct links from the text.
I'm wondering how to modify the regex to achieve my goal. Any comments would be greatly appreciated. Thanks!

CodePudding user response:

As @Stuart said in the comments, maybe the re module is unnecessary here:

s = r"""Here is an example text and we want to use regex to extract the links. 
The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name. 
This text can be continued with even more sentences ..."""

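for word in s.split():
    # A link starts with two backslashes; "\\\\" is the escaped form of \\
    if word.startswith("\\\\"):
        # Drop only trailing dots, such as the period that ends a sentence
        print(word.rstrip("."))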

Prints:

\\1.123.42.5\foo\bar\file_name
\\remote_node\shared_path\520.38-nonprod\util\file_name
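
The same split-based idea can be wrapped in a small helper. This is only a minimal sketch: the name extract_unc_paths and the set of trailing punctuation it strips are assumptions for illustration, not something from the question.

```python
def extract_unc_paths(text, trailing=".,;:!?"):
    """Return whitespace-delimited words that start with two backslashes,
    with trailing sentence punctuation removed (hypothetical helper)."""
    return [
        word.rstrip(trailing)          # strip punctuation only at the end
        for word in text.split()       # split on any whitespace
        if word.startswith("\\\\")     # "\\\\" is two literal backslashes
    ]

sample = r"See \\1.123.42.5\foo\bar\file_name, then \\remote_node\shared_path\util\file_name."
print(extract_unc_paths(sample))
```

Note that this does not check which characters each chunk contains, and it would cut off a path that contains spaces, so it only fits text where links are separated by whitespace.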

CodePudding user response:

You could write the pattern so that it matches \ at least 2 times after the leading \\, and leave the dot out of the character class in the final part of the pattern.

\\\\[\w.]+(?:\\[\w.-]+)+\\[\w-]+

Explanation

  • \\\\ Match \\
  • [\w.]+ Match 1+ times either \w or .
  • (?:\\[\w.-]+)+ Repeat 1+ times: a \ followed by 1+ characters from the previous class plus a hyphen
  • \\[\w-]+ Match \ again (to have at least 2 occurrences of \) and match 1+ times either \w or -
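
The breakdown above can also be written inline with re.VERBOSE, which lets the pattern carry its own comments. This is a sketch of the same pattern; the name UNC_PATTERN and the shortened sample text are assumptions for illustration.

```python
import re

UNC_PATTERN = re.compile(
    r"""
    \\\\            # the two leading backslashes
    [\w.]+          # first chunk: word characters and dots
    (?:\\[\w.-]+)+  # one or more middle chunks; hyphens are allowed here
    \\[\w-]+        # final chunk: no dots, so a sentence period is left out
    """,
    re.VERBOSE,
)

text = r"Paths: \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name."
print(UNC_PATTERN.findall(text))
```

Because the only group is non-capturing, findall() returns the full matches rather than group contents.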

See a regex demo.

Example

import re

s = r"""Here is an example text and we want to use regex to extract the links. 
The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name. 
This text can be continued with even more sentences ..."""

pattern = r"\\\\[\w.]+(?:\\[\w.-]+)+\\[\w-]+"

print(re.findall(pattern, s))

Output

['\\\\1.123.42.5\\foo\\bar\\file_name', '\\\\remote_node\\shared_path\\520.38-nonprod\\util\\file_name']
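
One caveat: the pattern above excludes dots from the whole last chunk and needs at least three chunks, so a file name such as report.v2.txt would be cut short and a two-chunk link would be skipped. If that matters, a possible variant (an assumption on my part, not part of the answer above) allows dots inside the last chunk and only requires that the final character is not a dot:

```python
import re

# Assumed variant: dots are allowed inside the last chunk, but the match
# must end on a word character or a hyphen, so a trailing period is dropped.
pattern = re.compile(r"\\\\[\w.]+(?:\\[\w.-]+)*\\[\w.-]*[\w-]")

text = r"Open \\srv\data\report.v2.txt. Then check \\host\share."
matches = pattern.findall(text)
print(matches)
```

Here the middle group is optional, so a link with only two chunks still matches, and the final [\w-] forces the regex engine to backtrack past any trailing dots.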