Regular Expression to Pull Information from String in Python

What I am trying to do is take my current string and remove all data from it that doesn't contain the actual software version. Here is the string I am currently working with:

print (CurrentVersion)

Delivers the output:

2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here,  2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory,  2021, \\\\here\\is\\another\\path_2021,   2020, http://some.will/even/look/like/this,   2022r2,   2023

What I really want as output is this:

2018, 2019, 2020, 2021, 2022r2, 2023

What I have tried was to come up with a regular expression to remove the excess data. It looks like '[0-9, ]' will pull out the numbers and commas getting me closer to my goal. So I came up with this code:

RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print (CurrentVersion.group())

But this only prints out an output of "2". Based on a regex calculator it looked like it was going to be a little closer to my expected output. From there I was planning on using .replace to get rid of the extra commas and spaces, but I can't seem to get that far.

So the question is, how do I go from the current output of "CurrentVersion" stripped down to only versions, preferably in numerical order?

CodePudding user response：

You might use a capture group:

(?:^|,\s*)(\d{4}\w*)(?=,|$)

The pattern matches:

(?:^|,\s*) Match either the start of the string, or match a comma followed by optional whitespace chars
(\d{4}\w*) Capture 4 digits followed by optional word characters (which may include further digits) in group 1
(?=,|$) Assert either a comma or the end of the string to the right


Example

import re
 
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
 
s = ("2018, \\\\\\\\some\\\\directory\\\\is\\\\here, \\\\\\\\some\\\\directory\\\\is\\\\here,  2019, \\\\\\\\here\\\\is\\\\another\\\\directory, \\\\\\\\here\\\\is\\\\another\\\\directory,  2021, \\\\\\\\here\\\\is\\\\another\\\\path_2021,   2020, http://some.will/even/look/like/this,   2022r2,   2023\n")
 
print(re.findall(pattern, s))

Output

['2018', '2019', '2021', '2020', '2022r2', '2023']
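Note that findall returns the versions in the order they appear in the string, while the question asks for numerical order. A minimal sketch of one way to sort and join them, assuming every version is a year optionally followed by "r" and a release number (the helper name version_key is hypothetical, not part of the answer above):

```python
import re

versions = ['2018', '2019', '2021', '2020', '2022r2', '2023']

def version_key(version):
    # Split e.g. "2022r2" into (2022, 2); a plain "2021" becomes (2021, 0).
    match = re.fullmatch(r"(\d+)(?:r(\d+))?", version)
    if match is None:
        # Assumption: anything else is unexpected input for this sketch.
        raise ValueError(f"unexpected version format: {version!r}")
    year, release = match.groups()
    return int(year), int(release or 0)

ordered = sorted(versions, key=version_key)
result = ", ".join(ordered)
print(result)  # 2018, 2019, 2020, 2021, 2022r2, 2023
```

Comparing integers rather than strings means a hypothetical "2022r10" would sort after "2022r2", which a plain string sort would get wrong.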

Other options could be finding all the years that start with 20 and then optionally matching r followed by 1 or more digits:

(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)


Or matching 4 digits followed by all except a comma:

(?:^|,\s*)(\d{4}[^,]*)

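To see where the three patterns differ, here is a small comparison sketch. The sample string is hypothetical (it is not the string from the question): it adds a year outside 20xx and an entry whose suffix is separated by a space, and the dictionary labels are made up for illustration.

```python
import re

# Hypothetical sample, chosen to expose the differences between the patterns.
sample = r"2018, \\some\dir, 1999, 2022r2, 2024 beta"

patterns = {
    "word suffix": r"(?:^|,\s*)(\d{4}\w*)(?=,|$)",
    "20xx only": r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)",
    "up to comma": r"(?:^|,\s*)(\d{4}[^,]*)",
}

# With a single capture group, findall returns only the captured text.
results = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
# word suffix: ['2018', '1999', '2022r2']
# 20xx only: ['2018', '2022r2']
# up to comma: ['2018', '1999', '2022r2', '2024 beta']
```

Which one fits best depends on how strict you want to be about what counts as a version.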

CodePudding user response：

Your first problem is that the regex [0-9, ] is a character class without a quantifier, so it matches exactly one character that is a digit from 0 to 9, a comma, or a space. It will therefore match each digit in a number individually, as well as commas and spaces, which you don't want. Additionally, it won't match the r in your expected output of 2022r2, and it will match the digits of 2021 in "\\here\is\another\path_2021".
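A quick sketch of that behaviour, using a shortened stand-in for the string in the question (the sample value is an assumption made for brevity):

```python
import re

# Shortened stand-in for the string in the question.
sample = r"2018, \\some\directory\is\here,  2019,   2022r2"

char_class = re.compile(r"[0-9, ]")

# search() stops at the first match, and the class matches a single character.
first = char_class.search(sample).group()
print(first)  # 2

# findall() returns every match, but still one character at a time.
pieces = char_class.findall(sample)
print(pieces[:6])  # ['2', '0', '1', '8', ',', ' ']
```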

I would instead recommend using (?: |^)(\d+(?:r\d)?). First, this checks that the year is preceded by either a space or the start of the string. Next is a capturing group, which matches one or more digits (\d+) and optionally an extension ((?:r\d)?) consisting of the letter "r" and one more digit. If your input could contain more than one digit following the letter "r", you could instead replace this part with (?:r\d+)?.

Your second, bigger problem is that you use RegexVersion.search(CurrentVersion), which only returns the first match in the string.

I would instead recommend using RegexVersion.findall(CurrentVersion), which returns a list of all matches. Assuming you assign that list back to CurrentVersion as in your code, you could then optionally join it into one long comma-separated string using

", ".join(CurrentVersion).
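Putting both fixes together, a minimal sketch might look like the following. It uses a shortened stand-in for the string in the question and the pattern (?: |^)(\d+(?:r\d+)?), and the snake_case variable names are my own rather than the ones from the question:

```python
import re

# Shortened stand-in for the question's string (an assumption for brevity).
current_version = (
    r"2018, \\some\directory\is\here,  2019, "
    r"\\here\is\another\path_2021,   2022r2,   2023"
)

# Digits preceded by a space or the start of the string,
# with an optional "r<digits>" suffix.
regex_version = re.compile(r"(?: |^)(\d+(?:r\d+)?)")

# findall() returns the text of the single capturing group for every match.
versions = regex_version.findall(current_version)
print(versions)             # ['2018', '2019', '2022r2', '2023']
print(", ".join(versions))  # 2018, 2019, 2022r2, 2023
```

The 2021 inside path_2021 is skipped because it is preceded by an underscore rather than a space.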