I am trying to take my current string and remove everything from it that isn't an actual software version. Here is the string I am currently working with:
print (CurrentVersion)
Delivers the output:
2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here, 2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory, 2021, \\\\here\\is\\another\\path_2021, 2020, http://some.will/even/look/like/this, 2022r2, 2023
What I actually want as output is this:
2018, 2019, 2020, 2021, 2022r2, 2023
What I have tried was to come up with a regular expression to remove the excess data. It looks like '[0-9, ]' will pull out the numbers and commas getting me closer to my goal. So I came up with this code:
RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print (CurrentVersion.group())
But this only prints out an output of "2". Based on a regex calculator it looked like it was going to be a little closer to my expected output. From there I was planning on using .replace to get rid of the extra commas and spaces, but I can't seem to get that far.
So the question is, how do I go from the current output of "CurrentVersion" stripped down to only versions, preferably in numerical order?
CodePudding user response:
You might use a capture group:
(?:^|,\s*)(\d{4}\w*)(?=,|$)
The pattern matches:
(?:^|,\s*) Match either the start of the string, or a comma followed by optional whitespace chars
(\d{4}\w*) Capture 4 digits followed by optional word characters
(?=,|$) Assert either a comma or the end of the string to the right
Example
import re
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
s = ("2018, \\\\\\\\some\\\\directory\\\\is\\\\here, \\\\\\\\some\\\\directory\\\\is\\\\here, 2019, \\\\\\\\here\\\\is\\\\another\\\\directory, \\\\\\\\here\\\\is\\\\another\\\\directory, 2021, \\\\\\\\here\\\\is\\\\another\\\\path_2021, 2020, http://s...content-available-to-author-only...e.will/even/look/like/this, 2022r2, 2023\n")
print(re.findall(pattern, s))
Output
['2018', '2019', '2021', '2020', '2022r2', '2023']
Another option is to match only the years that start with 20, optionally followed by r and 1 or more digits:
(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)
Or matching 4 digits followed by all except a comma:
(?:^|,\s*)(\d{4}[^,]*)
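As a quick check, here is a minimal sketch that runs the three patterns above against a shortened version of the sample string. The shortened sample, the variable names and the version_key helper are assumptions made for illustration; the sort at the end is just one possible way to get the numerical order asked for in the question, and it assumes every match looks like "2022" or "2022r2".

```python
import re

# Shortened sample (an assumption for illustration); the raw string keeps backslashes literal.
s = r"2018, \\some\dir, 2021, \\another\path_2021, 2020, http://some.will/like/this, 2022r2, 2023"

patterns = {
    "four digits + word chars": r"(?:^|,\s*)(\d{4}\w*)(?=,|$)",
    "20xx with optional rN": r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)",
    "four digits up to next comma": r"(?:^|,\s*)(\d{4}[^,]*)",
}

# With a single capture group, findall returns a list of the captured strings.
results = {name: re.findall(p, s) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)

def version_key(v):
    # Hypothetical helper: "2022r2" -> (2022, 2), "2018" -> (2018, 0).
    year, _, rev = v.partition("r")
    return (int(year), int(rev) if rev else 0)

ordered = sorted(results["20xx with optional rN"], key=version_key)
print(", ".join(ordered))  # 2018, 2020, 2021, 2022r2, 2023
```

On this sample all three patterns return the same list; they only differ on inputs such as a year followed by other text before the next comma.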
CodePudding user response:
Your first problem is that the regex [0-9, ] matches any single character that is a digit from 0 to 9, a comma, or a space. It therefore matches each digit of a number individually, as well as commas and spaces, which you don't want. Additionally, it won't match the r in your expected output of 2022r2, and it will match the digits of 2021 in "\\here\is\another\path_2021".
I would instead recommend using (?: |^)(\d+(?:r\d)?). First, this checks that the year is preceded by either a space or the start of the string. Next is a capturing group, which matches 1 or more digits (\d+) and optionally an extension ((?:r\d)?) consisting of the letter "r" and one more digit. If your input could contain more than one digit following the letter "r", you could instead replace this part with (?:r\d+)?.
Your second, bigger problem is that you use RegexVersion.search(CurrentVersion), which only returns the first match in the string.
I would instead recommend using RegexVersion.findall(CurrentVersion), which returns a list of all matches. You could then optionally join that list into one long comma-separated string using ", ".join(CurrentVersion).
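To make the difference between search and findall concrete, here is a minimal sketch. It uses a shortened sample string (an assumption for illustration) and the variant of the pattern that allows several digits after "r"; the names old, first, versions and joined are hypothetical.

```python
import re

# Shortened sample (an assumption for illustration); the raw string keeps backslashes literal.
CurrentVersion = r"2018, \\some\dir, 2019, \\another\path_2021, 2022r2"

# The original character class matches one character at a time,
# and search() stops at the first match.
old = re.compile(r"[0-9, ]")
first = old.search(CurrentVersion).group()
print(first)  # 2

# A space or the start of the string, then digits with an optional rN suffix.
RegexVersion = re.compile(r"(?: |^)(\d+(?:r\d+)?)")
versions = RegexVersion.findall(CurrentVersion)
print(versions)  # ['2018', '2019', '2022r2']

joined = ", ".join(versions)
print(joined)  # 2018, 2019, 2022r2
```

The 2021 inside path_2021 is skipped because it is preceded by an underscore rather than a space. A directory name containing a space followed by digits would still match, so this is only as strict as the data requires.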