How to remove a suffix from strings ending with a date pattern

I have file names like

ios_g1_v1_yyyymmdd
ios_g1_v1_h1_yyyymmddhhmmss
ios_g1_v1_h1_YYYYMMDDHHMMSS
ios_g1_v1_g1_YYYY
ios_g1_v1_j1_YYYYmmdd
ios_g1_v1
ios_g1_v1_t1_h1
ios_g1_v1_ty1_f1

I would like to remove only the suffix when it matches the string YYYYMMDDHHMMSS OR yyyymmdd OR YYYYmmdd OR YYYY

my expected output would be

ios_g1_v1
ios_g1_v1_h1
ios_g1_v1_h1
ios_g1_v1_g1
ios_g1_v1_j1
ios_g1_v1
ios_g1_v1_t1_h1
ios_g1_v1_ty1_f1

How can I achieve this in Python using regex? I tried something like the below, but it didn't work:

word_trimmed_stage1 = re.sub('.*[^YYYYMMDDHHMMSS]$', '', filename)

CodePudding user response：

You can be explicit and use the exact patterns that you have identified, optionally case insensitive with re.I:

import re

files = ['ios_g1_v1_yyyymmdd',
 'ios_g1_v1_h1_yyyymmddhhmmss',
 'ios_g1_v1_h1_YYYYMMDDHHMMSS',
 'ios_g1_v1_g1_YYYY',
 'ios_g1_v1_j1_YYYYmmdd',
 'ios_g1_v1',
 'ios_g1_v1_t1_h1',
 'ios_g1_v1_ty1_f1']

files2 = [re.sub('_(?:YYYYMMDDHHMMSS|yyyymmdd|YYYYmmdd|YYYY)$', '', x, flags=re.I)
          for x in files]

NB. with re.I you only need one of yyyymmdd/YYYYmmdd.

Compressed variant:

files2 = [re.sub('_YYYY(?:MMDD(?:HHMMSS)?)?$', '', x, flags=re.I) for x in files]

Output:

['ios_g1_v1',
 'ios_g1_v1_h1',
 'ios_g1_v1_h1',
 'ios_g1_v1_g1',
 'ios_g1_v1_j1',
 'ios_g1_v1',
 'ios_g1_v1_t1_h1',
 'ios_g1_v1_ty1_f1']
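
If the same cleanup is needed in several places, the pattern can be compiled once and wrapped in a small function. This is only a minimal sketch, assuming the suffix is always separated from the rest of the name by an underscore; the names DATE_SUFFIX and strip_date_suffix are made up for illustration.

```python
import re

# Hypothetical helper names; the pattern is the compressed variant shown above.
DATE_SUFFIX = re.compile(r'_YYYY(?:MMDD(?:HHMMSS)?)?$', flags=re.I)

def strip_date_suffix(name):
    """Return name without a trailing _YYYY, _YYYYMMDD or _YYYYMMDDHHMMSS placeholder."""
    return DATE_SUFFIX.sub('', name)

print(strip_date_suffix('ios_g1_v1_h1_yyyymmddhhmmss'))  # ios_g1_v1_h1
print(strip_date_suffix('ios_g1_v1_ty1_f1'))             # ios_g1_v1_ty1_f1
```

Because of the `$` anchor, names that merely contain a `y` somewhere else are left untouched.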

CodePudding user response：

To remove a suffix such as "YYYYMMDDHHMMSS" or one of the other specified formats, you can use the rstrip method. Note that rstrip treats its argument as a set of characters rather than an exact suffix: it strips every trailing character that appears in that set, so it can remove more than the suffix itself.

Here's an example of how you can use it:

s = "abcdefgYYYYMMDDHHMMSS"
suffix = "YYYYMMDDHHMMSS"
result = s.rstrip(suffix)  # "abcdefg"

You can also use it to remove the other specified formats by replacing "YYYYMMDDHHMMSS" with the appropriate format string. On Python 3.9 and later, str.removesuffix removes an exact suffix instead.
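
Since rstrip works on a set of characters, a check with endswith followed by slicing may be a safer way to drop an exact suffix. The following is a rough sketch, not a definitive implementation: the helper strip_suffix, the SUFFIXES tuple, and the file name 'my_dummy_yyyy' are all invented for illustration, and it assumes the comparison should be case-insensitive.

```python
# Lower-cased suffixes to remove (assumed from the formats in the question).
SUFFIXES = ('_yyyymmddhhmmss', '_yyyymmdd', '_yyyy')

def strip_suffix(name):
    """Remove one exact trailing suffix, compared case-insensitively."""
    lowered = name.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix):
            return name[:-len(suffix)]
    return name

# rstrip removes any trailing run of the characters '_' and 'y',
# so it also eats the final 'y' of "dummy" (made-up name).
print('my_dummy_yyyy'.rstrip('_yyyy'))   # my_dumm
print(strip_suffix('my_dummy_yyyy'))     # my_dummy
print(strip_suffix('ios_g1_v1_g1_YYYY')) # ios_g1_v1_g1
```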

CodePudding user response：

Disclaimer: this is a non regex approach; @mozway posted a good regex approach

files = ['ios_g1_v1_yyyymmdd',
 'ios_g1_v1_h1_yyyymmddhhmmss',
 'ios_g1_v1_h1_YYYYMMDDHHMMSS',
 'ios_g1_v1_g1_YYYY',
 'ios_g1_v1_j1_YYYYmmdd',
 'ios_g1_v1',
 'ios_g1_v1_t1_h1',
 'ios_g1_v1_ty1_f1']

lst = []
for filename in files:
    k = []
    for x in range(len(filename)):
        # Stop at the first pair of consecutive y/Y characters (start of the date suffix)
        if filename[x] in 'yY' and x + 1 < len(filename) and filename[x + 1] in 'yY':
            break
        k.append(filename[x])
    # Drop the underscore left in front of the removed suffix
    if k and k[-1] == '_':
        lst.append(''.join(k)[:-1])
    else:
        lst.append(''.join(k))

print(lst)

#['ios_g1_v1', 'ios_g1_v1_h1', 'ios_g1_v1_h1', 'ios_g1_v1_g1', 'ios_g1_v1_j1', 'ios_g1_v1', 'ios_g1_v1_t1_h1', 'ios_g1_v1_ty1_f1']

CodePudding user response：

IIUC, your pattern involves Year, Month, Day, Hour, Minute, Second characters with any number of repeated characters in that order, starting with an underscore and case-insensitive.

Try this pattern r"_Y+M*D*H*M*S*" -

import re

regex_pattern = r"_Y+M*D*H*M*S*"
# files is the list of file names from the question
result = [re.sub(regex_pattern, '', i, flags=re.IGNORECASE) for i in files]
result

['ios_g1_v1',
 'ios_g1_v1_h1',
 'ios_g1_v1_h1',
 'ios_g1_v1_g1',
 'ios_g1_v1_j1',
 'ios_g1_v1',
 'ios_g1_v1_t1_h1',
 'ios_g1_v1_ty1_f1']

EXPLANATION

The _ matches the underscore at the start of the pattern
The flags=re.IGNORECASE makes this pattern search case-insensitive
The Y+ matches at least 1 instance of Y
Then the M*D*H*M*S* matches any number of these specific characters after the initial run of Y, in that order (0 or more instances each)
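
One thing to keep in mind is that this pattern is not anchored, so re.sub would also remove a match in the middle of a name. Appending `$` restricts it to the end of the string. A small sketch of the difference, using a made-up name that has the placeholder in the middle:

```python
import re

unanchored = r"_Y+M*D*H*M*S*"
anchored = unanchored + r"$"  # only match at the very end of the string

name = "ios_yyyy_g1_v1"  # hypothetical name, not from the question

print(re.sub(unanchored, "", name, flags=re.IGNORECASE))  # ios_g1_v1
print(re.sub(anchored, "", name, flags=re.IGNORECASE))    # ios_yyyy_g1_v1
```

For the file names listed in the question both variants give the same result, so the anchor only matters if such names can occur.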

CodePudding user response：

This can be another approach

out = []
for filename in filenames:
    if filename.split("_")[-1].lower().startswith("y"):
        out.append("_".join(filename.split("_")[:-1]))
    else:
        out.append(filename)
        
print(out)

You can also pass a generator expression to the list() function instead of appending one element at a time:

out = list(
    "_".join(filename.split("_")[:-1])
    if filename.split("_")[-1].lower().startswith("y")
    else filename
    for filename in filenames
    )

Both approaches should produce the same output:

['ios_g1_v1',
 'ios_g1_v1_h1',
 'ios_g1_v1_h1',
 'ios_g1_v1_g1',
 'ios_g1_v1_j1',
 'ios_g1_v1',
 'ios_g1_v1_t1_h1',
 'ios_g1_v1_ty1_f1']

CodePudding user response：

Try removing everything after the last _ detected.
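
Removing the last segment unconditionally would also shorten names such as ios_g1_v1, which the question wants to keep as they are. A possible sketch that splits at the last underscore and drops the tail only when it is one of the date placeholders; the function name and the DATE_TAILS set are assumptions made for this example:

```python
# Lower-cased placeholders that count as a date suffix (assumed from the question).
DATE_TAILS = {'yyyy', 'yyyymmdd', 'yyyymmddhhmmss'}

def drop_date_tail(name):
    """Split at the last underscore and drop the tail if it is a date placeholder."""
    head, sep, tail = name.rpartition('_')
    if sep and tail.lower() in DATE_TAILS:
        return head
    return name

print(drop_date_tail('ios_g1_v1_j1_YYYYmmdd'))  # ios_g1_v1_j1
print(drop_date_tail('ios_g1_v1'))              # ios_g1_v1
```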