Using rstrip() and lstrip() to remove the first and last underscore character "_" within a string

Time:09-16

I have a series of .txt files and I want to remove the prefix and suffix from their names to make them easier to read (and to do further analysis).

A dummy name would be something like "Test_abcdef_000001.txt", "Test_abcdef_000002.txt" or "Test_abcdeft_000001.txt"

To remove the "Test_" and "_000001.txt" parts, I use rstrip() and lstrip() as follows:

import os

for file in os.listdir(directory):
    if file.endswith(".txt"):
        if file.startswith("Test"):
            print('old name is: ' + file + '\n')
            file = file.lstrip('Test_')
            for i in range(20):
                if file.endswith(str(i).zfill(6) + '.txt'):
                    file_1 = file.rstrip('_' + str(i).zfill(6) + '.txt')
                    print('New name is: ' + file_1 + '\n')

The first for loop scans all the files in the directory. The second for loop, over i, handles the _000001 or _000002 part of the test name.

So, for example, with the following 4 test names, I'm expecting 4 "new" test names:

  • Test_abcdtt_000001.txt --> abcdtt

  • Test_abct_000001.txt --> abct

  • Test_defg_000001.txt --> defg

  • Test_tcty_000001.txt --> tcty

However, in actual testing, I get the following result:

  • Test_abcdtt_000001.txt --> abcd

  • Test_abct_000001.txt --> abc

  • Test_defg_000001.txt --> defg

  • Test_tcty_000001.txt --> cty

In other words, all "t" characters next to the "_" are lost, which is not what I want. Is there any advice or suggestion for this problem?

Thank you for your time and support.

For reference: I'm using Python 3.7 on my company computer, so just assume that I can NOT upgrade it to 3.9 and/or import any fancy library. In addition, some of my files may have "_" inside their names, for example Test_ab_ty_ui_000001.txt, and for this one the end result should be ab_ty_ui.

CodePudding user response:

Maybe try using re to match your desired pattern.

import os
import re

prefix = "Test"
# regex to get everything between 'Test_' and '_{digits}.txt'
regex = rf"^{prefix}_(.*)_(\d+)\.txt$"

# this could also be replaced with glob.glob(f"{directory}/{prefix}*") to be more efficient
for file_name in os.listdir(directory):
    match = re.match(regex, file_name)
    if match:
        # group 1 is the part between the prefix and the numeric suffix
        print(match.groups()[0])
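
As a rough sketch, the regex idea above can be tried without touching the file system by running it over the sample names from the question. The helper name `core_name` and the compiled `PATTERN` are hypothetical, introduced only for illustration; the sketch assumes every file of interest looks like `Test_<core>_<digits>.txt`.

```python
import re

# Hypothetical pattern: literal "Test_", the part to keep, "_", digits, ".txt".
PATTERN = re.compile(r"Test_(.+)_\d+\.txt")

def core_name(file_name):
    """Return the part between 'Test_' and '_<digits>.txt', or None if the name does not match."""
    match = PATTERN.fullmatch(file_name)
    return match.group(1) if match else None

names = [
    "Test_ab_ty_ui_000001.txt",
    "Test_abcdtt_000001.txt",
    "Test_tcty_000001.txt",
    "notes.txt",
]
for name in names:
    print(name, "-->", core_name(name))
```

Because `(.+)` is greedy, underscores inside the kept part (as in `ab_ty_ui`) are preserved and only the final `_<digits>.txt` is dropped. Only the standard-library `re` module is used, so this should also run on Python 3.7.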

CodePudding user response:

So, for example, with the following 4 test names, I'm expecting 4 "new" test names:

Test_abcdtt_000001.txt --> abcdtt

Test_abct_000001.txt --> abct

Test_defg_000001.txt --> defg

Test_tcty_000001.txt --> tcty

names = ['Test_ab_ty_ui_000001.txt','Test_abcdtt_000001.txt', 'Test_abct_000001.txt', 'Test_defg_000001.txt', 'Test_tcty_000001.txt']

new_names = []
for name in names:
    parts = name.split('_')
    new_name = '_'.join(parts[1:-1])
    new_names.append(new_name)
print(new_names)

Output:

['ab_ty_ui', 'abcdtt', 'abct', 'defg', 'tcty']
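
For context on why the original attempt loses the "t" characters: lstrip() and rstrip() treat their argument as a set of characters to strip, not as a literal prefix or suffix, so any leading or trailing character that appears in that set is removed. Python 3.9 added str.removeprefix() and str.removesuffix() for the literal case; on 3.7 the same effect can be approximated with startswith(), endswith(), and slicing. The sketch below is one possible way to do that; the helper name `strip_affixes` and its default arguments are assumptions made for illustration.

```python
def strip_affixes(file_name, prefix="Test_", extension=".txt"):
    """Remove a literal prefix, the extension, and a trailing '_<digits>' counter."""
    if not (file_name.startswith(prefix) and file_name.endswith(extension)):
        return file_name  # leave unrelated names untouched
    # Slice off exactly len(prefix) and len(extension) characters.
    core = file_name[len(prefix):-len(extension)]
    stem, sep, counter = core.rpartition("_")
    # Only drop the last field when it really is a numeric counter.
    return stem if sep and counter.isdigit() else core

# lstrip/rstrip remove any characters from the given set, not a literal substring:
print("Test_tcty_000001.txt".lstrip("Test_"))   # cty_000001.txt
print("abct_000001.txt".rstrip("_000001.txt"))  # abc

# Slicing removes only the literal prefix and suffix:
print(strip_affixes("Test_tcty_000001.txt"))      # tcty
print(strip_affixes("Test_ab_ty_ui_000001.txt"))  # ab_ty_ui
```

Unlike the split('_') approach, this variant checks that the name really starts with the prefix and that the last field is numeric before removing anything, which may matter if the directory contains other .txt files.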