Using regex to find files with a particular pattern

I am new to regex. I have read various tutorials, but I still cannot get my simple code to run.

My files are named like "c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4", ... "c1c2c4_bb_41", "c1c8c9_cc_58", "c1c3c11_aa_19"

I want to find all the ones that include "aa" (such as "c1c2c3_aa_3") and rename them with "zz" instead (here, "c1c2c3_zz_3").

So the last number and the first string before "_" should stay fixed; only the "aa" in the middle changes.

"c1", "c2", "c3" are some conditions. Also, the last numbers are quite random, so I cannot list them in advance.

I am interested in using regex.

I tried this:

con_list1 = ["c1", "c2", ... "c8"]
con_list2 = ["c1", "c2", ... "c11"]
con_list3 = ["c1", "c2", ... "c10"]

for con1 in con_list1:
    for con2 in con_list2:
        for con3 in con_list3:
            if(os.path.exists("./" + con1 + con2 + con3 + "_aa(.*)")):
                os.rename("./" + con1 + con2 + con3 + "_aa(.*)", "./" + con1 + con2 + con3 + "_zz(.*)")

I want the last number of each renamed file to stay the same:

"c1c2c3_aa_3" -> "c1c2c3_zz_3" "c1c2c3_aa_13" -> "c1c2c3_zz_13"

I am also interested in using regex and (.*) in the right way.

However, the code above does not seem to work.

I would appreciate help implementing this.

CodePudding user response：

If you have a list like con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"] you may try something like:

import re

con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"]

regex = r"_aa_"

for test_str in con_list1:
    # finditer yields one match object per occurrence of "_aa_"
    for match in re.finditer(regex, test_str):
        # Keep the text before and after the match and put "_zz_" in between
        new_str = test_str[:match.start()] + '_zz_' + test_str[match.end():]
        print(new_str)

But the simplest way is:

con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"]
for test_str in con_list1:
    # str.replace returns a new string, so use or store the result
    print(test_str.replace('_aa_', '_zz_'))
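
If "_aa_" could also appear elsewhere in a name, an anchored re.sub limits the change to the part right before the trailing number. This is only a minimal sketch, assuming every name of interest ends in "_aa_" followed by digits; the helper name rename_tag is made up for illustration.

```python
import re

# Hypothetical helper: swap "_aa_" for "_zz_" only when it directly
# precedes the trailing number; other names are returned unchanged.
def rename_tag(name):
    # (\d+) captures the trailing number and \g<1> puts it back unchanged.
    return re.sub(r"_aa_(\d+)$", r"_zz_\g<1>", name)

names = ["c1c2c4_aa_1", "c1c2c4_bb_41", "c1c3c11_aa_19"]
print([rename_tag(n) for n in names])
```

This also shows a capture group doing the job that (.*) was meant to do in the question: the pattern side captures the number, and the replacement side refers back to it.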

CodePudding user response：

Try this to find all names: "[a-z0-9]+_aa_[0-9]+"

names = [n for n in files_names_list if re.fullmatch(r'[a-z0-9]+_aa_[0-9]+', n, flags=re.I)]

files_names_list is a list containing all your file names.

I hope I understood you correctly.

CodePudding user response：

Assuming the files to rename exist in the current directory, would you please try the following:

import os, re
for f in os.listdir('.'):
    m = re.match(r'((?:c\d{1,2}){3})_aa_(\d{1,2})$', f)
    if m:
        newname = m.group(1) + '_zz_' + m.group(2)
        os.rename(f, newname)

((?:c\d{1,2}){3}) matches three repetitions of c followed by one or two digits.
(\d{1,2}) matches one or two digits.
Because both sub-patterns are enclosed in parentheses, the matched substrings are captured and returned by m.group(1) and m.group(2), respectively.
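
To see what the two groups capture without touching any files, the same pattern can be tried on plain strings first. This is a small sketch; the helper name split_name is an assumption made for illustration.

```python
import re

pattern = re.compile(r"((?:c\d{1,2}){3})_aa_(\d{1,2})$")

# Hypothetical helper: return (prefix, number) for a matching name,
# or None when the name does not fit the pattern.
def split_name(name):
    m = pattern.match(name)
    return (m.group(1), m.group(2)) if m else None

for name in ["c1c2c3_aa_3", "c1c3c11_aa_19", "c1c2c4_bb_41"]:
    print(name, "->", split_name(name))
```

Note that (\d{1,2})$ rejects trailing numbers with three or more digits; \d+ would accept any length if that is needed.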

CodePudding user response：

You can use

import os, re

con_list1 = ["c1", "c2", "c3","c4","c5","c6","c7","c8"]
con_list2 = ["c1", "c2", "c3","c4","c5","c6","c7","c8", "c9","c10", "c11"]
con_list3 = ["c1", "c2", "c3","c4","c5","c6","c7","c8", "c9","c10"]
regex = re.compile(f'^((?:{"|".join(map(re.escape, con_list1))})(?:{"|".join(map(re.escape, con_list2))})(?:{"|".join(map(re.escape, con_list3))}))_aa_')

rootdir = "YOUR_ROOT_DIR"
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if regex.search(file):
            # os.walk() yields bare file names, so join them with root
            os.rename(os.path.join(root, file), os.path.join(root, regex.sub(r'\g<1>_zz_', file)))

Note: os.walk() searches all subdirectories recursively; if you do not need that behavior, see Non-recursive os.walk().
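
One possible non-recursive variant is to list a single directory with os.listdir() instead. This is a sketch under that assumption; the helper name rename_flat and its parameters are hypothetical, not part of the answer above.

```python
import os
import re

# Hypothetical helper: rename matching files in one directory only,
# without descending into subdirectories.
def rename_flat(directory, pattern, replacement):
    renamed = []
    # os.listdir() returns the full listing up front, so renaming
    # inside the loop does not disturb the iteration.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        new_name, count = pattern.subn(replacement, name)
        if count:
            os.rename(path, os.path.join(directory, new_name))
            renamed.append((name, new_name))
    return renamed
```

It could be called as rename_flat("YOUR_ROOT_DIR", regex, r'\g<1>_zz_') with a compiled pattern; the returned list of (old, new) pairs makes it easy to check what was changed.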

This is not the most efficient way to create a dynamic pattern (a regex TRIE would be better), but it shows a viable approach. The regex will look like

^((?:c1|c2|c3|c4|c5|c6|c7|c8)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10))_aa_

See the regex demo. Note that each item in your condition lists is re.escaped to make sure special chars do not prevent your file names from matching.

Details:

^ - start of string
((?:c1|c2|c3|c4|c5|c6|c7|c8)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10)) - Group 1 (\g<1> refers to this group's value; if _zz_ is not a placeholder for text starting with a digit, you can even use \1 instead): a value from con_list1, then a value from con_list2 and then a value from con_list3
_aa_ - the fixed string _aa_.
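
The difference between \g<1> and \1 only matters when the replacement text right after the group reference starts with a digit. The sketch below illustrates this with a shortened pattern (c plus one or two digits, three times) standing in for the list-built regex; the replacement strings are made up for the demonstration.

```python
import re

# Simplified stand-in for the pattern built from the condition lists.
regex = re.compile(r"^((?:c\d{1,2}){3})_aa_")

# With "_zz_" after the reference, both spellings behave the same.
same_a = regex.sub(r"\1_zz_", "c1c2c3_aa_3")
same_b = regex.sub(r"\g<1>_zz_", "c1c2c3_aa_3")
print(same_a, same_b)

# With a digit right after the reference, \g<1>0_ means "group 1, then 0_".
print(regex.sub(r"\g<1>0_", "c1c2c3_aa_3"))

# \10_ is read as a reference to group 10, which this pattern does not
# have, so the re module raises an error instead of substituting.
try:
    regex.sub(r"\10_", "c1c2c3_aa_3")
except re.error as exc:
    print("re.error:", exc)
```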