Using regex to find files with a particular pattern

Time:11-02

I am new to regex. I have read various tutorials, but I still cannot get my simple code to run.

My files are named like "c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4", ... "c1c2c4_bb_41", "c1c8c9_cc_58", "c1c3c11_aa_19"

I want to find all the ones that include "aa" (such as "c1c2c3_aa_3") and rename them so that "aa" becomes "zz" (giving "c1c2c3_zz_3").

So the string before the first "_" and the last number should stay the same; only the "aa" in the middle changes.

"c1", "c2", "c3" are some conditions. The last numbers are essentially random, so I cannot list them in advance.

I am interested in using regex.

I tried this:

con_list1 = ["c1", "c2", ... "c8"]
con_list2 = ["c1", "c2", ... "c11"]
con_list3 = ["c1", "c2", ... "c10"]

for con1 in con_list1:
    for con2 in con_list2:
        for con3 in con_list3:
            if(os.path.exists("./" + con1 + con2 + con3 + "_aa(.*)")):
                os.rename("./" + con1 + con2 + con3 + "_aa(.*)", "./" + con1 + con2 + con3 + "_zz(.*)")

I want the last number of each file that I rename to stay the same:

"c1c2c3_aa_3" -> "c1c2c3_zz_3"
"c1c2c3_aa_13" -> "c1c2c3_zz_13"

I would also like to learn how to use regex and (.*) the right way.

However, the above code does not work.

I would appreciate help implementing this.

CodePudding user response:

If you have a list like con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"] you may try something like:

import re

con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"]

regex = r"_aa_"

for test_str in con_list1:
    # finditer yields one match object per occurrence of "_aa_"
    for match in re.finditer(regex, test_str):
        # keep the text before and after the match and put "_zz_" in between
        print(test_str[:match.start()] + '_zz_' + test_str[match.end():])

but the simplest way is:

con_list1 = ["c1c2c4_aa_1", "c1c2c3_aa_2", "c1c2c8_aa_3", "c1c3c4_aa_4"]
for test_str in con_list1:
    # str.replace returns a new string, so print (or store) the result
    print(test_str.replace('_aa_', '_zz_'))
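Since the question also asks how to use (.*) properly, here is a minimal sketch with re.sub and capture groups. It works on a plain list of names taken from the question rather than on real files, and the variable names are only illustrative:

```python
import re

# Sample names taken from the question.
names = ["c1c2c4_aa_1", "c1c2c3_aa_13", "c1c2c4_bb_41", "c1c8c9_cc_58"]

# Group 1 captures everything before "_aa_", group 2 the trailing digits.
pattern = re.compile(r"^(.*)_aa_(\d+)$")

# \g<1> and \g<2> put the captured parts back around the new "_zz_".
# Names that do not match the pattern are returned unchanged.
renamed = [pattern.sub(r"\g<1>_zz_\g<2>", name) for name in names]
print(renamed)
# ['c1c2c4_zz_1', 'c1c2c3_zz_13', 'c1c2c4_bb_41', 'c1c8c9_cc_58']
```

The point is that (.*) only has meaning inside a regex function such as re.sub; os.path.exists and os.rename treat it as literal characters in the file name.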

CodePudding user response:

Try this pattern to find all the names: "[a-z0-9]+_aa_[0-9]+"

names = re.findall(r'\"[a-z0-9]+\_aa\_[0-9]+\"', files_names_list.text, flags=re.I)

Here files_names_list.text must be a string containing all your quoted file names, since re.findall searches a string rather than a list.

Hope I understood you correctly.
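A small sketch of this findall approach, with the `+` quantifiers written out. The `listing` string is a made-up stand-in for whatever text holds your quoted file names:

```python
import re

# Hypothetical text holding quoted file names (not from a real directory).
listing = '"c1c2c4_aa_1", "c1c2c3_AA_2", "c1c2c4_bb_41", "c1c3c11_aa_19"'

# [a-z0-9]+ is the prefix, _aa_ is literal, [0-9]+ is the trailing number.
# The parentheses make findall return the name without the quotes.
# re.I makes the match case-insensitive, so "_AA_" is found too.
names = re.findall(r'"([a-z0-9]+_aa_[0-9]+)"', listing, flags=re.I)
print(names)
# ['c1c2c4_aa_1', 'c1c2c3_AA_2', 'c1c3c11_aa_19']
```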

CodePudding user response:

Assuming the files to rename exist in the current directory, would you please try the following:

import os, re
for f in os.listdir('.'):
    m = re.match(r'((?:c\d{1,2}){3})_aa_(\d{1,2})$', f)
    if m:
        newname = m.group(1) + '_zz_' + m.group(2)
        os.rename(f, newname)
  • ((?:c\d{1,2}){3}) matches three repetitions of "c" followed by one or two digits.
  • (\d{1,2}) matches one or two digits.
  • As the regexes above are enclosed by parentheses, the matched substrings are captured by m.group(1) and m.group(2) individually.
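If you want to check the pattern before touching any files, a dry run along these lines may help. The helper name new_name is hypothetical; the regex is the one explained above:

```python
import re

PATTERN = re.compile(r'((?:c\d{1,2}){3})_aa_(\d{1,2})$')

def new_name(old):
    """Return the renamed file name, or None if `old` does not match."""
    m = PATTERN.match(old)
    if m is None:
        return None
    # group(1) is the c-prefix, group(2) the trailing number
    return m.group(1) + '_zz_' + m.group(2)

# Print the planned renames without calling os.rename.
for name in ["c1c2c3_aa_3", "c1c3c11_aa_19", "c1c2c4_bb_41", "c1c2c3_aa_123"]:
    print(name, "->", new_name(name))
```

Note that a three-digit suffix such as "c1c2c3_aa_123" is skipped, because (\d{1,2}) followed by $ allows at most two digits; use \d+ if your numbers can be longer.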

CodePudding user response:

You can use

import os, re

con_list1 = ["c1", "c2", "c3","c4","c5","c6","c7","c8"]
con_list2 = ["c1", "c2", "c3","c4","c5","c6","c7","c8", "c9","c10", "c11"]
con_list3 = ["c1", "c2", "c3","c4","c5","c6","c7","c8", "c9","c10"]
regex = re.compile(f'^((?:{"|".join(map(re.escape, con_list1))})(?:{"|".join(map(re.escape, con_list2))})(?:{"|".join(map(re.escape, con_list3))}))_aa_')

rootdir = "YOUR_ROOT_DIR"
for root, dirs, files in os.walk(rootdir):
    for file in files:
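        if regex.search(file):
            # os.walk yields bare file names, so join them with their directory
            old_path = os.path.join(root, file)
            os.rename(old_path, os.path.join(root, regex.sub(r'\g<1>_zz_', file)))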

Note: os.walk() searches all subdirectories recursively. If you do not need that behavior, see Non-recursive os.walk().

This is not the most efficient way to create a dynamic pattern (a regex trie would be better), but it shows a viable approach. The regex will look like

^((?:c1|c2|c3|c4|c5|c6|c7|c8)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10))_aa_

See the regex demo. Note that each item in your condition lists is re.escaped to make sure special chars do not prevent your file names from matching.

Details:

  • ^ - start of string
  • ((?:c1|c2|c3|c4|c5|c6|c7|c8)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10|c11)(?:c1|c2|c3|c4|c5|c6|c7|c8|c9|c10)) - Group 1: a value from con_list1, then a value from con_list2, then a value from con_list3. \g<1> in the replacement refers to this group's value; if _zz_ is not a placeholder for text starting with a digit, you can even use \1 instead.
  • _aa_ - an _aa_ fixed string.
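To see the pattern described above in action without renaming anything, here is a rough sketch that builds it from the three lists and applies it to a few sample names. The alternation helper and the sample names are assumptions made for illustration:

```python
import re

con_list1 = [f"c{i}" for i in range(1, 9)]    # c1 .. c8
con_list2 = [f"c{i}" for i in range(1, 12)]   # c1 .. c11
con_list3 = [f"c{i}" for i in range(1, 11)]   # c1 .. c10

def alternation(items):
    # Hypothetical helper: builds (?:c1|c2|...) with every item escaped.
    return "(?:" + "|".join(map(re.escape, items)) + ")"

regex = re.compile(
    "^(" + alternation(con_list1) + alternation(con_list2)
    + alternation(con_list3) + ")_aa_"
)

# Made-up sample names; no files are touched.
for name in ["c1c2c4_aa_1", "c1c11c10_aa_7", "c9c2c4_aa_5", "c1c2c4_bb_41"]:
    print(name, "->", regex.sub(r"\g<1>_zz_", name))
```

"c9c2c4_aa_5" stays unchanged because c9 is not in con_list1. \g<1> is used rather than \1 so that a replacement beginning with a digit would not be read as part of the group number.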