How to find the required pattern for a file name using regex in Python?

Time:11-10

I am trying to match the names of the PDF files in my folders against a pattern.

I have many PDF files whose names follow the same pattern but differ in the name part.

The pattern consists of a date and the name of the file.

The problem is that when I run the script, both file names are treated as matching the first pattern (python_pt), and the elif branch is never reached.

Example:

  • 10-11-2021 python.pdf
  • 22-09-2021 java.pdf

Code:

import re 
import  os 
from os import path 
from tqdm import tqdm
from time import sleep 

python_pt= "^[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}$ python.pdf"
java_pt1= "^[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}$ java.pdf"
java_pt2= "^ java [0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}$.pdf"
src = 'c:'
a = 0
i = 0
for dirpath, dirnames, files in os.walk(src, topdown=True):         
    print(f'\nFound directory: {dirpath}\n')
    
    for  file in tqdm(files):
        sleep(.1)
        full_file_name = os.path.join(dirpath, file)
        if os.path.join(dirpath) == src:
            if file.endswith("pdf"):
                if python_pt:
                    i =1
                elif java_pt1 or java_pt2:
                    a =1
print("{} file 1 \n".format(i))
print("{} file 2 \n".format(a))

CodePudding user response:

The problems are with your regular expressions and the way you perform a regex check:

  • Anchors cannot be placed arbitrarily inside the pattern: a $ in the middle means the pattern can never match, because no characters can follow the end of the string. Since you need to check that file names end with your pattern, put $ at the end only, and do not forget to escape the literal . (write \.).
  • To check whether there is a match, you need to call one of the re.search / re.match / re.fullmatch functions. A bare `if python_pt:` only tests whether the pattern string is non-empty, which is always true, so the elif branch is never reached.
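
Both points can be checked in isolation, without walking any folders. The sketch below is only an illustration: the names broken, fixed and the result variables are made up for this example, and the file names are the two from the question.

```python
import re

# Pattern as written in the question: `$` sits in the middle.
broken = "^[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}$ python.pdf"
# Pattern with `$` moved to the end and the dot escaped.
fixed = r"[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2} python\.pdf$"

# A non-empty string is always truthy, so `if python_pt:` passes for every file.
always_true = bool(broken)

# With `$` in the middle, even the intended file name does not match.
broken_match = re.search(broken, "10-11-2021 python.pdf")

# With the fixed pattern, only the python file matches.
fixed_python = re.search(fixed, "10-11-2021 python.pdf")
fixed_java = re.search(fixed, "22-09-2021 java.pdf")

print(always_true, broken_match, bool(fixed_python), fixed_java)
```

re.search returns a match object on success and None otherwise, which is why it can be used directly in an if / elif condition.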

Here is a fixed snippet:

import re, os
from os import path 
from tqdm import tqdm
from time import sleep 

python_pt= r"[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2} python\.pdf$" # FIXED
java_pt1= r"[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2} java\.pdf$"    # FIXED
java_pt2= r"java [0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}\.pdf$"    # FIXED

src = "C:"
i=0
a=0

for dirpath, dirnames, files in os.walk(src, topdown=True):         
    print(f'\nFound directory: {dirpath}\n')
    
    for  file in tqdm(files):
        sleep(.1)
        full_file_name = os.path.join(dirpath, file)
        if os.path.join(dirpath) == src:
            if file.endswith("pdf"):
                if re.search(python_pt, file):                               # FIXED
                    i =1
                elif re.search(java_pt1, file) or re.search(java_pt2, file): # FIXED
                    a =1
print("{} file 1 \n".format(i))
print("{} file 2 \n".format(a))

See the # FIXED lines.
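
If you want to try the matching logic without touching the file system, it can be pulled out into a small function. This is a sketch under some assumptions, not part of the code above: the helper classify, the DATE constant and the sample list names are hypothetical, the two java patterns are merged into one, and re.fullmatch is used instead of re.search with a trailing $, which is stricter because it also rejects extra characters before the date.

```python
import re
from collections import Counter

# Shared date part, taken from the patterns above.
DATE = r"[0-3]?[0-9]-[0-3]?[0-9]-(?:[0-9]{2})?[0-9]{2}"

# fullmatch anchors both ends, so no explicit ^ or $ is needed.
python_pt = re.compile(DATE + r" python\.pdf")
java_pt = re.compile(r"(?:" + DATE + r" java|java " + DATE + r")\.pdf")


def classify(file_name):
    """Return 'python', 'java', or None for a bare file name (hypothetical helper)."""
    if python_pt.fullmatch(file_name):
        return "python"
    if java_pt.fullmatch(file_name):
        return "java"
    return None


# Sample names: the first two come from the question, the rest are made up.
names = [
    "10-11-2021 python.pdf",
    "22-09-2021 java.pdf",
    "java 01-02-21.pdf",
    "notes.pdf",
]

# Count how many names fall into each category.
counts = Counter(classify(name) for name in names)
print("{} python files".format(counts["python"]))
print("{} java files".format(counts["java"]))
```

Inside the os.walk loop you would then call classify(file) for each file and increment the matching counter.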
