Python - Check for exact string in file name

Time:12-09

I have a folder where each file name contains a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact number: if I enter '4' as input, it should return only files containing '4', not files containing '14' or '40', for example. My problem is that the program returns every file whose name contains the string anywhere. Note that the numbers aren't always in the same position (for some files the number is at the end, for others it's in the middle).

For instance, if my folder has 7 files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'], I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4'].

Here is what I have (in this case I only want to return files of type mp4):

for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)

I also tried == instead of in.

CodePudding user response:

We can use re.search along with a list comprehension for a regex option:

import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
# match num only when it is not preceded or followed by another digit
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
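
Note that the lookarounds still let 'file.mp4' through, because the 4 in the extension has no digit next to it. A minimal sketch of one way around this, assuming the extension should be ignored entirely, is to split it off with os.path.splitext and search only the remaining stem. The helper name match_number is hypothetical and not part of the original answer.

```python
import os
import re


def match_number(files, num):
    # Hypothetical helper: keep names whose stem (name without extension)
    # contains num as a standalone number.
    pattern = re.compile(r'(?<!\d)' + re.escape(str(num)) + r'(?!\d)')
    result = []
    for f in files:
        stem, _ext = os.path.splitext(f)  # 'file.mp4' -> ('file', '.mp4')
        if pattern.search(stem):
            result.append(f)
    return result


files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
print(match_number(files, 4))  # ['ep 4', 'img4', '4xxx', 'file 4.mp4']
```

Unlike a pattern that forbids every other digit in the name, this one only checks the characters directly around the number, so a name such as 'ep 4 part 2' would still match for 4.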

CodePudding user response:

This can be accomplished with the following function:

import os


files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]


def number_filter(files, number):
    filtered_files = []
    for file_name in files:

        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue

        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)

        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue

        # if the number is present in the file name, we must determine if it's part of a different number
        num_index = file_name.index(str(number))

        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character is a digit
            if file_name[num_index + len(str(number))].isdigit():
                continue
            else:
                print(file_name)
                filtered_files.append(file_name)

        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue
            else:
                print(file_name)
                filtered_files.append(file_name)

        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue
            else:
                print(file_name)
                filtered_files.append(file_name)

    return filtered_files


output = number_filter(files, 4)

for file in output:
    assert file in desired_output

for file in desired_output:
    assert file in output

CodePudding user response:

Judging by your inputs, your desired regular expression needs to meet the following criteria:

  1. Match the number provided, exactly
  2. Ignore number matches in the file extension, if present
  3. Handle file names that include spaces

I think this will meet all these requirements:

import re

def generate(n):
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]

Usage:

>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']

Note that this solution creates a Pattern object once and reuses it to check each file. This avoids handing the pattern string to re.fullmatch on every call. Since the re module caches recently compiled patterns, the performance gain is usually modest, but compiling up front keeps the loop simple.
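
To tie this back to the os.walk loop in the question, here is a rough sketch that compiles the pattern once and reuses it for every file while walking a folder. The function name find_files and the ext parameter are assumptions for illustration; the regular expression is the one from this answer.

```python
import os
import re


def find_files(source_folder, n, ext='.mp4'):
    # Hypothetical helper: return paths of files under source_folder whose
    # name ends with ext and contains n as the only number before the extension.
    regex = re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')
    matches = []
    for root, _dirs, names in os.walk(source_folder):
        for name in names:
            # two separate checks, unlike "'.mp4' and ('4') in f",
            # which only ever tests whether '4' is in f
            if name.lower().endswith(ext) and regex.fullmatch(name):
                matches.append(os.path.join(root, name))
    return matches
```

Calling find_files(source_folder, 4) would then return only the .mp4 files whose name contains 4 as an exact number, with the extension itself ignored by the pattern.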
