Find groups of files that end with the same 17 characters

I have files whose names consist of a unique part followed by a common part. I'm trying to match on the common part. I'm currently trying with bash, but I can use Python or whatever works.

file1_02_01_2021_002244.mp4
file2_02_01_2021_002244.mp4
file3_02_01_2021_002244.mp4
# _02_01_2021_002244.mp4 should be the 'match all files that contain this string'

file1_03_01_2021_092200.mp4
file2_03_01_2021_092200.mp4
file3_03_01_2021_092200.mp4
# _03_01_2021_092200.mp4 is the match
...    
file201_01_01_2022_112230.mp4
file202_01_01_2022_112230.mp4
file203_01_01_2022_112230.mp4
# _01_01_2022_112230.mp4 is the match

The goal is to find all files that match from the very end of the file name back to the first unique character, then move them into a folder. The actionable part will be easy; I just need help with the matching.

find -type f $("all that match the same last 17 characters of the file name"); do
    do things
done

CodePudding user response：

In python:

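import os

os.chdir("folder_path")

# Pair each file name with its last 22 characters,
# e.g. "_02_01_2021_002244.mp4" (timestamp plus extension).
data = [[file[-22:], file] for file in os.listdir()]

# Group the file names by that shared suffix.
output = {}
for pattern, filename in data:
    output.setdefault(pattern, []).append(filename)
print(output)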

This builds a dict that maps each pattern (the last 22 characters of the name) to the list of files that end with it.

Output:

{
    '_03_01_2021_092200.mp4': ['file1_03_01_2021_092200.mp4', 'file3_03_01_2021_092200.mp4', 'file2_03_01_2021_092200.mp4'], 
    '_01_01_2022_112230.mp4': ['file202_01_01_2022_112230.mp4', 'file201_01_01_2022_112230.mp4', 'file203_01_01_2022_112230.mp4'], 
    '_02_01_2021_002244.mp4': ['file1_02_01_2021_002244.mp4', 'file2_02_01_2021_002244.mp4', 'file3_02_01_2021_002244.mp4']
}
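Since the question also wants each group moved into a folder, the grouping above could be extended roughly as follows. This is only a sketch: the names `group_by_suffix`, `move_groups` and `SUFFIX_LEN` are made up for illustration, and naming each target folder after the timestamp (the suffix without its leading underscore and extension) is an assumption, not something the question specifies.

```python
import shutil
from collections import defaultdict
from pathlib import Path

SUFFIX_LEN = 22  # len("_02_01_2021_002244.mp4"), as in the answer above


def group_by_suffix(folder, suffix_len=SUFFIX_LEN):
    """Map each trailing suffix to the files in `folder` that end with it."""
    groups = defaultdict(list)
    for path in Path(folder).iterdir():
        # Skip directories and names too short to carry the full suffix.
        if path.is_file() and len(path.name) > suffix_len:
            groups[path.name[-suffix_len:]].append(path)
    return dict(groups)


def move_groups(folder, dest_root):
    """Move every group into its own subfolder of `dest_root`."""
    moved = {}
    for suffix, paths in group_by_suffix(folder).items():
        # Assumed naming scheme: "_02_01_2021_002244.mp4" -> "02_01_2021_002244"
        target = Path(dest_root) / Path(suffix).stem.lstrip("_")
        target.mkdir(parents=True, exist_ok=True)
        for path in paths:
            shutil.move(str(path), str(target / path.name))
        moved[target.name] = sorted(p.name for p in paths)
    return moved
```

Calling `move_groups("folder_path", "sorted")` would then leave one subfolder per timestamp under `sorted`. The groups are collected before anything is moved, so the directory is not modified while it is being listed.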

CodePudding user response：

There are several ways to approach this, including writing a bash script, but if it were me, I'd take the quick and easy road. Use grep and read:

PATTERN=_02_01_2021_002244.mp4
find . -name '*.mp4' | grep -F "$PATTERN" | while IFS= read -r A; do echo "$A"; done

There are probably better ways that I haven't thought of but this gets the job done.
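For comparison, a rough Python equivalent of this "one known pattern" approach might look like the sketch below. The function name `files_ending_with` is hypothetical. One difference worth noting: `str.endswith` only matches at the end of the file name, whereas grep matches the pattern anywhere in the path.

```python
from pathlib import Path


def files_ending_with(root, suffix):
    """Return all files under `root` (recursively) whose name ends with `suffix`."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(suffix)
    )
```

For example, `files_ending_with(".", "_02_01_2021_002244.mp4")` would return the matching paths as `Path` objects, ready to be passed to whatever does the move.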

CodePudding user response：

Try something like this.

First, get all the patterns, sorted and de-duplicated:

find ./data -type f -name "*.mp4" | sed 's/^[^_]*//'|sort -u

Or, with a regex:

find ./data -type f -regextype sed -regex '.*_[0-9]\{2\}_[0-9]\{2\}_[0-9]\{4\}_[0-9]\{6\}\.mp4$'| sed 's/^[^_]*//'|sort -u

Then iterate over the patterns with a while loop to find the files for each pattern:

while IFS= read -r pattern
do
   # find and exec
   find ./data -type f -name "*$pattern" -exec mv {} /to/whatever/you/want/ \;
   # or, instead, find and xargs
   # find ./data -type f -name "*$pattern" | xargs -I {} mv {} /to/whatever/you/want/
done < <(find ./data -type f -name "*.mp4" | sed 's/^[^_]*//'|sort -u)
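The pattern-extraction step can also be sketched in Python. Note that `sed 's/^[^_]*//'` strips everything up to the first underscore in the whole path, so a directory name containing an underscore would throw it off; the sketch below applies a regex to the file name only. The names `SUFFIX_RE` and `unique_suffixes` are hypothetical, and the regex assumes the `_DD_MM_YYYY_HHMMSS.mp4` layout shown in the question.

```python
import re
from pathlib import Path

# Assumed layout: _DD_MM_YYYY_HHMMSS.mp4 at the very end of the file name.
SUFFIX_RE = re.compile(r"_[0-9]{2}_[0-9]{2}_[0-9]{4}_[0-9]{6}\.mp4$")


def unique_suffixes(root):
    """Return the sorted, de-duplicated suffixes found under `root`."""
    found = set()
    for path in Path(root).rglob("*.mp4"):
        match = SUFFIX_RE.search(path.name)
        if match:  # files that do not follow the layout are ignored
            found.add(match.group(0))
    return sorted(found)
```

Each returned suffix plays the same role as `$pattern` in the loop above.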

CodePudding user response：

Try this:

#!/bin/bash

while IFS= read -r line
do
    if [[ "$line" == *_+([0-9])_+([0-9])_+([0-9])_+([0-9])\.mp4 ]]
    then
        echo "MATCH: $line"
    else
        echo "no match: $line"
    fi
done < <(/bin/ls -c1)

Remember that this uses globbing, not regex, when you build your pattern.

That is why I did not use [0-9]{2} to match 2 digits: in globbing, {} does not act as a repetition count the way it does in regex.
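The same glob-versus-regex difference can be seen in Python, where the standard `fnmatch` module implements shell-style wildcards (but not bash's extended globs) and `re` implements regular expressions. This is just an illustrative sketch; the function name `compare` and the three pattern constants are made up for the example.

```python
import fnmatch
import re

# In a glob, "{2}" is just the literal characters "{", "2", "}".
GLOB_WITH_BRACES = "*_[0-9]{2}_[0-9]{2}_[0-9]{4}_[0-9]{6}.mp4"
# A plain glob has to spell out every digit position instead.
GLOB_SPELLED_OUT = (
    "*_[0-9][0-9]_[0-9][0-9]_[0-9][0-9][0-9][0-9]"
    "_[0-9][0-9][0-9][0-9][0-9][0-9].mp4"
)
# In a regex, "{2}" means "exactly two of the preceding item".
REGEX = re.compile(r".*_[0-9]{2}_[0-9]{2}_[0-9]{4}_[0-9]{6}\.mp4")


def compare(name):
    """Return (glob with braces, spelled-out glob, regex) match results."""
    return (
        fnmatch.fnmatchcase(name, GLOB_WITH_BRACES),
        fnmatch.fnmatchcase(name, GLOB_SPELLED_OUT),
        REGEX.fullmatch(name) is not None,
    )


print(compare("file1_02_01_2021_002244.mp4"))  # (False, True, True)
```

The first result is `False` because the glob looks for literal braces in the file name, which is the pitfall described above.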

To use regex, use:

#!/bin/bash

while IFS= read -r line
do
    if [[ $(echo "$line" | grep -cE '_[0-9]{2}_[0-9]{2}_[0-9]{4}_[0-9]{6}\.mp4$') -ne 0 ]]
    then
        echo "MATCH: $line"
    else
        echo "no match: $line"
    fi
done < <(/bin/ls -c1)

This is a more precise match since you can specify how many digits to accept in each sub-pattern.
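To make the precision point concrete, here is a small Python sketch contrasting a loose pattern ("one or more digits" in each field) with a strict one that fixes the digit counts. The names `LOOSE`, `STRICT` and `classify`, and the second sample file name, are invented for the illustration.

```python
import re

# Loose: any number of digits per field.
LOOSE = re.compile(r"_[0-9]+_[0-9]+_[0-9]+_[0-9]+\.mp4$")
# Strict: exactly 2, 2, 4 and 6 digits, as in the regex version above.
STRICT = re.compile(r"_[0-9]{2}_[0-9]{2}_[0-9]{4}_[0-9]{6}\.mp4$")


def classify(name):
    """Return (matches loose pattern, matches strict pattern) for a file name."""
    return (LOOSE.search(name) is not None, STRICT.search(name) is not None)


for name in ["file1_02_01_2021_002244.mp4", "file9_2_1_21_0022.mp4", "notes.txt"]:
    print(name, classify(name))
```

The second name passes the loose check but fails the strict one, which is the kind of near-miss the fixed digit counts are meant to reject.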