bash: constructing complex regexp for find

Time:11-23

While cleaning old backups, I found signs that something had gone wrong: one user's backup contains files with strange names like "37&@4ez98d". To automate the cleanup, I tried to find all such files with this regexp:

find -regextype sed -regex '.*\/[[:digit:]a-z[:punct:]]\{10\}'

All these names are 10 characters long and contain digits, lowercase Latin letters, and some punctuation. The find worked almost perfectly, but it also matched some files with "legal" names like 07-709.pdf. I cannot construct a regexp that means "anywhere inside the given subtree, 10 characters consisting of digits, lowercase Latin letters, and SOME punctuation, excluding the dot and the minus sign".

I tried everything I could think of, but I could not make find ignore names containing minus signs and dots. These symbols may appear anywhere in the file name, so I can't rely on a fixed position. Adding something like [^.] (in any variation) produced no usable results. Grepping find's results for dots and minus signs is also useless, because these symbols may occur in directory names, and filtering those out may also filter out the "bad" file names. I cannot enumerate all possible punctuation characters because I might miss something: I have no idea what "alphabet" was used to scramble these names, though I'm pretty sure it does not contain dots or minus signs.

I managed to work around the problem by piping find's output to an additional check (it was a one-liner; the extra newlines were inserted for readability only):

find -regextype sed -regex '.*\/[[:digit:]a-z[:punct:]]\{10\}' | \
while IFS= read -r a; do \
b=${a: -10}; [[ ! "$b" =~ [.-] ]] && echo "$b"; \
done

but what I really need is a single regexp.

Any suggestions please?

Some real data for testing (the first four should be found, the last three should be ignored):

rxoxywiy7l
u29t@5%0qd
im^ua&saeo
y6mxn2wnkb
07-709.pdf
3023-7.pdf
18099.docx

Thank you.

Answer:

If you are not happy with the semantics of [:punct:] you need to spell out which punctuation characters exactly you want to match.

Quick Duck Duck Going gets me [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-] for the full character class, so excluding dot and minus, try

find -regextype sed -regex '.*\/[][!"#$%&'"'"'()*+,/:;<=>?@\^_`{|}~[:digit:]a-z]\{10\}'

(I had to move the punctuation to the front for simplicity, and break out the single quote into a separate double-quoted string).
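
To check an explicit class like this against the sample names from the question, something along these lines could be used. It is only a sketch: it assumes GNU find (for -regextype), and the scratch directory and the variable names tmpdir and found are made up for the example.

```shell
#!/bin/sh
# Sketch: try an explicit punctuation class on the question's sample names.
# Assumes GNU find; tmpdir and found are illustrative names.
tmpdir=$(mktemp -d)
for name in 'rxoxywiy7l' 'u29t@5%0qd' 'im^ua&saeo' 'y6mxn2wnkb' \
            '07-709.pdf' '3023-7.pdf' '18099.docx'; do
    : > "$tmpdir/$name"
done

# The class lists every [:punct:] character except dot and hyphen.
# Run from inside the scratch directory so the printed paths are short.
found=$(cd "$tmpdir" && find . -regextype sed \
    -regex '.*\/[][!"#$%&'"'"'()*+,/:;<=>?@\^_`{|}~[:digit:]a-z]\{10\}' |
    LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

With the seven sample names, only the four scrambled ones should be printed; the three names containing a dot or a hyphen are skipped.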

As an aside, piping find output to while read is prone to some complications; probably prefer -exec basename {} or something similar to print the file names. (GNU find also has a -printf operator with a rich set of format codes.) See also https://mywiki.wooledge.org/BashFAQ/020
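
A rough illustration of the -printf route, assuming GNU find: printing only the base name (%f) removes the need for the while read loop, and it also means a later grep for dots and minus signs is no longer fooled by directory names. The scratch directory, the backup-2.old subdirectory, and the variable names are invented for the example; file names containing newlines would still confuse the grep step.

```shell
#!/bin/sh
# Sketch: print base names only, then filter on them.
# Assumes GNU find (-regextype, -printf); all names here are illustrative.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/backup-2.old"   # directory name contains '-' and '.'
for name in 'rxoxywiy7l' 'u29t@5%0qd' 'im^ua&saeo' 'y6mxn2wnkb' \
            '07-709.pdf' '3023-7.pdf' '18099.docx'; do
    : > "$tmpdir/backup-2.old/$name"
done

# The question's regex matches all seven files; %f prints just the file
# name, so grep -v only looks at the name and not at the directory part.
names=$(find "$tmpdir" -regextype sed \
    -regex '.*\/[[:digit:]a-z[:punct:]]\{10\}' -type f -printf '%f\n' |
    grep -v '[.-]' | LC_ALL=C sort)
printf '%s\n' "$names"

rm -rf "$tmpdir"
```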

As for grepping the results from find, you can easily anchor the regex to anything after the last slash.

find -type f -name '??????????' |
grep '/[^/.-]*$'

(again subject to the various caveats of the FAQ I linked above) ... though as @oguzismail notes, this can be simplified to just

find -type f -name '??????????' ! -name '*[.-]*'

or even

find -type f -name '[!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-]'
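
For what it's worth, here is a small sketch that runs both -name variants against the sample names from the question and compares the results. The scratch directory and the variable names are made up for the example; -exec basename is used only to strip the directory part for display.

```shell
#!/bin/sh
# Sketch: compare the two glob-based variants on the question's sample names.
# tmpdir, form1 and form2 are illustrative names.
tmpdir=$(mktemp -d)
for name in 'rxoxywiy7l' 'u29t@5%0qd' 'im^ua&saeo' 'y6mxn2wnkb' \
            '07-709.pdf' '3023-7.pdf' '18099.docx'; do
    : > "$tmpdir/$name"
done

# Variant 1: exactly ten characters, and no dot or hyphen anywhere.
form1=$(find "$tmpdir" -type f -name '??????????' ! -name '*[.-]*' \
    -exec basename {} \; | LC_ALL=C sort)

# Variant 2: ten bracket expressions, each excluding dot and hyphen.
form2=$(find "$tmpdir" -type f \
    -name '[!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-]' \
    -exec basename {} \; | LC_ALL=C sort)

printf '%s\n' "$form1"
[ "$form1" = "$form2" ] && echo "both variants agree"

rm -rf "$tmpdir"
```

Because -name only looks at the last path component, dots or hyphens in directory names (mktemp itself typically creates one with a dot) do not affect the match.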

If you wanted to go all-in, you could use the Unicode database to extract all characters which count as punctuation; this is still in theory subject to the whims of the locale of the process which generated these file names (which you can't know) but probably in practice quite sufficient. If your find supported a -regextype which implements Perl / PCRE semantics, you could even use the Perl Unicode property escape \p{Po} (but alas, it apparently doesn't). Here's a list but notice also the various other punctuation classes on the category page.
