How to use bash to get unique dates from list of file names

I have a large number of file names. I need to create a bash script that gets all of the unique dates from the file names.

Example:

input:

opencomposition_dxxx_20201123.csv.gz     
opencomposition_dxxv_20201123.csv.gz
opencomposition_dxxu_20201123.csv.gz     
opencomposition_sxxv_20201123.csv.gz
opencomposition_sxxe_20211223.csv.gz 
opencomposition_sxxe_20211224.csv.gz  
opencomposition_sxxe_20211227.csv.gz  
opencomposition_sxxesgp_20230106.csv.gz

output:

20201123 20211223 20211224 20211227 20230106

Code:

for asof_dt in `find -H ./ -maxdepth 1 -nowarn -type f -name *open*.gz
| sort -r | cut -f3 -d "_" | cut -f1 -d"." | uniq`; do
    echo $asof_dt
done

Error:

line 20: /bin/find: Argument list too long

CodePudding user response：

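You need to quote the glob: '*open*.gz'. Otherwise the shell expands the wildcard * against the files and directories in the current directory before find runs, so find receives every matching name as a separate argument. With a large number of files, that is what triggers "Argument list too long".

Like this (`GNU grep`):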

find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' |
    grep -oP '_\K\d{8}(?=\.csv)' |
    sort -u

Output (for the file names listed in the question):

20201123
20211223
20211224
20211227
20230106

The regular expression matches as follows:

Node	Explanation
`_`	a literal underscore
`\K`	resets the start of the match (what is `K`ept), a shorter alternative to a look-behind assertion (see: perlmonks look arounds; Support of \K in regex)
`\d{8}`	a digit (0-9), 8 times
`(?=`	look ahead to see if there is:
`\.`	a literal dot
`csv`	the string 'csv'
`)`	end of look-ahead
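
`grep -P` is a GNU extension, so here is a rough equivalent of the same pattern using `sed -E` and a capture group, for systems where `-P` is not available. This is only a sketch: `extract_dates` is a name made up for the example, and unlike `grep -o` it prints at most one date per input line.

```shell
# extract_dates is a hypothetical helper name used only for this sketch.
# It reads file names on stdin and prints each unique 8-digit date that
# follows an underscore and precedes ".csv".
extract_dates() {
    # -n suppresses default output; the trailing p prints only lines
    # where the substitution succeeded, i.e. names that contain a date.
    sed -nE 's/.*_([0-9]{8})\.csv.*/\1/p' | sort -u
}

printf '%s\n' \
    'opencomposition_dxxx_20201123.csv.gz' \
    'opencomposition_dxxv_20201123.csv.gz' \
    'opencomposition_sxxe_20211223.csv.gz' |
    extract_dates
# prints:
# 20201123
# 20211223
```

The output of the `find` command above could be piped into `extract_dates` in place of the `grep | sort` stages.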

CodePudding user response：

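Using tr (this assumes the names contain only lowercase letters, underscores and dots besides the date; the / is in the set because find prints each name with a leading ./):

find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' | tr -d 'a-z_./' | sort -u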

CodePudding user response：

If filenames don't contain newline characters, a quick-and-dirty method, similar to your attempt, might be

printf '%s\n' open*.gz | cut -d_ -f3 | cut -d. -f1 | sort -u

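Note that printf is a bash builtin, so the "Argument list too long" limit does not apply to it: that limit is enforced by the kernel when a new program is executed, and a builtin runs inside the shell itself.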
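
The same idea works with a plain loop, since a `for` loop over a glob also never passes the file list to an external program. Below is a minimal sketch, assuming the date is always the last underscore-separated field of the name (true for the sample names, but an assumption in general). `unique_dates` and `demo_dir` are names invented for the example.

```shell
# unique_dates is a hypothetical helper name for this sketch.
# It loops over open*.gz in the given directory (default: current directory)
# and prints the unique dates, one per line.
unique_dates() {
    for f in "${1:-.}"/open*.gz; do
        [ -e "$f" ] || continue       # skip the literal pattern when nothing matches
        name=${f##*/}                 # strip the directory part
        d=${name##*_}                 # drop everything up to the last underscore
        printf '%s\n' "${d%%.*}"      # drop everything from the first dot onward
    done | sort -u
}

# Demo on a throwaway directory with a few of the sample names.
demo_dir=$(mktemp -d)
touch "$demo_dir/opencomposition_dxxx_20201123.csv.gz" \
      "$demo_dir/opencomposition_sxxv_20201123.csv.gz" \
      "$demo_dir/opencomposition_sxxesgp_20230106.csv.gz"
unique_dates "$demo_dir"
rm -r "$demo_dir"
# prints:
# 20201123
# 20230106
```

Only `sort` is started as an external command here, and it receives the names on stdin rather than as arguments, so the number of files does not matter.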
