I have a large number of file names. I need to create a bash script that gets all of the unique dates from the file names.
Example:
input:
opencomposition_dxxx_20201123.csv.gz
opencomposition_dxxv_20201123.csv.gz
opencomposition_dxxu_20201123.csv.gz
opencomposition_sxxv_20201123.csv.gz
opencomposition_sxxe_20211223.csv.gz
opencomposition_sxxe_20211224.csv.gz
opencomposition_sxxe_20211227.csv.gz
opencomposition_sxxesgp_20230106.csv.gz
output:
20201123 20211223 20211224 20211227 20230106
Code:
for asof_dt in `find -H ./ -maxdepth 1 -nowarn -type f -name *open*.gz |
sort -r | cut -f3 -d "_" | cut -f1 -d"." | uniq`; do
echo $asof_dt
done
Error:
line 20: /bin/find: Argument list too long
CodePudding user response:
Like this (GNU grep):
You need to quote the glob: '*open*.gz'. Otherwise, the shell expands the wildcard * to every matching file and directory in the current directory before find even runs, which is what overflows the argument list.
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' |
grep -oP '_\K\d{8}(?=\.csv)' |
sort -u
Output
20201123
20211223
20211224
20211227
20230106
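To consume the dates in a loop, as the question does, the pipeline can be read line by line rather than word-split from a backtick substitution. The following is only a sketch: `unique_dates` is a hypothetical helper name, and it swaps in `sed -nE` for `grep -oP` so that it does not depend on GNU grep's PCRE support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each unique 8-digit date found in DIR (default: .).
unique_dates() {
    local dir=${1:-.}
    # The quoted glob is passed to find as-is, so the shell never expands it.
    find "$dir" -maxdepth 1 -type f -name '*open*.gz' |
        sed -nE 's/.*_([0-9]{8})\.csv\.gz$/\1/p' |
        sort -u
}

# Read one date per line; no word splitting, no backticks.
unique_dates . | while IFS= read -r asof_dt; do
    echo "$asof_dt"
done
```

Because the loop sits on the right-hand side of a pipe, it runs in a subshell, so variables assigned inside it are not visible afterwards.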
The regular expression matches as follows:
Node | Explanation |
---|---|
_ | _ |
\K | resets the start of the match (what is Kept) as a shorter alternative to using a look-behind assertion: perlmonks look arounds and Support of K in regex |
\d{8} | digits (0-9) (8 times) |
(?= | look ahead to see if there is: |
\. | . |
csv | 'csv' |
) | end of look-ahead |
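To try the pattern on a single name without `grep -P`, the same idea can be sketched with bash's built-in regex matching. Bash uses POSIX ERE, which has neither `\K` nor look-ahead, so a capture group stands in for both here. `date_of` is a hypothetical function name used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 8-digit date embedded in one file name.
date_of() {
    local name=$1
    # ERE equivalent of '_\K\d{8}(?=\.csv)': capture the digits between '_' and '.csv'.
    local re='_([0-9]{8})\.csv'
    if [[ $name =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1   # no date found in this name
    fi
}

date_of 'opencomposition_sxxesgp_20230106.csv.gz'
```

Keeping the regex in a variable avoids quoting surprises on the right-hand side of `=~`.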
CodePudding user response:
Using tr:
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' | tr -d 'a-z_./' | sort -u
CodePudding user response:
If filenames don't contain newline characters, a quick-and-dirty method, similar to your attempt, might be
printf '%s\n' open*.gz | cut -d_ -f3 | cut -d. -f1 | sort -u
Note that printf is a bash builtin, and the "Argument list too long" limit does not apply to bash builtins.
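The same reasoning extends to shell functions and `for` loops, which are also run without an exec, so the whole job can be done with builtins. This is a rough sketch, assuming bash 4 or later (for associative arrays) and the `name_code_YYYYMMDD.csv.gz` layout shown in the question; `extract_dates` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each date once, in first-seen order, using builtins only.
extract_dates() {
    local f base date
    local -A seen=()          # associative array, needs bash >= 4
    for f in "$@"; do
        base=${f##*/}         # strip any directory part
        base=${base%%.*}      # drop everything from the first dot
        date=${base##*_}      # keep the text after the last underscore
        [[ $date =~ ^[0-9]{8}$ ]] || continue   # skip names without a date
        if [[ -z ${seen[$date]+x} ]]; then
            seen[$date]=1
            printf '%s\n' "$date"
        fi
    done
}

# Demo with names from the question; in practice: extract_dates open*.gz | sort
extract_dates \
    opencomposition_dxxx_20201123.csv.gz \
    opencomposition_dxxv_20201123.csv.gz \
    opencomposition_sxxe_20211223.csv.gz \
    opencomposition_sxxesgp_20230106.csv.gz
```

Unlike the `cut` pipeline, this version does not break on file names containing newlines, since each name is handled as a single argument.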