Given array1, I want to find the first occurrence of each unique csv entry. The array is already ordered by date, newest first, so the first occurrence is the most recent one.
array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
I want to return an array with the most recent occurrence of each unique csv entry:
array2=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ c.csv/ d.csv/)
and an array of all the duplicate entries:
array3=(s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
My thought process is as follows: loop through the array; if the element is an s3 path, check the elements that follow it and write the s3 path and its csv files to a new array. Stop when the next s3 path is reached. If that s3 path contains csv files that have already been seen, write them to a duplicate array. If it contains new csv files, append them to the new array.
CodePudding user response:
Would you please try the following:
#!/bin/bash
declare -A seen # check if the csv element has appeared
array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()
while read -r first others; do                  # split the line into "s3:.." and others
    read -r -a ary <<< "$others"                # split others into list of csv's
    dup=(); new=()                              # temporary arrays
    for i in "${ary[@]}"; do                    # loop over the csv's
        # seen[$i]++ is 0 (false) the first time a csv appears and non-zero afterwards
        (( seen[$i]++ )) && dup+=( "$i" ) || new+=( "$i" )  # sort the csv's depending on the history
    done
    (( ${#new[@]} )) && array2+=( "$first" "${new[@]}" )    # if array "new" is non-empty, append to array2
    (( ${#dup[@]} )) && array3+=( "$first" "${dup[@]}" )    # if array "dup" is non-empty, append to array3
done < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")  # construct 2-d structure from array1
echo "${array2[@]}"
echo "${array3[@]}"
Output:
s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/
Since array1 appears to have a 2-d structure, I first rearranged its elements with sed into:
s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-07-22/ a.csv/
and then processed them line by line.