bash script to return every unique, first occurrence of elements from array

Given array1, I want to find the first occurrence of each unique csv entry. The array is already ordered by date, newest first, so the first occurrence is the most recent.

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)

I want to return an array with the most recent occurrence of each unique csv entry:

array2=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ c.csv/ d.csv/)

I also want an array of all the duplicate entries:

array3=(s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)

My thought process is as follows: loop through the array and, when an element is an s3 path, check the elements that follow it and write the s3 path and its csv files to a new array, stopping when the next s3 path is reached. If a later s3 path contains csv files that have already been seen, write them to a duplicate array. If it contains new csv files, append them to the new array.
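The loop described above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: bash 4 or later (for associative arrays), every group starts with an element beginning with s3://, and the names seen, path, new, dup and flush are hypothetical helpers introduced for illustration.

```shell
#!/bin/bash

declare -A seen                          # csv names already encountered (hypothetical name)

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()
path=""; new=(); dup=()                  # state for the group currently being read

# flush: append the finished group to array2/array3, then reset the temporaries
flush() {
    if (( ${#new[@]} )); then array2+=( "$path" "${new[@]}" ); fi
    if (( ${#dup[@]} )); then array3+=( "$path" "${dup[@]}" ); fi
    new=(); dup=()
}

for el in "${array1[@]}"; do
    if [[ $el == s3://* ]]; then         # a new s3 path closes the previous group
        flush
        path=$el
    elif [[ -n ${seen[$el]:-} ]]; then   # csv already seen under a more recent path
        dup+=( "$el" )
    else                                 # first (most recent) occurrence of this csv
        seen[$el]=1
        new+=( "$el" )
    fi
done
flush                                    # do not forget the last group

echo "${array2[@]}"
echo "${array3[@]}"
```

For the sample array1 this prints the array2 and array3 contents listed in the question. The final flush after the loop is needed because the last group is not followed by another s3 path.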

CodePudding user response：

Would you please try the following:

#!/bin/bash

declare -A seen                                                         # check if the csv element has appeared

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()

while read -r first others; do                                          # split the line into "s3:.." and others
    read -r -a ary <<< "$others"                                        # split others into list of csv's
    dup=(); new=()                                                      # temporary arrays
    for i in "${ary[@]}"; do                                            # loop over the csv's
        (( seen[$i]++ )) && dup+=( "$i" ) || new+=( "$i" )              # sort the csv's depending on the history
    done
    (( ${#new[@]} )) && array2+=( "$first" "${new[@]}" )                # if array "new" is non-empty, append to array2
    (( ${#dup[@]} )) && array3+=( "$first" "${dup[@]}" )                # if array "dup" is non-empty, append to array3
done < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")          # construct 2-d structure from array1

echo "${array2[@]}"
echo "${array3[@]}"

Output:

s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/

Since array1 appears to have a 2-d structure, I first rearranged the elements with sed into:

s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-07-22/ a.csv/

and then processed them line by line.
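To look at the rearrangement step on its own, the sed command can be run separately and its lines captured in an array. This is a small sketch, assuming bash 4 or later (for mapfile), the default IFS, and no whitespace inside any element; the array name rows is hypothetical.

```shell
#!/bin/bash

array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)

# "${array1[*]}" joins all elements with single spaces into one line.
# sed then replaces each " s3://" with a newline followed by "s3://", so every
# s3 path starts its own line; the first path has no leading space and is untouched.
mapfile -t rows < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")

for r in "${rows[@]}"; do
    read -r first others <<< "$r"        # first word is the path, the rest are csv's
    echo "path=$first csvs=$others"
done
```

Each entry of rows holds one s3 path followed by its csv files, which is the shape the while read loop in the answer consumes.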