I have nearly 1000 space-delimited files, and I need to run join on every pair of them, using only the second column of each file. The important thing is that no comparison is repeated, which is why I'm thinking of permutations (strictly speaking combinations: each unordered pair exactly once).
For instance, take a small example with 3 files: A.txt, B.txt and C.txt.
The general idea is to get the A B, A C and B C comparisons, but not B A, C A or C B.
The naive code would be:
join -1 2 -2 2 A.txt B.txt | cut -d ' ' -f1 > AB.txt
join -1 2 -2 2 A.txt C.txt | cut -d ' ' -f1 > AC.txt
join -1 2 -2 2 B.txt C.txt | cut -d ' ' -f1 > BC.txt
Is there a way to accomplish this for a thousand files? I tried using a for loop, but it fried my brain, and now I'm trying a while loop. But I'd better get some guidance first.
CodePudding user response:
As the number of iterations is quite large, performance becomes an issue. Here is an optimized version of Matty's answer that uses an array to halve the number of iterations (about half a million instead of a million) and to avoid the string comparison test:
declare -a files=( *.txt )
declare -i len=${#files[@]}
declare -i lenm1=$(( len - 1 ))
for (( i = 0; i < lenm1; i++ )); do
    a="${files[i]}"
    ab="${a%.txt}"
    # start at i + 1 so that each unordered pair is visited only once
    for (( j = i + 1; j < len; j++ )); do
        b="${files[j]}"
        join -1 2 -2 2 "$a" "$b" | cut -d ' ' -f1 > "$ab$b"
    done
done
But consider that bash was not designed for such intensive tasks with half a million iterations. There might be a better (more efficient) way to accomplish what you want.
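One possible direction, as a sketch rather than a drop-in replacement: join expects both inputs to be sorted on the join field, so if the files are not already sorted on column 2, it may pay to sort each file once up front instead of once per pair. The function name join_all_pairs and its two directory arguments are made up for this example, and mktemp -d is assumed to be available.

```shell
# Sketch only: join_all_pairs and its two directory arguments are
# hypothetical names, not part of the question or the answers.
# The body runs in a subshell, so its variables do not leak to the caller.
join_all_pairs() (
    src=$1    # directory holding the input *.txt files
    out=$2    # existing directory that receives the pair files
    sorted=$(mktemp -d) || exit 1

    # join expects both inputs sorted on the join field, so sort each
    # file once on column 2 instead of once per pair.
    for f in "$src"/*.txt; do
        LC_ALL=C sort -k 2b,2 "$f" > "$sorted/${f##*/}"
    done

    # Take the first remaining file as the left side, then pair it with
    # every file after it, so each unordered pair is joined exactly once.
    set -- "$sorted"/*.txt
    while [ "$#" -gt 1 ]; do
        a=$1
        shift
        for b in "$@"; do
            name=${a##*/}
            LC_ALL=C join -1 2 -2 2 "$a" "$b" | cut -d ' ' -f1 \
                > "$out/${name%.txt}${b##*/}"
        done
    done

    rm -rf "$sorted"
)
```

With the three-file example, running mkdir pairs and then join_all_pairs . pairs should leave AB.txt, AC.txt and BC.txt in pairs/. This still spawns one join and one cut per pair, so it only removes repeated sorting, not the half a million process launches.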
CodePudding user response:
It looks like what you are after can be accomplished with two nested for loops and a lexicographic comparison that keeps each pair in alphabetical order:
# prints pairs of filenames
for f in dir/*; do
    for g in dir/*; do
        if [[ "$f" < "$g" ]]; then # ensure alphabetical order
            echo "$f" "$g"
        fi
    done
done
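The loop above only prints the pairs. A minimal sketch of plugging the join from the question into it could look like this, assuming bash (for the [[ ... < ... ]] test) and assuming every file is already sorted on column 2, as join requires. The helper name join_ordered_pairs and the separate output directory are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: join_ordered_pairs is a hypothetical helper name.
# Assumes every *.txt file in "$1" is already sorted on column 2.
# "$2" should be a different, existing directory: the inner glob is
# re-expanded on every outer iteration, so results written into "$1"
# would be picked up as inputs.
join_ordered_pairs() {
    local dir=$1 out=$2 f g base
    for f in "$dir"/*.txt; do
        for g in "$dir"/*.txt; do
            # keep only the pairs that are in alphabetical order
            if [[ "$f" < "$g" ]]; then
                base=${f##*/}
                join -1 2 -2 2 "$f" "$g" | cut -d ' ' -f1 \
                    > "$out/${base%.txt}${g##*/}"
            fi
        done
    done
}
```

For example, join_ordered_pairs dir results would write results/AB.txt, results/AC.txt and results/BC.txt for the three files in the question. Note that this variant still walks all n × n combinations and discards about half of them in the test.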