I need to find a way to display the missing numbers from a large text file. It's a web graph that has 875,713 vertices. However, when I sort the file, the largest number displayed at the end is 916,427, so some numbers are not being used as vertex indices. Is there a bash command I could use to list them?
I found this after searching around some other threads, but I'm not entirely sure if it's correct:
awk 'NR != $1 { for (i = prev + 1; i < $1; i++) {print i} } { prev = $1 + 1 }' file
CodePudding user response:
If you don't want to store the array in memory (otherwise @jared_mamrot's solution would work), you can use
awk 'NR==1 {p=$1; next} {for (i=p+1; i<$1; i++) {print i}; p=$1}' < <(sort -n file)
which sorts the file first.
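The same idea can be written as a plain pipeline, which may be handy if process substitution is not available in your shell. This is only a sketch: the function name `missing_between` is made up for illustration, and it assumes the vertex number is the first whitespace-separated field on every line. It only reports gaps between the smallest and largest number present, so anything missing below the first value is not listed.

```shell
# Hypothetical helper: print the integers that are absent from column 1
# of the given file, between the smallest and largest value present.
missing_between() {
  awk '{ print $1 }' "$1" |   # keep only the first column
    sort -n -u |              # numeric sort, duplicates dropped
    awk 'NR > 1 { for (i = prev + 1; i < $1; i++) print i }
         { prev = $1 }'       # remember the last value seen
}
```

Usage would be `missing_between file`. Only the previous value is held in awk's memory, so the cost is dominated by `sort`.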
CodePudding user response:
Assuming the 'number' of each vertex is in the first column, you can use:
awk '{a[$1]} END{for(i = 1; i <= 916427; i++){if(!(i in a)){print i}}}' file
E.g.
# create some example data and remove "10"
seq 916427 | sed '10d' > test.txt
head test.txt
1
2
3
4
5
6
7
8
9
11
awk '{a[$1]} END { for (i = 1; i <= 916427; i++) { if (!(i in a)) {print i}}}' test.txt
10
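If you would rather not hard-code 916427, the maximum can be tracked while reading the file. A rough sketch, with the same assumption that the vertex number is in the first column; the function name `missing_upto_max` and the optional start argument are made up for illustration:

```shell
# Hypothetical helper: list the integers in [start, max] that never appear
# in column 1. start defaults to 1; max is the largest value seen in the file.
missing_upto_max() {
  awk -v start="${2:-1}" '
    { seen[$1]; if ($1 + 0 > max) max = $1 + 0 }  # record each ID, track the largest
    END {
      for (i = start; i <= max; i++)
        if (!(i in seen)) print i
    }' "$1"
}
```

For example, `missing_upto_max test.txt` would check 1 up to the largest number in the file, and `missing_upto_max test.txt 0` would start from 0 instead. Like the one-liner above, this keeps every distinct number in memory.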