I have a file containing thousands of records, grouped into sub-groups by the first six digits of the identity number they share, but some records are duplicates. I am trying to write a bash script that reads the file, finds the duplicate records along with the identity number they share, and prints each identity number followed by its duplicate records.
Current script:
#!/bin/bash
########## script to find duplicate records & their ID
INPUT="sourceFile.txt"
while read varName; do
    echo "$varName"
    if [ "$varName" = "NEXT" ]; then
        sort $INPUT | uniq -d
        echo "END OF ONE ID-NUMBER IN FILE"
    fi
done < "$INPUT"
Sample INPUT_FILE:
NEXT
123456-
# requesting: displayName
displayName: Alpha Beta
displayName: Charly Delta Echo
displayName: Xerox Yingyang Zenox
displayName: Xerox Yingyang Zenox
NEXT
123999-
# requesting: displayName
displayName: Golf Harvey Indigo
displayName: Jaguar Kingston Lambda
displayName: Alma Nano Matter
displayName: Oxygen Pascal Queen
displayName: Romeo Saint Tropez Unicorn
displayName: Vauxhall Wellignton Woolwhich
displayName: Rodrigo Compton Hilside
displayName: Vauxhall Wellignton Woolwhich
NEXT
DESIRED/EXPECTED OUTPUT:
NEXT
123456-
displayName: Xerox Yingyang Zenox
displayName: Xerox Yingyang Zenox
END OF ONE ID-NUMBER IN FILE
NEXT
123999-
displayName: Vauxhall Wellignton Woolwhich
displayName: Vauxhall Wellignton Woolwhich
Thank you in advance for any ideas and clues.
CodePudding user response:
I have no idea why you want the duplicate lines twice and I do not understand what the line "END OF ONE ID-NUMBER IN FILE" is doing in the middle of the output.
The following displays just the duplicates.
#! /bin/bash
read -r next; unset next              # consume the leading NEXT line
while true; do
    read -r id || break               # the identity-number line; stop at EOF
    read -r comment; unset comment    # consume the "# requesting:" line
    dns=()
    while read -r dn; do
        if [[ $dn =~ ^NEXT$ ]]; then
            printf 'NEXT\n'
            printf '%s\n' "$id"
            printf '%s\n' "${dns[@]}" | sort | uniq -d
            break
        else
            dns+=("$dn")
        fi
    done
done
If you really want to hard-code the name of the input file, you can add the following line at the beginning:
exec < sourceFile.txt
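For instance, assuming the script above is saved as find_dups.sh (a name chosen here purely for illustration), running it against the sample input should produce output along these lines (uniq -d prints each duplicated line once):
bash find_dups.sh < sourceFile.txt
NEXT
123456-
displayName: Xerox Yingyang Zenox
NEXT
123999-
displayName: Vauxhall Wellignton Woolwhich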
CodePudding user response:
sort obviously sorts the entire file. I would refactor this into a simple Awk script instead.
awk '/^NEXT/ { delete a;
      if(NR>1) { print ""; print "END OF ONE ID-NUMBER IN FILE"; print ""; }
      id=""; print; next }
    id == "" { id = $0; print; next }
    !/^displayName:/ { next }
    $0 in a { print; if (a[$0] == 1) print; }
    { a[$0]++ }' sourceFile.txt
This should be reasonably straightforward once you familiarize yourself with the basics of Awk. But in brief, we keep an associative array a where we remember which displayName: lines we have already seen, and when we see a duplicate, we print (the original if it wasn't printed already, and) the latest occurrence.
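As a minimal illustration of that idiom on its own (separate from the script above): a[$0]++ evaluates to the number of times the current line has been seen before the increment, so a one-liner like
awk 'a[$0]++ == 1' file.txt
prints each line exactly when it appears for the second time. Here file.txt is just a placeholder name.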
Some of this is slightly ugly because your requirements are rather unattractive; perhaps a better design would be to print only the actual duplicates with their associated ID number on the same line.
awk '/^NEXT/ { delete a; id=""; next }
    id == "" { id = $0; next }
    !/^displayName:/ { next }
    $0 in a { if(a[$0] == 1) print id ":" $0 }
    { a[$0]++ }' sourceFile.txt
The fact that something is duplicated is already sufficient, so we only print the second occurrence of anything within a record.
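For reference, running this variant against the sample input above should yield something like:
123456-:displayName: Xerox Yingyang Zenox
123999-:displayName: Vauxhall Wellignton Woolwhich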