Compare multiple tsv files for match columns-CodePudding

I have 4466 .tsv files with this structure: file_structure

I want to compare the 4466 files to see how many IDs (the first column) matches. I only found bash commands with two files with "comm". Could you tell me how I could do that?

Thank you

CodePudding user response：

I read your question as:

Amongst all TSV files, which column IDs are found in every file?

If that's true, we want the intersection of all the sets of column IDs from all files. We can use the join command to get the intersection of any two files, and we can use the algebraic properites of an intersection to effectively join all files.

Consider the intersection of ID for these three files:

file1.tsv   file2.tsv   file3.tsv
---------   ---------   ---------
ID          ID          ID       
1           1           2        
2           3           3        
3

"3" is the only ID shared between all three. We can only join two files together at a time, so we need some way to effectively get, join (join file1.tsv file2.tsv) file3.tsv. Fortunately for us intersections are idempotent and associative, so we can apply join iteratively in a loop over all the files, like so:

# "Prime" the common file
cp file1.tsv common.tsv

for TSV in file*.tsv; do
    join "$TSV" common.tsv > myTmp
    mv myTmp common.tsv

    echo "After joining $TSV, common IDs are:"
    cat common.tsv
done

When I run that it prints the following:

After joining file1.tsv, common IDs are:
ID
1
2
3
After joining file2.tsv, common IDs are:
ID
1
3
After joining file3.tsv, common IDs are:
ID
3

The first iteration joins file1 with itself (because we primed common with file1); this is where we intersection to be idempotent
The second iteration joins in file2, cutting out ID "2"
The third iteration joins in file3, cutting ID down to just "3"

Technically, join considers the string "ID" to be one of the things to evaluate... it doesn't know what a header line is, or an what an ID is... it just knows to look in the some number of fields for common values. In that example we didn't specify a field so it defaulted to the first field, and it always found "ID" and it always found "3".

For your files, we need to tell join to:

separate on a tab character, with -t <TAB-CHAR>
only output the join field (which, by default, is the first field), with -o 0

Here's my full implementation:

#!/bin/sh
TAB="$(printf '\t')"

# myJoin joins tsvX with the previously-joined common on 
# the first field of both files; saving the the first field
# of the joined output back into common
myJoin() {
    tsvX="$1"

    join -t "$TAB" -o 0 common.tsv "$tsvX" > myTmp.tsv
    mv myTmp.tsv common.tsv
}

# "Prime" common
cp input1.tsv common.tsv

for TSV in input*.tsv; do
    myJoin "$TSV"
done

echo "The common IDs are:"
tail -n -1 common.tsv

For an explanation of why "$(printf '\t')", check out the following for POSIX compliance:

CodePudding user response：

The question sounds quite vague. So, assuming that you want to extract IDs that all 4466 files have in common, i.e. IDs such that each of them occurs at least once in all of the *.tsv files, you can do this (e.g.) in pure Bash using associative arrays and calculating “set intersections” on them.

#!/bin/bash

# removes all IDs from array $1 that do not occur in array $2.
intersect_ids() {
  local -n acc="$1"
  local -rn operand="$2"
  local id
  for id in "${!acc[@]}"; do
    ((operand["$id"])) || unset "acc['${id}']"
  done
}

# prints IDs that occur in all files called *.tsv in directory $1.
get_ids_intersection() (
  shopt -s nullglob
  local -ar files=("${1}/"*.tsv)
  local -Ai common_ids next_ids
  local file id _
  if ((${#files[@]})); then
    while read -r id _; do ((  common_ids["$id"])); done < "${files[0]}"
    for file in "${files[@]:1}"; do
      while read -r id _; do ((  next_ids["$id"])); done < "$file"
      intersect_ids common_ids next_ids
      next_ids=()
    done
  fi
  for id in "${!common_ids[@]}"; do printf '%s\n' "$id"; done
)

get_ids_intersection /directory/where/tsv/files/are

CodePudding user response：

Assuming you said "how many IDs (the first column) matches" means you want to find out which IDs have duplicated row in these files.
Try rq (https://github.com/fuyuncat/rquery/releases), one simple command line can do it.

[ rquery]$ ./rq -q "p d/\t/ | s @1, count(1) | g @1 | e @2>1" samples/*.tsv
M00002  3
M00003  2
M00006  2