Concatenate files from a list of files in specific cases

I have a list of file names in a file called data.tsv. All files in a row belong to the same sample ID, and there can be up to 8 files per ID. I need to merge the files that end in "_1.[variable extension]" and, separately, the files that end in "_2.[variable extension]".

The data is below. Stack Overflow converts tabs to spaces, but the actual file is tab-delimited:

I5_fastq_path_1 I5_fastq_path_2 I5_fastq_path_3 I5_fastq_path_4 I5_fastq_path_5 I5_fastq_path_6 I5_fastq_path_7 I5_fastq_path_8                             
/some/path/to/directory/PD7597b_1_1.fastq.gz    /some/path/to/directory/PD7597b_1_2.fastq.gz    /some/path/to/directory/PD7597b_2_1.fastq.gz    /some/path/to/directory/PD7597b_2_2.fastq.gz    /some/path/to/directory/PD7597b_3_1.fastq.gz    /some/path/to/directory/PD7597b_3_2.fastq.gz    NA  NA                              
/some/path/to/directory/WTCHG_65902_709501_1.fastq.gz   /some/path/to/directory/WTCHG_65902_709501_2.fastq.gz   /some/path/to/directory/WTCHG_68106_709501_1.fastq.gz   /some/path/to/directory/WTCHG_68106_709501_2.fastq.gz   /some/path/to/directory/WTCHG_68107_709501_1.fastq.gz   /some/path/to/directory/WTCHG_68107_709501_2.fastq.gz   /some/path/to/directory/WTCHG_68108_709501_1.fastq.gz   /some/path/to/directory/WTCHG_68108_709501_1.fastq.gz                               
/some/path/to/directory/WTCHG_65902_702501_1.fastq.gz   /some/path/to/directory/WTCHG_65902_702501_2.fastq.gz   /some/path/to/directory/WTCHG_68106_702501_1.fastq.gz   /some/path/to/directory/WTCHG_68106_702501_2.fastq.gz   /some/path/to/directory/WTCHG_68107_702501_1.fastq.gz   /some/path/to/directory/WTCHG_68107_702501_2.fastq.gz   NA  NA                              
/some/path/to/directory/WTCHG_87945_712502_1.fastq.gz   /some/path/to/directory/WTCHG_87945_712502_2.fastq.gz   /some/path/to/directory/WTCHG_88506_712502_1.fastq.gz   /some/path/to/directory/WTCHG_88506_712502_2.fastq.gz   /some/path/to/directory/WTCHG_88507_712502_1.fastq.gz   /some/path/to/directory/WTCHG_88507_712502_2.fastq.gz   NA  NA                              
/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_1.fq.gz    /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_2.fq.gz    /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_1.fq.gz    /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_2.fq.gz    NA  NA  NA  NA                              
/some/path/to/directory/WTCHG_65902_710501_1.fastq.gz   /some/path/to/directory/WTCHG_65902_710501_2.fastq.gz   /some/path/to/directory/WTCHG_68106_710501_1.fastq.gz   /some/path/to/directory/WTCHG_68106_710501_2.fastq.gz   /some/path/to/directory/WTCHG_68107_710501_1.fastq.gz   /some/path/to/directory/WTCHG_68107_710501_2.fastq.gz   NA  NA                              
/some/path/to/directory/NG178_S1_R1_001.fastq.gz    /some/path/to/directory/NG178_S1_R2_001.fastq.gz    /some/path/to/directory/NG178_S3_R1_001.fastq.gz    /some/path/to/directory/NG178_S3_R2_001.fastq.gz    NA  NA  NA  NA                              
/some/path/to/directory/NG232_S8_R1_001.fastq.gz    /some/path/to/directory/NG232_S8_R2_001.fastq.gz    /some/path/to/directory/NG232_S2_R1_001.fastq.gz    /some/path/to/directory/NG232_S2_R2_001.fastq.gz    NA  NA  NA  NA                              
/some/path/to/directory/NG367_S19_R1_001.fastq.gz   /some/path/to/directory/NG367_S19_R2_001.fastq.gz   /some/path/to/directory/NG367_S6_R1_001.fastq.gz    /some/path/to/directory/NG367_S6_R2_001.fastq.gz    NA  NA  NA  NA                              
/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_1.fq.gz  /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_2.fq.gz  /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_1.fq.gz  /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_2.fq.gz  /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_1.fq.gz  /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_2.fq.gz  NA  NA

If it is easier, you can render the HTML below; when the rendered table is copied into a text editor, it is tab-separated.

<table>
    <thead>
        <tr>
            <td>I5_fastq_path_1</td>
            <td>I5_fastq_path_2</td>
            <td>I5_fastq_path_3</td>
            <td>I5_fastq_path_4</td>
            <td>I5_fastq_path_5</td>
            <td>I5_fastq_path_6</td>
            <td>I5_fastq_path_7</td>
            <td>I5_fastq_path_8</td>
        </tr>
    </thead>
    <tr>
        <td>/some/path/to/directory/PD7597b_1_1.fastq.gz</td>
        <td>/some/path/to/directory/PD7597b_1_2.fastq.gz</td>
        <td>/some/path/to/directory/PD7597b_2_1.fastq.gz</td>
        <td>/some/path/to/directory/PD7597b_2_2.fastq.gz</td>
        <td>/some/path/to/directory/PD7597b_3_1.fastq.gz</td>
        <td>/some/path/to/directory/PD7597b_3_2.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/WTCHG_65902_709501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_65902_709501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_709501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_709501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_709501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_709501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68108_709501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68108_709501_1.fastq.gz</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/WTCHG_65902_702501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_65902_702501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_702501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_702501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_702501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_702501_2.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/WTCHG_87945_712502_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_87945_712502_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_88506_712502_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_88506_712502_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_88507_712502_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_88507_712502_2.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_1.fq.gz</td>
        <td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_2.fq.gz</td>
        <td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_1.fq.gz</td>
        <td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_2.fq.gz</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/WTCHG_65902_710501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_65902_710501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_710501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68106_710501_2.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_710501_1.fastq.gz</td>
        <td>/some/path/to/directory/WTCHG_68107_710501_2.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/NG178_S1_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG178_S1_R2_001.fastq.gz</td>
        <td>/some/path/to/directory/NG178_S3_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG178_S3_R2_001.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/NG232_S8_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG232_S8_R2_001.fastq.gz</td>
        <td>/some/path/to/directory/NG232_S2_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG232_S2_R2_001.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/some/path/to/directory/NG367_S19_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG367_S19_R2_001.fastq.gz</td>
        <td>/some/path/to/directory/NG367_S6_R1_001.fastq.gz</td>
        <td>/some/path/to/directory/NG367_S6_R2_001.fastq.gz</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_1.fq.gz</td>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_2.fq.gz</td>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_1.fq.gz</td>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_2.fq.gz</td>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_1.fq.gz</td>
        <td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_2.fq.gz</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
</table>

If there are 4 columns per line, then the files in cols 1 and 3 are merged and files in cols 2 and 4 are merged.

If there are 6 columns per line, then files in cols 1, 3, 5 are merged and files in 2, 4, and 6 are merged.

This pattern continues up to 8 columns. Please note that not all rows contain the same number of files, but the number of files in a row is always even.
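The rule above can be sketched in bash without fixing the number of columns. This is only a minimal illustration, assuming tab-separated rows in which NA marks the end of the real entries; the function name split_row and the short file names are made up for the example and do not come from data.tsv.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split one tab-separated row into the two merge groups.
# Prints the odd-column paths on the first line and the even-column paths on
# the second line, stopping at the first NA or empty field.
split_row() {
  local row=$1 i
  local -a fields odd=() even=()

  # Split the row on tabs only, so the field count is not hard-coded
  IFS=$'\t' read -ra fields <<<"$row"

  for i in "${!fields[@]}"; do
    [[ ${fields[i]} == NA || -z ${fields[i]} ]] && break
    if (( i % 2 )); then
      even+=("${fields[i]}")   # array index 1, 3, 5 = columns 2, 4, 6
    else
      odd+=("${fields[i]}")    # array index 0, 2, 4 = columns 1, 3, 5
    fi
  done

  # "${array[*]}" joins the elements with a single space
  printf '%s\n' "${odd[*]}" "${even[*]}"
}

# Made-up row with four real files and two NA placeholders
split_row $'a_1.fq.gz\ta_2.fq.gz\tb_1.fq.gz\tb_2.fq.gz\tNA\tNA'
# prints:
# a_1.fq.gz b_1.fq.gz
# a_2.fq.gz b_2.fq.gz
```

Because the loop runs over however many fields the row has, the same function covers rows with 4, 6, 8 or more columns.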

I have managed to do this for 4 columns, but I want a way to do it for 6, 8, and potentially more columns without hard-coding it. My hard-coded attempt is below:

while read -r line; do
    path1=$(echo "$line" | cut -f1)
    path2=$(echo "$line" | cut -f2)
    path3=$(echo "$line" | cut -f3)
    path4=$(echo "$line" | cut -f4)

    # extract sample ID and append 'merged'
    ID=$(echo "$path1" | cut -d "." -f1 | sed 's#.*/##' | sed 's/_1/_merged/')

    # merge files
    cat "$path1" "$path3" > "/research/merged_fq/${ID}_1.fq.gz"
    cat "$path2" "$path4" > "/research/merged_fq/${ID}_2.fq.gz"

    # create file with new file paths in
    a="/research/merged_fq/${ID}_1.fq.gz"
    b="/research/merged_fq/${ID}_2.fq.gz"
    echo "$a" "$b" >> new_files
done < data.tsv

CodePudding user response：

I fixed your script and added comments throughout, so you can see how it works in detail:

#!/usr/bin/env bash

# Path where to store merged fq files
mergePath='/research/merged_fq'

{
  # Print header for merged files tsv
  printf 'MergedOddFile\tMergedEvenFile\n'

  # Read in dummy variable to skip header row of data.tsv
  read -r _

  # Iterate reading rows of data until End Of File
  while read -r row; do
    # Map row fields to files array
    read -ra files <<<"$row"

    # Initialize arrays for odd and even files
    oddFiles=()
    evenFiles=()

    # Derive the '<sample ID>_merged' prefix and the trailing suffix
    # for the odd and even fastq files

    # Remove path to get file name only
    oddBaseName="${files[0]##*/}"

    # Remove everything from the last '_' onward and add '_merged'
    oddID="${oddBaseName%_*}_merged"

    # Remove everything up to and including the last '_'
    oddSuffix="${oddBaseName##*_}"

    # Strip all the extensions, like .fq.gz
    oddSuffix="${oddSuffix%%.*}"

    # Same for even
    evenBaseName="${files[1]##*/}"
    evenID="${evenBaseName%_*}_merged"
    evenSuffix="${evenBaseName##*_}"
    evenSuffix="${evenSuffix%%.*}"

    # Iterate the files array indexes
    for i in "${!files[@]}"; do

      # Stop processing files when name is 'NA' or empty
      [[ 'NA' == "${files[i]}" || -z "${files[i]}" ]] && break

      # Add file to corresponding merge array
      if ((i % 2)); then
        evenFiles+=("${files[i]}")
      else
        oddFiles+=("${files[i]}")
      fi
    done

    # Compose merged file names
    oddMerge="${mergePath}/${oddID}_${oddSuffix}.fq.gz"
    evenMerge="${mergePath}/${evenID}_${evenSuffix}.fq.gz"

    # Merge odd and even files separately
    cat "${oddFiles[@]}" > "$oddMerge"
    cat "${evenFiles[@]}" > "$evenMerge"

    # Log merged files
    printf '%s\t%s\n' "$oddMerge" "$evenMerge"
  done
} < data.tsv > merged_files.tsv
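
To see which output name a given input path leads to, the parameter expansions used in the script can be tried on their own. The following is only a small sketch; the function name merged_name is hypothetical and exists just for this demonstration. It applies the same four expansions to a single path and prints the resulting file name without the mergePath directory.

```shell
#!/usr/bin/env bash

# Hypothetical helper that applies the same expansions as the script:
# prints the merged file name that a given input path would produce.
merged_name() {
  local base=${1##*/}          # drop the directory part
  local id=${base%_*}_merged   # drop the last '_' and what follows, add '_merged'
  local suffix=${base##*_}     # keep only what follows the last '_'
  suffix=${suffix%%.*}         # drop every extension (.fastq.gz, .fq.gz)
  printf '%s_%s.fq.gz\n' "$id" "$suffix"
}

merged_name /some/path/to/directory/PD7597b_1_1.fastq.gz
# prints: PD7597b_1_merged_1.fq.gz

merged_name /some/path/to/directory/NG178_S1_R2_001.fastq.gz
# prints: NG178_S1_R2_merged_001.fq.gz
```

Note that for names like NG178_S1_R2_001.fastq.gz the text after the last '_' is 001 for both reads, so the odd and even outputs are told apart by the R1 or R2 kept in the ID part rather than by the suffix.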