I have a list of file names in a file called data.tsv. All files in a row belong to the same sample ID, and there can be up to 8 files per ID. I need to merge the files that end in "_1.[variable extension]" and, separately, the files that end in "_2.[variable extension]".
Data below. Note that Stack Overflow converts tabs to spaces; the real file is tab-delimited:
I5_fastq_path_1 I5_fastq_path_2 I5_fastq_path_3 I5_fastq_path_4 I5_fastq_path_5 I5_fastq_path_6 I5_fastq_path_7 I5_fastq_path_8
/some/path/to/directory/PD7597b_1_1.fastq.gz /some/path/to/directory/PD7597b_1_2.fastq.gz /some/path/to/directory/PD7597b_2_1.fastq.gz /some/path/to/directory/PD7597b_2_2.fastq.gz /some/path/to/directory/PD7597b_3_1.fastq.gz /some/path/to/directory/PD7597b_3_2.fastq.gz NA NA
/some/path/to/directory/WTCHG_65902_709501_1.fastq.gz /some/path/to/directory/WTCHG_65902_709501_2.fastq.gz /some/path/to/directory/WTCHG_68106_709501_1.fastq.gz /some/path/to/directory/WTCHG_68106_709501_2.fastq.gz /some/path/to/directory/WTCHG_68107_709501_1.fastq.gz /some/path/to/directory/WTCHG_68107_709501_2.fastq.gz /some/path/to/directory/WTCHG_68108_709501_1.fastq.gz /some/path/to/directory/WTCHG_68108_709501_1.fastq.gz
/some/path/to/directory/WTCHG_65902_702501_1.fastq.gz /some/path/to/directory/WTCHG_65902_702501_2.fastq.gz /some/path/to/directory/WTCHG_68106_702501_1.fastq.gz /some/path/to/directory/WTCHG_68106_702501_2.fastq.gz /some/path/to/directory/WTCHG_68107_702501_1.fastq.gz /some/path/to/directory/WTCHG_68107_702501_2.fastq.gz NA NA
/some/path/to/directory/WTCHG_87945_712502_1.fastq.gz /some/path/to/directory/WTCHG_87945_712502_2.fastq.gz /some/path/to/directory/WTCHG_88506_712502_1.fastq.gz /some/path/to/directory/WTCHG_88506_712502_2.fastq.gz /some/path/to/directory/WTCHG_88507_712502_1.fastq.gz /some/path/to/directory/WTCHG_88507_712502_2.fastq.gz NA NA
/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_1.fq.gz /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_2.fq.gz /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_1.fq.gz /some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_2.fq.gz NA NA NA NA
/some/path/to/directory/WTCHG_65902_710501_1.fastq.gz /some/path/to/directory/WTCHG_65902_710501_2.fastq.gz /some/path/to/directory/WTCHG_68106_710501_1.fastq.gz /some/path/to/directory/WTCHG_68106_710501_2.fastq.gz /some/path/to/directory/WTCHG_68107_710501_1.fastq.gz /some/path/to/directory/WTCHG_68107_710501_2.fastq.gz NA NA
/some/path/to/directory/NG178_S1_R1_001.fastq.gz /some/path/to/directory/NG178_S1_R2_001.fastq.gz /some/path/to/directory/NG178_S3_R1_001.fastq.gz /some/path/to/directory/NG178_S3_R2_001.fastq.gz NA NA NA NA
/some/path/to/directory/NG232_S8_R1_001.fastq.gz /some/path/to/directory/NG232_S8_R2_001.fastq.gz /some/path/to/directory/NG232_S2_R1_001.fastq.gz /some/path/to/directory/NG232_S2_R2_001.fastq.gz NA NA NA NA
/some/path/to/directory/NG367_S19_R1_001.fastq.gz /some/path/to/directory/NG367_S19_R2_001.fastq.gz /some/path/to/directory/NG367_S6_R1_001.fastq.gz /some/path/to/directory/NG367_S6_R2_001.fastq.gz NA NA NA NA
/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_1.fq.gz /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_2.fq.gz /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_1.fq.gz /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_2.fq.gz /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_1.fq.gz /research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_2.fq.gz NA NA
If it is easier, you can render the HTML below; the table is tab-separated when copied into a text editor.
<table>
<thead>
<tr>
<td>I5_fastq_path_1</td>
<td>I5_fastq_path_2</td>
<td>I5_fastq_path_3</td>
<td>I5_fastq_path_4</td>
<td>I5_fastq_path_5</td>
<td>I5_fastq_path_6</td>
<td>I5_fastq_path_7</td>
<td>I5_fastq_path_8</td>
</tr>
</thead>
<tr>
<td>/some/path/to/directory/PD7597b_1_1.fastq.gz</td>
<td>/some/path/to/directory/PD7597b_1_2.fastq.gz</td>
<td>/some/path/to/directory/PD7597b_2_1.fastq.gz</td>
<td>/some/path/to/directory/PD7597b_2_2.fastq.gz</td>
<td>/some/path/to/directory/PD7597b_3_1.fastq.gz</td>
<td>/some/path/to/directory/PD7597b_3_2.fastq.gz</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/WTCHG_65902_709501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_65902_709501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_709501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_709501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_709501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_709501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68108_709501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68108_709501_1.fastq.gz</td>
</tr>
<tr>
<td>/some/path/to/directory/WTCHG_65902_702501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_65902_702501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_702501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_702501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_702501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_702501_2.fastq.gz</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/WTCHG_87945_712502_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_87945_712502_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_88506_712502_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_88506_712502_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_88507_712502_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_88507_712502_2.fastq.gz</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_1.fq.gz</td>
<td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L5_2.fq.gz</td>
<td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_1.fq.gz</td>
<td>/some/path/to/directory/IFS003-W_DHE04956-7_HW3LFCCXX_L6_2.fq.gz</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/WTCHG_65902_710501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_65902_710501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_710501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68106_710501_2.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_710501_1.fastq.gz</td>
<td>/some/path/to/directory/WTCHG_68107_710501_2.fastq.gz</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/NG178_S1_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG178_S1_R2_001.fastq.gz</td>
<td>/some/path/to/directory/NG178_S3_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG178_S3_R2_001.fastq.gz</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/NG232_S8_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG232_S8_R2_001.fastq.gz</td>
<td>/some/path/to/directory/NG232_S2_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG232_S2_R2_001.fastq.gz</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/some/path/to/directory/NG367_S19_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG367_S19_R2_001.fastq.gz</td>
<td>/some/path/to/directory/NG367_S6_R1_001.fastq.gz</td>
<td>/some/path/to/directory/NG367_S6_R2_001.fastq.gz</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_1.fq.gz</td>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTN3CCXY_L1_2.fq.gz</td>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_1.fq.gz</td>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMTVYCCXY_L3_2.fq.gz</td>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_1.fq.gz</td>
<td>/research/HGIGData/2022_NGS_BACKUP/IBD/PR0006_NDHE04397-A5-A5_HMVCKCCXY_L5_2.fq.gz</td>
<td>NA</td>
<td>NA</td>
</tr>
</table>
If there are 4 columns per line, then the files in cols 1 and 3 are merged and files in cols 2 and 4 are merged.
If there are 6 columns per line, then files in cols 1, 3, 5 are merged and files in 2, 4, and 6 are merged.
This pattern continues up to 8 columns. Please note that not all rows have the same number of files, but the number is always even.
I have managed to do this successfully for 4 columns, but I want a way to do this for 6, 8, and potentially more columns without hard-coding it. My hard-coded attempt is below:
while read -r line; do
  path1=$(echo "$line" | cut -f1)
  path2=$(echo "$line" | cut -f2)
  path3=$(echo "$line" | cut -f3)
  path4=$(echo "$line" | cut -f4)
  # extract sample ID and append 'merged'
  ID=$(echo "$path1" | cut -d "." -f1 | sed 's#.*/##' | sed 's/_1/_merged/')
  # merge files
  cat ${path1} ${path3} > /research/merged_fq/${ID}_1.fq.gz
  cat ${path2} ${path4} > /research/merged_fq/${ID}_2.fq.gz
  # create file with new file paths in
  a=$(echo "/research/merged_fq/${ID}_1.fq.gz")
  b=$(echo "/research/merged_fq/${ID}_2.fq.gz")
  echo "$a" "$b" >>new_files
done < data.tsv
CodePudding user response:
I fixed your script and added comments throughout, so you can follow how it works in detail:
#!/usr/bin/env bash
# Path where to store merged fq files
mergePath='/research/merged_fq'
{
  # Print header for merged files tsv
  printf 'MergedOddFile\tMergedEvenFile\n'
  # Read into a dummy variable to skip the header row of data.tsv
  read -r _
  # Iterate reading rows of data until End Of File
  while read -r row; do
    # Map row fields to files array
    read -ra files <<<"$row"
    # Initialize arrays for odd and even files
    oddFiles=()
    evenFiles=()
    # Build the sample ID with '_merged' appended, and the suffix,
    # for the odd and even fastq files
    # Remove path to get file name only
    oddBaseName="${files[0]##*/}"
    # Remove everything from the last '_' onward and add '_merged'
    oddID="${oddBaseName%_*}_merged"
    # Remove everything up to and including the last '_'
    oddSuffix="${oddBaseName##*_}"
    # Strip all the extensions, like .fq.gz
    oddSuffix="${oddSuffix%%.*}"
    # Same for even
    evenBaseName="${files[1]##*/}"
    evenID="${evenBaseName%_*}_merged"
    evenSuffix="${evenBaseName##*_}"
    evenSuffix="${evenSuffix%%.*}"
    # Iterate the files array indexes
    for i in "${!files[@]}"; do
      # Stop processing files when name is 'NA' or empty
      [[ 'NA' == "${files[i]}" || -z "${files[i]}" ]] && break
      # Append file to the corresponding merge array
      # (array index 0 is column 1, so even indexes are odd columns)
      if ((i % 2)); then
        evenFiles+=("${files[i]}")
      else
        oddFiles+=("${files[i]}")
      fi
    done
    # Compose merged file names
    oddMerge="${mergePath}/${oddID}_${oddSuffix}.fq.gz"
    evenMerge="${mergePath}/${evenID}_${evenSuffix}.fq.gz"
    # Merge odd and even files separately
    cat "${oddFiles[@]}" > "$oddMerge"
    cat "${evenFiles[@]}" > "$evenMerge"
    # Log merged files
    printf '%s\t%s\n' "$oddMerge" "$evenMerge"
  done
} < data.tsv > merged_files.tsv
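Before running the merge on real data, it may help to check how a row would be grouped without writing any files. The following is only a minimal sketch of the same odd/even split used above; the function name `group_row` and the example file names are made up for illustration and are not part of the script. It assumes paths contain no tabs and that `NA` marks the end of a row's files.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print how one tab-delimited row would be
# grouped into odd and even columns, without touching any files.
group_row() {
  local row=$1 i
  local -a files odd=() even=()
  # Split the row on tabs into an array
  IFS=$'\t' read -ra files <<<"$row"
  for i in "${!files[@]}"; do
    # Stop at the first 'NA' or empty field
    [[ ${files[i]} == NA || -z ${files[i]} ]] && break
    if ((i % 2)); then
      even+=("${files[i]}")   # columns 2, 4, 6, ...
    else
      odd+=("${files[i]}")    # columns 1, 3, 5, ...
    fi
  done
  # Print each group on its own line, members separated by spaces
  printf 'odd: %s\n' "${odd[*]}"
  printf 'even: %s\n' "${even[*]}"
}

# Example row with made-up file names
group_row $'a_1.fq.gz\ta_2.fq.gz\tb_1.fq.gz\tb_2.fq.gz\tNA\tNA'
```

For the example row this prints `odd: a_1.fq.gz b_1.fq.gz` followed by `even: a_2.fq.gz b_2.fq.gz`. Feeding it each line of data.tsv (after the header) would show the planned grouping for 4, 6, or 8 columns before any `cat` is run.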