I have a list of files (15 in total) whose names all follow the same format, differing only in the 4th word, highlighted in bold.
Late_Tox_GWAS.**TelangiectasiaG1**_resid.glm.linear
Late_Tox_GWAS.**AtrophyG1**_resid.glm.linear
Late_Tox_GWAS.**AtrophyG2**_resid.glm.linear
Late_Tox_GWAS.**IndurationG1**_resid.glm.linear
etc...
All of these files are located in /scrat/genome/hj86/Late_Tox_GWAS/*_resid.glm.linear
All of these files have the same number of columns with the same column names. I want to filter each of these files on column 7 for 'ADD'. I have run the sample command below to filter on column 7; the intent is for this to happen for every file, with each result saved to a separate corresponding file, e.g. AtrophyG1_resid_ADD, then AtrophyG2_resid_ADD, etc.
I am new to loops and didn't know how to write the code so that each individual toxicity is read in, or how to specify the unique part of the file name so that each file gets processed and the results are saved under the corresponding unique file name. I would be grateful for any help.
#!/bin/bash
#PBS -N Late_Tox_regression_ADD
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=8
#PBS -l vmem=16gb
#PBS -m bea
#PBS -M my email address
set -x
for fname in /scrat/genome/hj86/Late_Tox_GWAS/*_resid.glm.linear
do
tox="${fname#*.}"
tox="${tox%%_*}"
awk 'NR==1 || $7 == "ADD"{print}' "${fname}" > "${tox}_resid_ADD"
done
I don't get any output, just a file that says:
for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
tox=AtrophyG1_resid.glm.linear
tox=AtrophyG1
awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.AtrophyG1_resid.glm.linear
for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
tox=AtrophyG2_resid.glm.linear
tox=AtrophyG2
awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.AtrophyG2_resid.glm.linear
for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
tox=IndurationG1_resid.glm.linear
tox=IndurationG1
awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.IndurationG1_resid.glm.linear
for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
tox=Induration_G2_resid.glm.linear
tox=Induration
awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.Induration_G2_resid.glm.linear
CodePudding user response:
A bit confused as to where the files are located and how many we're processing:
- FILEPATH/${tox}/*.glm.linear* would seem to indicate there's a separate sub-directory for each ${tox} but possibly several files in said sub-directory
- for entry in FILEPATH/${tox}/*.glm.linear* would seem to imply there may be several files to process in this one directory (FILEPATH/${tox}) but entry is never referenced anywhere else in the code so ...
- we could end up processing the file named Late_Tox_GWAS.{tox}_resid.glm.linear multiple times (ie, once for each entry=*.glm.linear* file)
Assumptions:
- OP knows how to locate the list of files to be processed (for the sample code I'm going to use a find command as an example)
- all output is written to the 'current' directory (otherwise the sample code can be modified to write to the correct directory)
One idea using parameter substitutions to pull the desired string from the file name, then using this to run OP's awk script:
while read -r fname
do
tox="${fname#*.}" # strip off all characters from the front of the string up to and including the first "."
tox="${tox%%_*}" # strip off all characters from the first "_" to the end of the string
awk 'NR==1 || $7 == "ADD"{print}' "${fname}" > "${tox}_resid_ADD"
done < <(find FILEPATH -name "*.glm.linear" -type f)
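To make the two parameter substitutions easier to follow, here is a small sketch that applies them to a single file name in isolation. The function name extract_tox is hypothetical and only used for illustration. One thing it makes visible: for a name like Late_Tox_GWAS.Induration_G2_resid.glm.linear the second substitution cuts at the first "_", so the result is Induration rather than Induration_G2, which matches the trace in the question.

```shell
#!/bin/bash
# extract_tox is a hypothetical helper name, used only for this illustration.
extract_tox() {
    local fname="$1"
    local tox="${fname#*.}"   # drop everything up to and including the first "."
    tox="${tox%%_*}"          # drop everything from the first "_" to the end
    printf '%s\n' "$tox"
}

extract_tox "dir3/sdir2/Late_Tox_GWAS.AtrophyG1_resid.glm.linear"
extract_tox "dir3/sdir2/Late_Tox_GWAS.Induration_G2_resid.glm.linear"
```

Note that the first substitution assumes there is no "." earlier in the path (for example a leading ./ from find .), since it cuts at the first "." it encounters.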
In my environment, where I replaced FILEPATH with dir3/sdir2 (the location of 4x *.glm.linear files), the while loop executed the following commands:
awk NR==1 || $7 == "ADD"{print} dir3/sdir2/Late_Tox_GWAS.AtrophyG1_resid.glm.linear > AtrophyG1_resid_ADD
awk NR==1 || $7 == "ADD"{print} dir3/sdir2/Late_Tox_GWAS.AtrophyG2_resid.glm.linear > AtrophyG2_resid_ADD
awk NR==1 || $7 == "ADD"{print} dir3/sdir2/Late_Tox_GWAS.IndurationG1_resid.glm.linear > IndurationG1_resid_ADD
awk NR==1 || $7 == "ADD"{print} dir3/sdir2/Late_Tox_GWAS.TelangiectasiaG1_resid.glm.linear > TelangiectasiaG1_resid_ADD
Resulting in the following files being created in my current directory:
$ ls -1 *resid*ADD
AtrophyG1_resid_ADD
AtrophyG2_resid_ADD
IndurationG1_resid_ADD
TelangiectasiaG1_resid_ADD
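If the output should go somewhere other than the current directory, the loop can be wrapped so that the destination is explicit. This may matter under PBS, where a job typically starts in the home directory rather than the directory it was submitted from, so the result files can end up somewhere unexpected. The following is only a sketch under those assumptions: filter_add, indir and outdir are hypothetical names, and it strips the fixed _resid.glm.linear suffix instead of cutting at the first "_", so a name like Induration_G2 is kept whole (the output would then be Induration_G2_resid_ADD rather than Induration_resid_ADD).

```shell
#!/bin/bash
# filter_add, indir and outdir are hypothetical names for this sketch.
filter_add() {
    local indir="$1" outdir="$2"
    local fname base tox
    mkdir -p "$outdir" || return 1
    for fname in "$indir"/*_resid.glm.linear; do
        [ -e "$fname" ] || continue       # skip if the glob matched nothing
        base="${fname##*/}"               # strip the directory part of the path
        tox="${base#*.}"                  # strip up to and including the first "."
        tox="${tox%_resid.glm.linear}"    # strip the fixed suffix only
        # keep the header line plus every row whose 7th column is ADD
        awk 'NR==1 || $7 == "ADD"' "$fname" > "$outdir/${tox}_resid_ADD"
    done
}

# Example call; both paths are placeholders to be adapted:
# filter_add FILEPATH "$PBS_O_WORKDIR/ADD_results"
```

Taking the base name first also means a "." somewhere in the directory path no longer affects the extracted toxicity name.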