Loop over a number of files to carry out filtering

Time:10-30

I have a list of files (15 in total) with different filenames, all in the same format except for the 4th word, highlighted in bold.

Late_Tox_GWAS.**TelangiectasiaG1**_resid.glm.linear
Late_Tox_GWAS.**AtrophyG1**_resid.glm.linear
Late_Tox_GWAS.**AtrophyG2**_resid.glm.linear
Late_Tox_GWAS.**IndurationG1**_resid.glm.linear

etc...

All of these files are located in /scrat/genome/hj86/Late_Tox_GWAS/*_resid.glm.linear

All of these files have the same number of columns with the same column names. I want to filter each of these files on column 7 for 'ADD'. I have run the sample command below to filter on column 7; it should run for every file and save the result into a separate, correspondingly named file, e.g. AtrophyG1_resid_ADD, then AtrophyG2_resid_ADD, etc.

I am new to loops and didn't know how to code this so that each individual toxicity is read in, or how to specify the unique part of the file name so that each file gets processed and the results are saved under the corresponding unique file name. I would be grateful for any help.

#!/bin/bash
#PBS -N Late_Tox_regression_ADD
#PBS -l walltime=01:00:00
#PBS -l nodes=1:ppn=8
#PBS -l vmem=16gb
#PBS -m bea
#PBS -M my email address
set -x


for fname in /scrat/genome/hj86/Late_Tox_GWAS/*_resid.glm.linear

do
    tox="${fname#*.}"                      
    tox="${tox%%_*}"                         

    awk 'NR==1 || $7 == "ADD"{print}' "${fname}" > "${tox}_resid_ADD"

done


I don't get any output, just a file that says:

  for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
  tox=AtrophyG1_resid.glm.linear
  tox=AtrophyG1
  awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.AtrophyG1_resid.glm.linear
  for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
  tox=AtrophyG2_resid.glm.linear
  tox=AtrophyG2
  awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.AtrophyG2_resid.glm.linear
  for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
  tox=IndurationG1_resid.glm.linear
  tox=IndurationG1
  awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.IndurationG1_resid.glm.linear
  for fname in '/scratch/genomeqol/hkj7/Late_Tox_GWAS/*_resid.glm.linear'
  tox=Induration_G2_resid.glm.linear
  tox=Induration
  awk 'NR==1 || $7 == "ADD"{print}' /scratch/genomeqol/hkj7/Late_Tox_GWAS/Late_Tox_GWAS.Induration_G2_resid.glm.linear

CodePudding user response:

A bit confused as to where the files are located and how many we're processing:

  • FILEPATH/${tox}/*.glm.linear* would seem to indicate there's a separate sub-directory for each ${tox} but possibly several files in said sub-directory
  • for entry in FILEPATH/${tox}/*.glm.linear* would seem to imply there may be several files to process in this one directory (FILEPATH/${tox}) but entry is never referenced anywhere else in the code so ...
  • we could end up processing the file named Late_Tox_GWAS.${tox}_resid.glm.linear multiple times (i.e., once for each entry=*.glm.linear* file)

Assumptions:

  • OP knows how to locate the list of files to be processed (for the sample code I'm going to use a find command as an example)
  • all output is written to the 'current' directory (otherwise the sample code can be modified to write to the correct directory)
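
On the second assumption: under PBS a batch job usually starts in the user's home directory rather than in the directory the job was submitted from, so the 'current' directory may not be where the output is expected. A minimal sketch of writing to an explicit directory instead, using a plain glob loop like the one in the question. The temp directories, the ADD_filtered directory name and the sample rows are all made up for illustration; PBS_O_WORKDIR is the variable PBS sets to the submission directory.

```shell
#!/bin/bash
# Sketch only: the directories and sample rows below are made up for illustration.
indir="$(mktemp -d)"                        # stands in for the real input directory
workdir="${PBS_O_WORKDIR:-$(mktemp -d)}"    # submission directory under PBS, temp directory elsewhere
outdir="${workdir}/ADD_filtered"            # hypothetical output directory name
mkdir -p "${outdir}"

# Tiny sample input: a header, one ADD row and one non-ADD row (column 7 holds the test type).
printf '%s\n' \
    'CHROM POS ID REF ALT A1 TEST BETA' \
    '1 100 rs1 A G G ADD 0.12' \
    '1 100 rs1 A G G AGE 0.03' \
    > "${indir}/Late_Tox_GWAS.AtrophyG1_resid.glm.linear"

for fname in "${indir}"/*_resid.glm.linear
do
    base="${fname##*/}"                     # drop the directory part, which may itself contain "."
    tox="${base#*.}"                        # strip up to and including the first "."
    tox="${tox%%_*}"                        # strip from the first "_" to the end
    awk 'NR==1 || $7 == "ADD"' "${fname}" > "${outdir}/${tox}_resid_ADD"
done

ls -1 "${outdir}"
```

Giving the output an explicit directory means the result no longer depends on where the scheduler happens to start the job.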

One idea is to use parameter expansion to pull the desired string from the file name, then use it to run OP's awk script:

while read -r fname
do
    tox="${fname#*.}"                        # strip off all characters from the front of the string up to and including the first "."
    tox="${tox%%_*}"                         # strip off all characters from the first "_" to the end of the string

    awk 'NR==1 || $7 == "ADD"{print}' "${fname}" > "${tox}_resid_ADD"

done < <(find FILEPATH -name "*.glm.linear" -type f)
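
The two expansions above rely on the first "." and the first "_" after it, which can misfire in two cases: a directory name containing "." (for example a path starting with "./"), and a toxicity name containing "_" (the question's trace shows Induration_G2 being cut down to Induration). A possible variant, assuming every file name has the fixed prefix Late_Tox_GWAS. and the fixed suffix _resid.glm.linear, removes exactly those pieces. The two paths in the loop are hypothetical strings used only to show the expansions; no files are read.

```shell
#!/bin/bash
# Sketch only: these paths are hypothetical and are treated as plain strings.
names=()
for fname in \
    /some/dir/Late_Tox_GWAS.AtrophyG1_resid.glm.linear \
    ./sub.dir/Late_Tox_GWAS.Induration_G2_resid.glm.linear
do
    base="${fname##*/}"                 # strip the directory part (everything up to the last "/")
    tox="${base#Late_Tox_GWAS.}"        # strip the fixed prefix
    tox="${tox%_resid.glm.linear}"      # strip the fixed suffix
    names+=("${tox}")
    printf '%s\n' "${tox}_resid_ADD"    # the output file name that would be used
done
# prints:
# AtrophyG1_resid_ADD
# Induration_G2_resid_ADD
```

Matching the literal prefix and suffix keeps any underscores or dots inside the toxicity name intact, at the cost of assuming the naming scheme never changes.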

In my environment where I replaced FILEPATH with dir3/sdir2 (the location of 4x *.glm.linear files) this code executed the following commands:

awk 'NR==1 || $7 == "ADD"{print}' dir3/sdir2/Late_Tox_GWAS.AtrophyG1_resid.glm.linear > AtrophyG1_resid_ADD
awk 'NR==1 || $7 == "ADD"{print}' dir3/sdir2/Late_Tox_GWAS.AtrophyG2_resid.glm.linear > AtrophyG2_resid_ADD
awk 'NR==1 || $7 == "ADD"{print}' dir3/sdir2/Late_Tox_GWAS.IndurationG1_resid.glm.linear > IndurationG1_resid_ADD
awk 'NR==1 || $7 == "ADD"{print}' dir3/sdir2/Late_Tox_GWAS.TelangiectasiaG1_resid.glm.linear > TelangiectasiaG1_resid_ADD

Resulting in the following files being created in my current directory:

$ ls -1 *resid*ADD
AtrophyG1_resid_ADD
AtrophyG2_resid_ADD
IndurationG1_resid_ADD
TelangiectasiaG1_resid_ADD
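
To confirm the filter did what was intended, one option is to count the data rows in each output file whose column 7 is not ADD; a count of 0 for every file suggests the filter held. This is a rough sketch: the temp directory and the sample rows are made up and stand in for the real *_resid_ADD files.

```shell
#!/bin/bash
# Sketch only: a made-up output file in a temp directory stands in for the real results.
resdir="$(mktemp -d)"
printf '%s\n' \
    'CHROM POS ID REF ALT A1 TEST BETA' \
    '1 100 rs1 A G G ADD 0.12' \
    '2 200 rs2 C T T ADD 0.40' \
    > "${resdir}/AtrophyG1_resid_ADD"

bad_total=0
for out in "${resdir}"/*_resid_ADD
do
    # Count data rows (everything after the header) whose column 7 is not "ADD".
    bad="$(awk 'NR > 1 && $7 != "ADD" {c++} END {print c+0}' "${out}")"
    echo "${out##*/}: ${bad} non-ADD data rows"
    bad_total=$(( bad_total + bad ))
done
```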