How to combine all files in a directory, adding their individual file names as a new column in final

Time:08-18

I have a directory with files that looks like this:

CCG02-215-WGS.format.flt.txt
CCG05-707-WGS.format.flt.txt
CCG06-203-WGS.format.flt.txt
CCG04-967-WGS.format.flt.txt
CCG05-710-WGS.format.flt.txt
CCG06-215-WGS.format.flt.txt

The contents of each file look like this:

1   9061390 14  93246140
1   58631131    2   31823410
1   108952511   3   110694548
1   168056494   19  23850376
etc...

The ideal output would be a single file, let's call it all-samples.format.flt.txt, containing the concatenation of all files plus an additional column showing which sample/file each row came from (with some minor formatting to remove the .format.flt.txt suffix):

1   9061390 14  93246140    CCG02-215-WGS
...
1   58631131    2   31823410    CCG05-707-WGS
...
1   108952511   3   110694548   CCG06-203-WGS
...
1   168056494   19  23850376    CCG04-967-WGS

Currently, I have the following code which works for individual files.

awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' CCG05-707-WGS.format.flt.txt

#OUTPUT

1   58631131    2   31823410    CCG05-707-WGS
...

However, when I try to apply it to all files using the star, it appends the first filename it finds to the rows of every file as the 5th column.

awk 'BEGIN{OFS="\t"; split(ARGV[1],f,".")}{print $1,$2,$3,$4,f[1]}' *

#OUTPUT, 5th column should be as seen in previous code block

1   9061390 14  93246140    CCG02-215-WGS
...
1   58631131    2   31823410    CCG02-215-WGS
...
1   108952511   3   110694548   CCG02-215-WGS
...
1   168056494   19  23850376    CCG02-215-WGS

I feel like the solution may just lie in adding an additional parameter to awk... but I'm not sure where to start.

Thanks!

UPDATE

Using awk's built-in (OOTB) variable FILENAME solved the issue, plus some elegant formatting logic for the file names.

Thanks, @RavinderSingh13!

awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt

CodePudding user response:

With your shown samples, please try the following awk code. It uses awk's built-in FILENAME variable. On the first line of each .txt file passed to the program (FNR==1), it copies FILENAME into file and removes everything from the first dot to the end of the name. The main block then prints each line followed by file (the file's name, trimmed as required).

awk '
BEGIN { OFS="\t" }
FNR==1{
  file=FILENAME
  sub(/\..*/,"",file)
}
{
  print $0,file
}
' *.txt

Or, in one-liner form, try the following awk code:

awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME;sub(/\..*/,"",file)} {print $0,file}' *.txt
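
To see why the FNR==1 rule works where the BEGIN block did not, here is a minimal sketch run against two throwaway files. The scratch directory, the one-line sample contents, and the `out` variable are assumptions made for illustration; only the awk program itself comes from the answer above.

```shell
# Build two small sample files in a scratch directory (hypothetical data).
tmp=$(mktemp -d)
printf '1\t9061390\t14\t93246140\n' > "$tmp/CCG02-215-WGS.format.flt.txt"
printf '1\t58631131\t2\t31823410\n' > "$tmp/CCG05-707-WGS.format.flt.txt"

# BEGIN runs once, before any input is read, so a name computed there never changes.
# FNR resets to 1 at the start of each file, so this rule refreshes "file" per input.
out=$(cd "$tmp" && awk 'BEGIN{OFS="\t"} FNR==1{file=FILENAME; sub(/\..*/,"",file)} {print $0,file}' *.txt)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Each printed row should end with the name of the file it came from, rather than the first file's name repeated.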

CodePudding user response:

You may use:

With any version of awk:

awk -v OFS='\t' 'FNR==1{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt

Or, in GNU awk (gawk), using BEGINFILE:

awk -v OFS='\t' 'BEGINFILE{split(FILENAME, a, /\./)} {print $0, a[1]}' *.txt
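
One caveat, as a hedged aside: cutting the name at the first dot also truncates a path such as ./CCG02-215-WGS.format.flt.txt to an empty string. If the files are passed with a directory prefix, a sketch along these lines strips the directory and only the known .format.flt.txt suffix instead. The `combine_samples` function name and the `./data` layout are hypothetical, not part of the answers above.

```shell
# combine_samples DIR: hypothetical helper. Prints every row of DIR/*.format.flt.txt
# followed by a tab and the sample name derived from the file name.
combine_samples() {
  awk -v OFS='\t' '
    FNR == 1 {
      sample = FILENAME
      sub(/^.*\//, "", sample)                # drop any leading directory
      sub(/\.format\.flt\.txt$/, "", sample)  # drop only the known suffix
    }
    { print $0, sample }
  ' "$1"/*.format.flt.txt
}

# Usage (paths are assumptions):
# combine_samples ./data > all-samples.format.flt.txt
```

Writing the combined file outside the input directory, as in the usage line, keeps it from matching the input glob on a later run.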