How can I do this in Awk? I have several files with two columns and identical values in first column


I am an inexperienced Awk user but know that Awk is an efficient choice for processing many files. I would be grateful if someone would please point me in the right direction.

I have a directory called parent. Inside it are more directories named 1, 2, 3, 4, .... Inside each of those directories is a directory called angles. Inside angles is a file called angle_A_B_C.dat, as shown below.

parent
  1
     angles
       angle_A_B_C.dat
  2
     angles
       angle_A_B_C.dat
  3
     angles
       angle_A_B_C.dat
  4
     angles
       angle_A_B_C.dat
  ...

The files angle_A_B_C.dat all have the same number of rows (91) and an identical first column. Only the values in the second column are distinct. Here is an example of one angle_A_B_C.dat file:

# Deg[°]         Angle[A ,B ,C ] 
     1.000        0.0000000000
     3.000        0.0000000000
     5.000        0.0000000000
     7.000        0.0000000000
     9.000        0.0000000000
    11.000        0.0000000000
    13.000        0.0000000000
    15.000        0.0000000000
    17.000        0.0000000000
    19.000        0.0000000000
    21.000        0.0000000000
    23.000        0.0000000000
    25.000        0.0000000000
    27.000        0.0000000000
    29.000        0.0000000000
    31.000        0.0000000000
    33.000        0.0000000000
    35.000        0.0000000000
    37.000        0.0000000000
    39.000        0.0000000000
    41.000        0.0000000000
    43.000        0.0000000000
    45.000        0.0000000000
    47.000        0.0000000000
    49.000        0.0000000000
    51.000        0.0000000000
    53.000        0.0000000000
    55.000        0.0000000000
    57.000        0.0000000000
    59.000        0.0000000000
    61.000        0.0000000000
    63.000        0.0000000000
    65.000        0.0000000000
    67.000        1.0309278351
    69.000        1.0309278351
    71.000        2.0618556701
    73.000        1.0309278351
    75.000        2.0618556701
    77.000        0.0000000000
    79.000        0.0000000000
    81.000        4.1237113402
    83.000        2.0618556701
    85.000        4.1237113402
    87.000        2.0618556701
    89.000        2.0618556701
    91.000        5.1546391753
    93.000        3.0927835052
    95.000        1.0309278351
    97.000        3.0927835052
    99.000        1.0309278351
   101.000        2.0618556701
   103.000        9.2783505155
   105.000        7.2164948454
   107.000        4.1237113402
   109.000        5.1546391753
   111.000        5.1546391753
   113.000        3.0927835052
   115.000        2.0618556701
   117.000        9.2783505155
   119.000        0.0000000000
   121.000        3.0927835052
   123.000        3.0927835052
   125.000        2.0618556701
   127.000        0.0000000000
   129.000        1.0309278351
   131.000        1.0309278351
   133.000        2.0618556701
   135.000        1.0309278351
   137.000        0.0000000000
   139.000        1.0309278351
   141.000        0.0000000000
   143.000        0.0000000000
   145.000        1.0309278351
   147.000        0.0000000000
   149.000        0.0000000000
   151.000        1.0309278351
   153.000        0.0000000000
   155.000        0.0000000000
   157.000        1.0309278351
   159.000        0.0000000000
   161.000        0.0000000000
   163.000        0.0000000000
   165.000        0.0000000000
   167.000        0.0000000000
   169.000        0.0000000000
   171.000        0.0000000000
   173.000        0.0000000000
   175.000        0.0000000000
   177.000        0.0000000000
   179.000        0.0000000000

I want to generate a file called anglesSummary.txt whose first column is the same as in the example above (and in every angle_A_B_C.dat file), and whose second column contains, in each row, the average of that row's second-column values across all of the files.

I roughly recall how to take the average of an entire column that's located in distinct files in distinct directories, but can't figure out how to deal with just one row at a time. Is this possible?

Here is where I am at present; the question marks show where I think I'm stuck.

cd parent
find . -name angle_A_B_C.dat -exec grep "Angle[A ,B ,C ]" {} + > anglesSummary.txt
my_output="$(awk '{ total += ??? } END { print total/NR }' anglesSummary.txt)"
echo "Average: $my_output" >> anglesSummary.txt

Update (response to markp-fuso comment)

What I want (please see comment at the row where the column 1 value is 15.000):

# Deg[°]         Angle[A ,B ,C ] 
     1.000        0.0000000000
     3.000        0.0000000000
     5.000        0.0000000000
     7.000        0.0000000000
     9.000        0.0000000000
    11.000        0.0000000000
    13.000        0.0000000000
    15.000        1.2222220000 # <-- Each row of column 2 is the average of the corresponding row's column 2 values across all files. So this particular value (1.222222) is the average of the column 2 values, across all files, in the row where column 1 is 15.000.
    17.000        0.0000000000
    19.000        0.0000000000
    21.000        0.0000000000
    23.000        0.0000000000
    25.000        0.0000000000
    27.000        0.0000000000
    29.000        0.0000000000
    31.000        0.0000000000
    33.000        0.0000000000
    35.000        0.0000000000
    ... (truncated)

What I currently get from my code is the average of the averages of column 2 from the individual angle_A_B_C.dat files.

If this is still unclear, please feel free to say so and I will rewrite it. Thank you.

CodePudding user response:

Sample input:

$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        0.3450000000
     3.000        0.4560000000
     5.000        0.7890000000
     7.000        10.0000000000
     9.000        20.0000000000
    11.000        30.0000000000
    13.000        40.0000000000

==> 2/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        7.3450000000
     3.000        8.4560000000
     5.000        9.7890000000
     7.000        17.0000000000
     9.000        27.0000000000
    11.000        37.0000000000
    13.000        47.0000000000

==> 3/angles/angle_A_B_C.dat <==
# Deg[°]         Angle[A ,B ,C ]
     1.000        0.9876000000
     3.000        0.5432000000
     5.000        0.2344560000
     7.000        3.0000000000
     9.000        4.0000000000
    11.000        5.0000000000
    13.000        6.0000000000

One GNU awk idea:

find . -name angle_A_B_C.dat -type f -exec awk '
NR==1   { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }   # 1st record of 1st file => print header
FNR==1  { filecount++; next }                                 # 1st record of each new file => increment file counter; skip to next input line
        { sums[$1]+=$2 }                                      # sum up angles, use 1st column as array index
END     { if (filecount>0) {                                  # eliminate "divide by zero" error if no files found
              PROCINFO["sorted_in"]="@ind_num_asc"            # sort array by numeric index in ascending order
              for (i in sums)                                 # loop through array indices, printing index and average
                  printf "%.3f\t%.10f\n", i, sums[i]/filecount
          }
        }
' {} +

This generates:

# Deg[°]         Angle[A ,B ,C ]
1.000   2.8925333333
3.000   3.1517333333
5.000   3.6041520000
7.000   10.0000000000
9.000   17.0000000000
11.000  24.0000000000
13.000  31.0000000000

NOTES:

  • GNU awk is required for PROCINFO["sorted_in"], which lets the output be generated in ascending # Deg[°] order (otherwise the output could be piped to sort to ensure the desired ordering; see the sketch after these notes)
  • output formatting can be tweaked to OP's liking by modifying the printf format string
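
For non-GNU awks, a rough equivalent of that sort-based alternative might look like the sketch below (untested here; it prints the header up front so that sort -n only has to order the data rows, and writes to the anglesSummary.txt name from the question):

{ printf '%s\t%s\n' '# Deg[°]' 'Angle[A ,B ,C ]'
  find . -name angle_A_B_C.dat -type f -exec awk '
  FNR==1  { filecount++; next }              # skip each file header line; count files
          { sums[$1]+=$2 }                   # accumulate column 2, keyed by column 1
  END     { for (i in sums)                  # index order is unspecified here ...
                printf "%.3f\t%.10f\n", i, sums[i]/filecount
          }
  ' {} + | sort -n                           # ... so order the data rows numerically instead
} > anglesSummary.txt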

CodePudding user response:

Tested with a 212 MB synthetic version of the concatenated input files (equivalent to a bit over 76,000 individual files); the entire report finishes in 2.23 seconds.

This solution aims to minimize rounding errors by storing intermediate values as integers (exact up to 2^53) rather than as double-precision floats, at the cost of some extra string operations that avoid an undesirable early conversion to floating point.

It also uses a brute-force method to work around the lack of built-in sorting in certain awks; the upside is that the rows of the input files can arrive in any order. A plainer sketch of the scaled-integer idea appears after the sample output below.

pvE0 <  test.txt \
\
| mawk2 '
  BEGIN {____=\
         _^=_=(_+=++_+_)^--_+--_;_____=!(_=!_)
   } {
      if (/Deg/) { ___[$_]++ } else {
           ___[$_____]=$_
           __[$_____]+=____*int(______=$NF) \
                         +substr(______,index(______,".")+_____)
   } } END { 
     for(_ in ___) { if(index(_,"Deg")) { 
         ______=___[_]
         print _
            break } }
      _________=match(________=sprintf("%.*s",\
                      index(____,!____),____),/$/) 
     for(_=_~_;_-________<=(_________+________)*\
                         (_________+________);_++) {
       if((_______=sprintf("%.*f",_________,_)) in __) { 
           _____=___[_______]
                 sub("[ \t]*[[:digit:]]+[.][[:digit:]]+$",
          sprintf("%c%*.*f",--________,++________+_________,
                    ________,__[_______]/______/____),_____)
        print _____ } } }' 



  in0:  212MiB 0:00:02 [97.1MiB/s] [97.1MiB/s] 
  [=============================>] 100%   
     
 # Deg[°]         Angle[A ,B ,C ] 
     1.000       7.5148221018
     3.000       7.4967176419
     5.000       7.5160005498
     7.000       7.4793862628
     9.000       7.5123479596
    11.000       7.4791082935
    13.000       7.4858962001
    15.000       7.4941294148
    17.000       7.5150168021
    19.000       7.5067556155
    21.000       7.5146136198
    23.000       7.4792701433
    25.000       7.4801382861
    27.000       7.5026906476
    29.000       7.4802267331
    31.000       7.5216754387
    33.000       7.4892379481
    35.000       7.4905661773
    37.000       7.4759338641
    39.000       7.5130521094
    41.000       7.4923359448
    43.000       7.4680275394
    45.000       7.5131741424
    47.000       7.5022641880
    49.000       7.4865545672
    51.000       7.5280509182
    53.000       7.4982720538
    55.000       7.5082048446
    57.000       7.5034726853
    59.000       7.4978429619
    61.000       7.5055566807
    63.000       7.5108651984
    65.000       7.5211276535
    67.000       7.4875763176
    69.000       7.4993074644
    71.000       7.5124084003
    73.000       7.5321662989
    75.000       7.4859560680
    77.000       7.4700932217
    79.000       7.5121024268
    81.000       7.5180572994
    83.000       7.4938736294
    85.000       7.5073566749
    87.000       7.4917927829
    89.000       7.5142626391
    91.000       7.5223228551
    93.000       7.5168014947
    95.000       7.4757822101
    97.000       7.5141328593
    99.000       7.4863544344
   101.000       7.5036731671
   103.000       7.5200733708
   105.000       7.4964541138
   107.000       7.5050440318
   109.000       7.4890049434
   111.000       7.5045965882
   113.000       7.5119613957
   115.000       7.5050971735
   117.000       7.4983417123
   119.000       7.4867090870
   121.000       7.5047947039
   123.000       7.4837043078
   125.000       7.4995212486
   127.000       7.5111280706
   129.000       7.5092771858
   131.000       7.4977679060
   133.000       7.5278372066
   135.000       7.4794945181
   137.000       7.5152681775
   139.000       7.4954245649
   141.000       7.5099441844
   143.000       7.4945221883
   145.000       7.4860083947
   147.000       7.4848234307
   149.000       7.4932545468
   151.000       7.4937942058
   153.000       7.4657789265
   155.000       7.4947049961
   157.000       7.5113607827
   159.000       7.4978364461
   161.000       7.5031970850
   163.000       7.5017955073
   165.000       7.5187543102
   167.000       7.5064268609
   169.000       7.4985988429
   171.000       7.5438396243
   173.000       7.4917706435
   175.000       7.4589904950
   177.000       7.5072644989
   179.000       7.5176241959
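
For readers who want the gist of the approach without the obfuscation, here is a rough, unobfuscated sketch of the same scaled-integer averaging idea (my own illustration under assumptions, not the answerer's actual code; it assumes every value has exactly 10 fractional digits, as in the sample files, and the readable-sketch.awk filename is made up):

# readable-sketch.awk -- illustrative only; assumes 10 fractional digits per value
/Deg/    { hdr = $0; files++; next }       # remember the header; count one file per header line
         { val = $NF
           # turn "7.5148221018" into the exact integer 75148221018 (i.e. value * 10^10)
           sum[$1] += 10000000000 * int(val) + substr(val, index(val, ".") + 1)
         }
END      { print hdr
           for (deg in sum)                # iteration order is unspecified; pipe to sort -n if needed
               printf "%11.3f    %14.10f\n", deg, sum[deg] / files / 10000000000
         }

It could be run over all files at once with something like awk -f readable-sketch.awk */angles/angle_A_B_C.dat | sort -n; the sums stay exact as long as they remain below 2^53.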