I am an inexperienced Awk user, but I know that Awk is an efficient choice for processing many files. I would be grateful if someone could point me in the right direction.
I have a directory called parent. Inside it are more directories named 1, 2, 3, 4, ... Inside each of those directories is a directory called angles, and inside angles is a file called angle_A_B_C.dat, as shown below:
parent
    1
        angles
            angle_A_B_C.dat
    2
        angles
            angle_A_B_C.dat
    3
        angles
            angle_A_B_C.dat
    4
        angles
            angle_A_B_C.dat
    ...
The files angle_A_B_C.dat all have the same number of rows (91) and an identical first column. Only the values in the second column differ. Here is an example of one angle_A_B_C.dat file:
# Deg[°] Angle[A ,B ,C ]
1.000 0.0000000000
3.000 0.0000000000
5.000 0.0000000000
7.000 0.0000000000
9.000 0.0000000000
11.000 0.0000000000
13.000 0.0000000000
15.000 0.0000000000
17.000 0.0000000000
19.000 0.0000000000
21.000 0.0000000000
23.000 0.0000000000
25.000 0.0000000000
27.000 0.0000000000
29.000 0.0000000000
31.000 0.0000000000
33.000 0.0000000000
35.000 0.0000000000
37.000 0.0000000000
39.000 0.0000000000
41.000 0.0000000000
43.000 0.0000000000
45.000 0.0000000000
47.000 0.0000000000
49.000 0.0000000000
51.000 0.0000000000
53.000 0.0000000000
55.000 0.0000000000
57.000 0.0000000000
59.000 0.0000000000
61.000 0.0000000000
63.000 0.0000000000
65.000 0.0000000000
67.000 1.0309278351
69.000 1.0309278351
71.000 2.0618556701
73.000 1.0309278351
75.000 2.0618556701
77.000 0.0000000000
79.000 0.0000000000
81.000 4.1237113402
83.000 2.0618556701
85.000 4.1237113402
87.000 2.0618556701
89.000 2.0618556701
91.000 5.1546391753
93.000 3.0927835052
95.000 1.0309278351
97.000 3.0927835052
99.000 1.0309278351
101.000 2.0618556701
103.000 9.2783505155
105.000 7.2164948454
107.000 4.1237113402
109.000 5.1546391753
111.000 5.1546391753
113.000 3.0927835052
115.000 2.0618556701
117.000 9.2783505155
119.000 0.0000000000
121.000 3.0927835052
123.000 3.0927835052
125.000 2.0618556701
127.000 0.0000000000
129.000 1.0309278351
131.000 1.0309278351
133.000 2.0618556701
135.000 1.0309278351
137.000 0.0000000000
139.000 1.0309278351
141.000 0.0000000000
143.000 0.0000000000
145.000 1.0309278351
147.000 0.0000000000
149.000 0.0000000000
151.000 1.0309278351
153.000 0.0000000000
155.000 0.0000000000
157.000 1.0309278351
159.000 0.0000000000
161.000 0.0000000000
163.000 0.0000000000
165.000 0.0000000000
167.000 0.0000000000
169.000 0.0000000000
171.000 0.0000000000
173.000 0.0000000000
175.000 0.0000000000
177.000 0.0000000000
179.000 0.0000000000
I want to generate a file called anglesSummary.txt whose first column is the same as in the example above (and in all of the angle_A_B_C.dat files), and where each row of the second column is the average of the corresponding row of column 2 across all of the files.
I roughly recall how to take the average of an entire column that's located in distinct files in distinct directories, but can't figure out how to deal with just one row at a time. Is this possible?
Here is where I am at present; the question marks show where I think I'm stuck.
cd parent
find . -name angle_A_B_C.dat -exec grep "Angle[A ,B ,C ]" {} + > anglesSummary.txt
my_output="$(awk '{ total = ??? } END { print total/NR }' anglesSummary.txt)"
echo "Average: $my_output" >> anglesSummary.txt
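For two files named explicitly, I can picture per-row accumulation keyed on column 1, something like the sketch below, but I don't know how to extend it across all of the directories:

```shell
# Per-row sums for two explicitly named files (sketch) -- the array
# is keyed on column 1, and FNR > 1 skips each file's header line.
awk 'FNR > 1 { sum[$1] += $2 }
     END { for (k in sum) print k, sum[k] }' \
    1/angles/angle_A_B_C.dat 2/angles/angle_A_B_C.dat
```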
Update (response to markp-fuso comment)
What I want (please see the comment at the row where the column-1 value is 15.000):
# Deg[°] Angle[A ,B ,C ]
1.000 0.0000000000
3.000 0.0000000000
5.000 0.0000000000
7.000 0.0000000000
9.000 0.0000000000
11.000 0.0000000000
13.000 0.0000000000
15.000 1.2222220000 # <--Each row in column 2 is the average of the value in the corresponding row, column 2 in all files. So this particular value (1.222222) is the average of the values in all files where the column 1 value is 15.000.
17.000 0.0000000000
19.000 0.0000000000
21.000 0.0000000000
23.000 0.0000000000
25.000 0.0000000000
27.000 0.0000000000
29.000 0.0000000000
31.000 0.0000000000
33.000 0.0000000000
35.000 0.0000000000
... (truncated)
What I currently get from my code is the average of the averages of column 2 in each angle_A_B_C.dat file.
If this is still unclear, please feel free to say so and I will rewrite it. Thank you.
CodePudding user response:
Sample input:
$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.3450000000
3.000 0.4560000000
5.000 0.7890000000
7.000 10.0000000000
9.000 20.0000000000
11.000 30.0000000000
13.000 40.0000000000
==> 2/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 7.3450000000
3.000 8.4560000000
5.000 9.7890000000
7.000 17.0000000000
9.000 27.0000000000
11.000 37.0000000000
13.000 47.0000000000
==> 3/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.9876000000
3.000 0.5432000000
5.000 0.2344560000
7.000 3.0000000000
9.000 4.0000000000
11.000 5.0000000000
13.000 6.0000000000
One GNU awk idea:
find . -name angle_A_B_C.dat -type f -exec awk '
NR==1  { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }  # 1st record of 1st file => print header
FNR==1 { filecount++; next }                 # 1st record of each new file => increment file counter; skip to next input line
       { sums[$1]+=$2 }                      # sum up angles, use 1st column as array index
END    { if (filecount>0) {                  # eliminate "divide by zero" error if no files found
             PROCINFO["sorted_in"]="@ind_num_asc"  # sort array by numeric index in ascending order
             for (i in sums)                 # loop through array indices, printing index and average
                 printf "%.3f\t%.10f\n", i, sums[i]/filecount
         }
       }
' {} +
This generates:
# Deg[°] Angle[A ,B ,C ]
1.000 2.8925333333
3.000 3.1517333333
5.000 3.6041520000
7.000 10.0000000000
9.000 17.0000000000
11.000 24.0000000000
13.000 31.0000000000
NOTES:
- GNU awk is required for PROCINFO["sorted_in"], which lets the output be generated in ascending # Deg[°] order (otherwise the output could be piped to sort to ensure the desired ordering)
- the -exec terminator must be {} + (not \;) so that a single awk process sees all of the files; the NR==1 / FNR==1 distinction and the END averaging depend on that
- output formatting can be tweaked to OP's liking by modifying the printf format string
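For awks without PROCINFO["sorted_in"], the same logic with the output piped through sort (as mentioned above) might look like this sketch (header line omitted for brevity):

```shell
# Portable variant (sketch): any POSIX awk; sort -n restores numeric order.
find . -name angle_A_B_C.dat -type f -exec awk '
    FNR == 1 { nfiles++; next }        # count files, skip each header line
    { sum[$1] += $2 }                  # per-row sums keyed on column 1
    END { if (nfiles)
              for (k in sum) printf "%.3f\t%.10f\n", k, sum[k]/nfiles }
' {} + | sort -n
```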
CodePudding user response:
Tested with a 212 MB synthetic version of the concatenated input files (assuming a bit over 76,000 individual files); it finishes the entire report in 2.23 seconds.
This solution aims at minimizing rounding errors by storing intermediate values as unsigned integers up to 2^53 instead of as double-precision floats, using slightly more costly string ops to prevent undesirable pre-conversion to floating point.
It also uses a brute-force method to circumvent the limitations of awks that lack built-in sorting functionality. The upside: the rows in the input files can be in any chaotic order and the result is still fine.
pvE0 < test.txt \
\
| mawk2 '
BEGIN {____=\
_^=_=(_ = _ _)^--_ --_;_____=!(_=!_)
} {
  if (/Deg/) { ___[$_]++ } else {
    ___[$_____]=$_
    __[$_____]+=____*int(______=$NF)+\
                substr(______,index(______,".")+_____)
} } END {
  for(_ in ___) { if(index(_,"Deg")) {
      ______=___[_]
      print _
      break } }
  _________=match(________=sprintf("%.*s",\
            index(____,!____),____),/$/)
  for(_=_~_;_-________<=(_________+________)*\
                        (_________+________);_++) {
    if((_______=sprintf("%.*f",_________,_)) in __) {
      _____=___[_______]
      sub("[ \t]*[[:digit:]]+[.][[:digit:]]+$",
          sprintf("%c%*.*f",--________, ________+_________,
                  ________,__[_______]/______/____),_____)
      print _____ } } }'
in0: 212MiB 0:00:02 [97.1MiB/s] [97.1MiB/s]
[=============================>] 100%
# Deg[°] Angle[A ,B ,C ]
1.000 7.5148221018
3.000 7.4967176419
5.000 7.5160005498
7.000 7.4793862628
9.000 7.5123479596
11.000 7.4791082935
13.000 7.4858962001
15.000 7.4941294148
17.000 7.5150168021
19.000 7.5067556155
21.000 7.5146136198
23.000 7.4792701433
25.000 7.4801382861
27.000 7.5026906476
29.000 7.4802267331
31.000 7.5216754387
33.000 7.4892379481
35.000 7.4905661773
37.000 7.4759338641
39.000 7.5130521094
41.000 7.4923359448
43.000 7.4680275394
45.000 7.5131741424
47.000 7.5022641880
49.000 7.4865545672
51.000 7.5280509182
53.000 7.4982720538
55.000 7.5082048446
57.000 7.5034726853
59.000 7.4978429619
61.000 7.5055566807
63.000 7.5108651984
65.000 7.5211276535
67.000 7.4875763176
69.000 7.4993074644
71.000 7.5124084003
73.000 7.5321662989
75.000 7.4859560680
77.000 7.4700932217
79.000 7.5121024268
81.000 7.5180572994
83.000 7.4938736294
85.000 7.5073566749
87.000 7.4917927829
89.000 7.5142626391
91.000 7.5223228551
93.000 7.5168014947
95.000 7.4757822101
97.000 7.5141328593
99.000 7.4863544344
101.000 7.5036731671
103.000 7.5200733708
105.000 7.4964541138
107.000 7.5050440318
109.000 7.4890049434
111.000 7.5045965882
113.000 7.5119613957
115.000 7.5050971735
117.000 7.4983417123
119.000 7.4867090870
121.000 7.5047947039
123.000 7.4837043078
125.000 7.4995212486
127.000 7.5111280706
129.000 7.5092771858
131.000 7.4977679060
133.000 7.5278372066
135.000 7.4794945181
137.000 7.5152681775
139.000 7.4954245649
141.000 7.5099441844
143.000 7.4945221883
145.000 7.4860083947
147.000 7.4848234307
149.000 7.4932545468
151.000 7.4937942058
153.000 7.4657789265
155.000 7.4947049961
157.000 7.5113607827
159.000 7.4978364461
161.000 7.5031970850
163.000 7.5017955073
165.000 7.5187543102
167.000 7.5064268609
169.000 7.4985988429
171.000 7.5438396243
173.000 7.4917706435
175.000 7.4589904950
177.000 7.5072644989
179.000 7.5176241959