I've got a file with several columns, like so:
13:46:48 1.2.3.4:57 user1
13:46:49 5.6.7.8:58 user2
13:48:07 9.10.11.12:59 user3
I'd like to transform one of the columns by passing it as input to a program:
echo "1.2.3.4:57" | transformExternalIp
10.0.0.4:57
I wrote a small bit of awk to do this:
awk '{ ("echo " $2 " | transformExternalIp") | getline output; $2=output; print}'
But what I got surprised me. Initially, it looked like it was working as expected, but then I started to see weird repeated values. In order to debug, I removed my fancy "transformExternalIp" program in case it was the problem and replaced it with echo and cat, which means literally nothing should change:
awk '{ ("echo " $2 " | cat") | getline output; print $2 " - " output}' connections.txt
For the first thousand lines or so, the left and right sides matched, but then after that, the right side frequently stopped changing:
1.2.3.4:57 - 1.2.3.4:57
2.2.3.4:12 - 2.2.3.4:12
3.2.3.4:24 - 3.2.3.4:24
# .... (okay for a long while)
120.120.3.4:57 - 120.120.3.4:57
121.120.3.4:25 - 120.120.3.4:57
122.120.3.4:100 - 120.120.3.4:57
123.120.3.4:76 - 120.120.3.4:57
What the heck have I done wrong? I'm guessing that I'm misunderstanding something about awk.
CodePudding user response:
Close the command after each invocation to ensure a new copy of the command is run for the next line of input, e.g.:
awk '{ ("echo " $2 " | transformExternalIp") | getline output
close("echo " $2 " | transformExternalIp")
$2=output
print
}'
# or, to reduce issues from making a typo:
awk '{ cmd="echo " $2 " | transformExternalIp"
(cmd) | getline output
close(cmd)
$2=output
print
}'
For more details see this and this.
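It can also help to check what getline returns, since it returns 1 when a line was read, 0 at end of input and -1 on error, and in the last two cases the variable is left untouched. Below is a minimal sketch of that check combined with close(). The function name transform_column is made up for illustration, and cat stands in for transformExternalIp, as in the question.

```shell
# Hypothetical helper: rewrites column 2 via an external command.
# `cat` is a stand-in for transformExternalIp (an assumption for this sketch).
transform_column() {
    awk '{
        cmd = "echo " $2 " | cat"
        # getline: 1 = line read, 0 = end of input, -1 = error.
        # On 0 or -1 the variable "output" keeps its previous value.
        if ((cmd | getline output) > 0)
            $2 = output
        else
            print "getline failed for: " cmd > "/dev/stderr"
        close(cmd)    # release the pipe so the next line starts a fresh command
        print
    }'
}

printf '%s\n' '13:46:48 1.2.3.4:57 user1' '13:46:49 5.6.7.8:58 user2' | transform_column
```

With the check in place, a failed call shows up as a message on stderr instead of silently reusing the previous line's value.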
During my testing with a dummy script (echo $RANDOM; sleep .1) I could generate results similar to OP's ... some good/expected lines and then a bunch of duplicates.
I noticed that as soon as the duplicates started occurring, the dummy script wasn't actually being called any more and instead awk was treating the system call as a static result (i.e., it kept re-using the value from the last 'good' call); it was quite noticeable because the sleep .1 was no longer being called, so the output from the awk script sped up significantly.
Can't say that I understand 100% what's happening under the covers ... perhaps an issue with how the script (my dummy script; OP's transformExternalIp) behaves with multiple lines of input when expecting one line of input ... or an issue with a limit on the number of open/active process handles ... shrug
CodePudding user response:
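("echo " $2 " | cat") starts a new process (a fork) every time the command string is one that awk hasn't seen before, and the pipe to it stays open until you close it.
Then, when the open pipes reach some kind of limit (on forks or open file handles), getline fails and the output variable isn't updated anymore; that's what's happening here.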
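The stale value is easy to reproduce on a small scale. The sketch below is only an illustration (the function name demo_stale and the three-line input are made up). It relies on a second way getline can stop updating the variable: when the same command string comes up again without close(), awk reuses the pipe that is already open and already at end of input, so getline returns 0. Whether OP's data hits this case, the open-pipe limit, or both is an assumption.

```shell
# Hypothetical demo: the third input line repeats the first one.
demo_stale() {
    printf '%s\n' 'a' 'b' 'a' |
    awk '{
        cmd = "echo " $1
        rc = (cmd | getline output)   # deliberately no close(cmd)
        print $1, rc, output          # input, getline return code, variable
    }'
}

demo_stale
# a 1 a
# b 1 b
# a 0 b    <- getline returned 0, "output" still holds the previous value
```

Adding close(cmd) after the getline call makes the third line print a 1 a again.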
If you're using GNU awk then you can fix the issue with a coprocess:
awk '
BEGIN { cmd = "cat" }
{
print $2 |& cmd
cmd |& getline output
print $2 " - " output
}
' connections.txt