Home > Software engineering >  Combine log lines with awk
Combine log lines with awk

Time:05-10

I have a log file that simplified looks like this (it has enough columns so that direct addressing of the columns is not feasible):

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user

Most of the fields appear only once by id but not all of them (see time, for example).

What I want to achieve is to have one line per id that combines the columns of all records with the same id:

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user

For the columns that appear more than once in each id, I don't care which one is returned as long as it relates to a record with the same id.

CodePudding user response:

I would exploit GNU AWK 2D arrays following way, let file.txt content be

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user

then

awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i =1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt

output

id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user

Explanation: Firstly I inform GNU AWK that both field separator (FS) and output field separator (OFS) is ,, I use cols variable for holding information how many columns you wish to have. First row I simply print, for following rows for each column I check if there is already some truthy value in arr[id][number of field] using so-called ternary operator if yes I use it otherwise I set value to current field. In END I use nested for loops, for each id I do set value of its field in current line, so GNU AWK build string from these which I can print. Disclaimer: this solution assumes number of columns is equal in all lines and number of columns is known a priori and any order of output is acceptable. If this does not hold then develop own superior solution.

(tested in gawk 4.2.1)

CodePudding user response:

You can use the ruby csv parser to group then reduce the repeated entries:

ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>","})
puts data[0].to_csv
data[1..].group_by { |row| row[0] }.
each{ |k, arr| 
   puts arr.transpose().map{ |ta| ta.find { |x| !x.nil? }}.to_csv
}   
' file  

Prints:

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user

This assumes the valid data is the first non-nil, nonblank encountered for that particular column.

  • Related