Combine log lines with awk-CodePudding

I have a log file that simplified looks like this (it has enough columns so that direct addressing of the columns is not feasible):

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user

Most of the fields appear only once by id but not all of them (see time, for example).

What I want to achieve is to have one line per id that combines the columns of all records with the same id:

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user

For the columns that appear more than once in each id, I don't care which one is returned as long as it relates to a record with the same id.

CodePudding user response：

I would exploit GNU AWK 2D arrays following way, let file.txt content be

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user

then

awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i =1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt

output

id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user

Explanation: Firstly I inform GNU AWK that both field separator (FS) and output field separator (OFS) is ,, I use cols variable for holding information how many columns you wish to have. First row I simply print, for following rows for each column I check if there is already some truthy value in arr[id][number of field] using so-called ternary operator if yes I use it otherwise I set value to current field. In END I use nested for loops, for each id I do set value of its field in current line, so GNU AWK build string from these which I can print. Disclaimer: this solution assumes number of columns is equal in all lines and number of columns is known a priori and any order of output is acceptable. If this does not hold then develop own superior solution.

(tested in gawk 4.2.1)

CodePudding user response：

You can use the ruby csv parser to group then reduce the repeated entries:

ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>","})
puts data[0].to_csv
data[1..].group_by { |row| row[0] }.
each{ |k, arr| 
   puts arr.transpose().map{ |ta| ta.find { |x| !x.nil? }}.to_csv
}   
' file

Prints:

id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user

This assumes the valid data is the first non-nil, nonblank encountered for that particular column.