I have a log file that simplified looks like this (it has enough columns so that direct addressing of the columns is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most of the fields appear only once by id
but not all of them (see time, for example).
What I want to achieve is to have one line per id
that combines the columns of all records with the same id
:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For the columns that appear more than once in each id
, I don't care which one is returned as long as it relates to a record with the same id
.
CodePudding user response:
I would exploit GNU AWK
2D arrays following way, let file.txt
content be
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i =1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: Firstly I inform GNU AWK
that both field separator (FS
) and output field separator (OFS
) is ,
, I use cols
variable for holding information how many columns you wish to have. First row I simply print
, for following rows for each column I check if there is already some truthy value in arr[id][number of field] using so-called ternary operator if yes I use it otherwise I set value to current field. In END
I use nested for loops, for each id I do set value of its field in current line, so GNU AWK
build string from these which I can print
. Disclaimer: this solution assumes number of columns is equal in all lines and number of columns is known a priori and any order of output is acceptable. If this does not hold then develop own superior solution.
(tested in gawk 4.2.1)
CodePudding user response:
You can use the ruby
csv parser to group then reduce the repeated entries:
ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>","})
puts data[0].to_csv
data[1..].group_by { |row| row[0] }.
each{ |k, arr|
puts arr.transpose().map{ |ta| ta.find { |x| !x.nil? }}.to_csv
}
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes the valid data is the first non-nil, nonblank encountered for that particular column.