Home > Blockchain >  Get comma separated list of column values based on value in another column
Get comma separated list of column values based on value in another column

Time:06-23

I want to get a comma-separated list of all of the values in certain columns (2,4,5) based on the value in column 1 of a tab-delimited file.

I was working with adapting the command below, but instead it is going to give me a list of all the values in the column, not just the one for each person - and I'm not sure how to do that.

awk -F"\t" '{print $2}' $i | sed -z 's/\n/,/g;s/,$/\n/'

This is what I am working with

Bob     24      M       apples  red
Bob     12      M       apples  green
Linda   56      F       apples  red
Linda   102     F       bananas yellow

And this is what I would like to get (I want to keep duplicates and the order)

Bob     24,12   M       apples,apples   red,green
Linda   56,102  F       apples,bananas  red,yellow

CodePudding user response:

Assumptions:

  • for duplicate names the gender will always be the same otherwise save the 'last' one seen

One awk idea:

awk '
BEGIN { FS=OFS="\t" }
      {   nums[$1] = nums[$1]   sep[$1] $2
        gender[$1] = $3
        fruits[$1] = fruits[$1] sep[$1] $4
        colors[$1] = colors[$1] sep[$1] $5
           sep[$1] = ","
      }
END   { # PROCINFO["sorted_in"]="@ind_str_asc"      # this line requires GNU awk
        for (name in nums)
            print name,nums[name],gender[name],fruits[name],colors[name]
      }
' input.tsv   

This generates:

Bob     24,12   M       apples,apples   red,green
Linda   56,102  F       apples,bananas  red,yellow

NOTE: this just happens to display the output in Name order; if ordering (by Name) needs to be guaranteed OP can run the output through sort or if using GNU awk then uncomment the PROCINFO["sorted_in"] line

CodePudding user response:

You never need sed when you're using awk.

Assuming your key values (first fields) are grouped as shown in your example (if not then sort the file first) then without reading the whole file into memory and for any number of input fields (you just have to identify which field numbers don't accumulate values, i.e. fields 1 and 3 in this case) you can do:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
$1 != vals[1] {
    if ( NR>1 ) {
        prt()
    }
    delete vals
}
{
    for ( i=1; i<=NF; i   ) {
        pre = ( (i in vals) && (i !~ /^[13]$/) ? vals[i] "," : "" )
        vals[i] = pre $i
    }
}
END { prt() }

function prt(    i) {
    for ( i=1; i<=NF; i   ) {
        printf "%s%s", vals[i], (i<NF ? OFS : ORS)
    }
}

$ awk -f tst.awk file
Bob     24,12   M       apples,apples   red,green
Linda   56,102  F       apples,bananas  red,yellow
  • Related