Pivotting the SNP Table , Converting CSV file to JSON using Bash-CodePudding

I'm working with GWAS data. Need help.

My data looks like this:

IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.

As above I have a total of 512 rows and 2 Million columns.

Desired output:

SNP, Genotyping
rs098083, {
CC : [ 1, 63, 6, 18, 33, ...],
CT : [ 2, 54, 6, 7, 8, ...],
TT : [ 4, 9, 12, 13, ...],
AA : [86, 124, 4, 19, ...],
AT : [8, 98, 34, 74, ....],
.
.
.
}     
kgp794789, {
CC : [ 1, 63, 6, 18, 33, ...],
CT : [ 2, 5, 6, 7, 8, ...],
TT : [ 4, 9, 12, 13, ...],
AA : [86, 124, 4, 19, ...],
AT : [8, 98, 34, 74, ....],
.
.
.

}
rs09848309, {
CC : [ 1, 63, 6, 18, 3, ...],
CT : [ 2, 5, 6, 7, 8, ...],
TT : [ 4, 9, 24 13, ...],
AA : [86, 134, 4, 19, ...],
AT : [8, 48, 34, 44, ....],
.
.
.

As above after pivoting, I should have a JSON file of 2 million rows & 2 Columns. The SNP column of the row contains the ID of the SNP. The genotyping column will contain a JSON BLOB. This BLOB will be a set of key-value pairs. The key is a particular genotype (e.g., CC, CT, TT, ....) and the value is a list of the IIDs with a genotype matching the key.

CodePudding user response：

Here's an approach using stedolan/jq:

jq -Rrn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]]  = [$t[1]]) | @text]
  | join(", ")
'

rs098083, {"CC":["63"],"AT":["54"],"TT":["12"]}
kgp794789, {"AG":["63"],"CT":["54"],"GA":["12"]}
rs09848309, {"GA":["63"],"TT":["54"],"AG":["12"]}
kgp8300747, {"AA":["63","12"],"AG":["54"]}

Demo

Add tonumber if the IDs should be encoded as JSON numbers

jq -Rrn '
  [ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
  | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]]  = [$t[1]]) | @text]
  | join(", ")
'

rs098083, {"CC":[63],"AT":[54],"TT":[12]}
kgp794789, {"AG":[63],"CT":[54],"GA":[12]}
rs09848309, {"GA":[63],"TT":[54],"AG":[12]}
kgp8300747, {"AA":[63,12],"AG":[54]}

Demo

If your ultimate goal is to have a JSON representation anyways, omit formatting the raw output, and something like this might do:

jq -Rn '
  [ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
    {}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
      {}; .[$i[0]]  = [$i[1]]
    )
  )
'

{
  "rs098083": { "CC": ["63"], "AT": ["54"], "TT": ["12"] },
  "kgp794789": { "AG": ["63"], "CT": ["54"], "GA": ["12"] },
  "rs09848309": { "GA": ["63"], "TT": ["54"], "AG": ["12"] },
  "kgp8300747": { "AA": ["63", "12"], "AG": ["54"] }
}

Demo (formatted manually for easier comaprison with previous solutions)

CodePudding user response：

^{This is a request for clarification}

What's the desired output?

I ask this because the one that you provided is an invalid CSV containing invalid JSON, so it's difficult to believe that you really want something like this.

Problem with the JSON:

this is invalid:

{ CC : [1,2] }

The keys need double-quotes:

{ "CC" : [1,2] }

Problem with the CSV:

this is invalid:

rs0993,{
"CC": [1,2],
"CT": [3]
}

The second column contains newlines and/or commas and/or double-quotes so it shall be escaped with the CSV rules:

rs0993,"{
""CC"": [1,2],
""CT"": [3]
}"

Plausible desired outputs:

A CSV with embedded JSON:

SNP,Genotyping
rs098083,"{""CC"": [63], ""AT"": [54], ""TT"": [12]}"
kgp794789,"{""AG"": [63], ""CT"": [54], ""GA"": [12]}"

A JSON array of arrays:

[
  ["SNP", "Genotyping"],
  ["rs098083", {"CC": [63], "AT": [54], "TT": [12]}],
  ["kgp794789", {"AG": [63], "CT": [54], "GA": [12]}]
]

A JSON array of objects:

[
  {"SNP": "rs098083", "Genotyping": {"CC": [63], "AT": [54], "TT": [12]}},
  {"SNP": "kgp794789", "Genotyping": {"AG": [63], "CT": [54], "GA": [12]}}
]

With that in mind, make sure that the input and output that you provided are really what you have/want; please edit your question if that's not the case.