I'm working with GWAS data. Need help.
My data looks like this:
IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.
As above I have a total of 512 rows and 2 Million columns.
Desired output:
SNP, Genotyping
rs098083, {
CC : [ 1, 63, 6, 18, 33, ...],
CT : [ 2, 54, 6, 7, 8, ...],
TT : [ 4, 9, 12, 13, ...],
AA : [86, 124, 4, 19, ...],
AT : [8, 98, 34, 74, ....],
.
.
.
}
kgp794789, {
CC : [ 1, 63, 6, 18, 33, ...],
CT : [ 2, 5, 6, 7, 8, ...],
TT : [ 4, 9, 12, 13, ...],
AA : [86, 124, 4, 19, ...],
AT : [8, 98, 34, 74, ....],
.
.
.
}
rs09848309, {
CC : [ 1, 63, 6, 18, 3, ...],
CT : [ 2, 5, 6, 7, 8, ...],
TT : [ 4, 9, 24 13, ...],
AA : [86, 134, 4, 19, ...],
AT : [8, 48, 34, 44, ....],
.
.
.
As above after pivoting, I should have a JSON file of 2 million rows & 2 Columns. The SNP
column of the row contains the ID of the SNP. The genotyping
column will contain a JSON BLOB. This BLOB will be a set of key-value pairs. The key is a particular genotype (e.g., CC, CT, TT, ....) and the value is a list of the IIDs with a genotype matching the key.
CodePudding user response:
Here's an approach using stedolan/jq:
jq -Rrn '
[ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
| .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] = [$t[1]]) | @text]
| join(", ")
'
rs098083, {"CC":["63"],"AT":["54"],"TT":["12"]}
kgp794789, {"AG":["63"],"CT":["54"],"GA":["12"]}
rs09848309, {"GA":["63"],"TT":["54"],"AG":["12"]}
kgp8300747, {"AA":["63","12"],"AG":["54"]}
Add tonumber
if the IDs should be encoded as JSON numbers
jq -Rrn '
[ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
| .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] = [$t[1]]) | @text]
| join(", ")
'
rs098083, {"CC":[63],"AT":[54],"TT":[12]}
kgp794789, {"AG":[63],"CT":[54],"GA":[12]}
rs09848309, {"GA":[63],"TT":[54],"AG":[12]}
kgp8300747, {"AA":[63,12],"AG":[54]}
If your ultimate goal is to have a JSON representation anyways, omit formatting the raw output, and something like this might do:
jq -Rn '
[ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
{}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
{}; .[$i[0]] = [$i[1]]
)
)
'
{
"rs098083": { "CC": ["63"], "AT": ["54"], "TT": ["12"] },
"kgp794789": { "AG": ["63"], "CT": ["54"], "GA": ["12"] },
"rs09848309": { "GA": ["63"], "TT": ["54"], "AG": ["12"] },
"kgp8300747": { "AA": ["63", "12"], "AG": ["54"] }
}
Demo (formatted manually for easier comaprison with previous solutions)
CodePudding user response:
This is a request for clarification
What's the desired output?
I ask this because the one that you provided is an invalid CSV containing invalid JSON, so it's difficult to believe that you really want something like this.
Problem with the JSON:
- this is invalid:
{ CC : [1,2] }
- The keys need double-quotes:
{ "CC" : [1,2] }
Problem with the CSV:
- this is invalid:
rs0993,{
"CC": [1,2],
"CT": [3]
}
- The second column contains newlines and/or commas and/or double-quotes so it shall be escaped with the CSV rules:
rs0993,"{
""CC"": [1,2],
""CT"": [3]
}"
Plausible desired outputs:
- A CSV with embedded JSON:
SNP,Genotyping
rs098083,"{""CC"": [63], ""AT"": [54], ""TT"": [12]}"
kgp794789,"{""AG"": [63], ""CT"": [54], ""GA"": [12]}"
- A JSON array of arrays:
[
["SNP", "Genotyping"],
["rs098083", {"CC": [63], "AT": [54], "TT": [12]}],
["kgp794789", {"AG": [63], "CT": [54], "GA": [12]}]
]
- A JSON array of objects:
[
{"SNP": "rs098083", "Genotyping": {"CC": [63], "AT": [54], "TT": [12]}},
{"SNP": "kgp794789", "Genotyping": {"AG": [63], "CT": [54], "GA": [12]}}
]
With that in mind, make sure that the input and output that you provided are really what you have/want; please edit your question if that's not the case.