Home > OS >  Convert JSON to CSV/TSV
Convert JSON to CSV/TSV

Time:07-15

I'm trying to convert this data (https://rest.kegg.jp/get/br:ko00001/json) in JSON to CSV/TSV. I've been able to do so in awk and sed but I'm learning Perl for bigger projects so it'd be helpful to learn to do it without the JSON module.

sed -E 's/^\t{2}"name"/\t\t"level 1"/g;s/^\t{3}"name"/\t\t\t"level 2"/g;s/^\t{4}"name"/\t\t\t\t"level 3"/g;s/^\t{5}"name"/\t\t\t\t\t"level 4"/g' json.json | awk 'BEGIN {OFS="\t"} NR > 4 {match($0, /"([^"] )": *("[^"]*")/, a)} {tag = a[1]; val = gensub(/^"|"$/, "", "g", a[2]); f[tag] = val; if (tag == "level 4") {print f["level 1"], f["level 2"], f["level 3"], f["level 4"]}}' > table.tsv

Above is how I made it by awk and sed. json.json is downloaded from the link.

Here is what I've been trying so far in Perl without the JSON module. I'd like to do it this way to learn about the data structure and how Perl works.

use strict;

my $brite_hierarchy_filepath = shift @ARGV;

open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};

while (my $line = <$brite_hierarchy>) {
    next if $. == 4;
    chomp $line;

    $line =~ s/\A\t{2}"name"/"level_1"/;           
    $line =~ s/\A\t{3}"name"/"level_2"/;         
    $line =~ s/\A\t{4}"name"/"level_3"/;
    $line =~ s/\A\t{5}"name"/"level_4"/;

    my ($tag) = $line =~ /\A"(.*?)"/; 
    my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
    my %field = ($tag => $value) unless $tag eq "" && $value eq "";

    for (keys %field) {
        print join("\t", $field{"level_1"}, $field{"level_2"}, $field{"level_3"}, $field{"level_4"}, "\n");
    };
    last if eof $brite_hierarchy;
};

This is how the data looks like briefly.

    {
        "name":"ko00001",
        "children":[
        {
            "name":"09100 Metabolism",
            "children":[
            {
                "name":"09101 Carbohydrate metabolism",
                "children":[
                {
                    "name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
                    "children":[
                    {
                        "name":"K00844  HK; hexokinase [EC:2.7.1.1]"
                    },
                    {
                        "name":"K12407  GCK; glucokinase [EC:2.7.1.2]"
                    },
                    {
                        "name":"K00845  glk; glucokinase [EC:2.7.1.2]"
...

And the desired output in TSV format.

09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K00844  HK; hexokinase [EC:2.7.1.1]
09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K12407  GCK; glucokinase [EC:2.7.1.2]
09100 Metabolism    09101 Carbohydrate metabolism   00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]  K00845  glk; glucokinase [EC:2.7.1.2]

CodePudding user response:

I would always suggest a JSON parser, but you can indeed treat this as just a fixed text file if you can guarantee that format never changes. In production, you usually can't. But if it's a one-off, then it certainly works.

The example input you pasted into the question has spaces, not tabs, so your code would not work on it. Neither would mine. My input is copied from your link, and has tabs.

Your regex patterns seem a bit complicated. You can always have the same trivial pattern, but just need to vary the number of tabs before each name. The trick is to skip to the next line whenever you find a name that is not the final column, and to reset the entire structure on the first column. I chose to use an array rather than a hash, as that makes more sense and we can just join later when we output. Finally, say is like print but with a built-in newline.

use strict;
use warnings;
use feature 'say';

my @names;
while (<DATA>) {
    if ( m/^\t"name":"(. )"/) {
        undef @names;
        $names[0] = $1;
        next;
    }
    if (m/^\t\t"name":"(. )"/) {
        $names[1] = $1;
        next;
    }
    if (m/^\t\t\t"name":"(. )"/) {
        $names[2] = $1;
        next;
    }
    if (m/^\t\t\t\t"name":"(. )"/) {
        $names[3] = $1;
        next;
    }
    if (m/^\t\t\t\t\t"name":"(. )"/) {
        $names[4] = $1;
        say join "\t", @names;
    }
}

__DATA__
{
    "name":"ko00001",
    "children":[
    {
        "name":"09100 Metabolism",
        "children":[
        {
            "name":"09101 Carbohydrate metabolism",
            "children":[
            {
                "name":"00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]",
                "children":[
                {
                    "name":"K00844  HK; hexokinase [EC:2.7.1.1]"
                },
                {
                    "name":"K12407  GCK; glucokinase [EC:2.7.1.2]"
                },

CodePudding user response:

Although the code doesn't look very clean, I manage to create the table in TSV format perfectly like the one produced by sed and awk.

Thanks all for the info on using the module JSON, but with this way, I learn a bit more about using the variable outside of the loop block, we can store it for the next turn in the loop.

use strict;

my $brite_hierarchy_filepath = shift @ARGV;

open my $brite_hierarchy, '<:utf8', $brite_hierarchy_filepath or die q{Can't open $brite_hierarchy_filepath: $!\n};

my $previous_1;
my $previous_2;
my $previous_3;

while (my $line = <$brite_hierarchy>) {
    next if $. == 4;
    chomp $line;
    
    # change accordingly to the hierarchical levels
    $line =~ s/\A\t{2}"name"/"level_1"/;           
    $line =~ s/\A\t{3}"name"/"level_2"/;         
    $line =~ s/\A\t{4}"name"/"level_3"/;
    $line =~ s/\A\t{5}"name"/"level_4"/;

    # find the categories and put them into a hash
    my ($tag) = $line =~ /\A"(.*?)"/; 
    my ($value) = $line =~ /\A"level_[1-4]":"(.*?)"/;
    my %field = ($tag => $value) unless $tag eq "" && $value eq "";

    for (keys %field) {
        $previous_1 = $field{"level_1"} if $_ eq "level_1" && $field{"level_1"} ne "";
        $previous_2 = $field{"level_2"} if $_ eq "level_2" && $field{"level_2"} ne "";
        $previous_3 = $field{"level_3"} if $_ eq "level_3" && $field{"level_3"} ne "";
        print join("\t", $previous_1, $previous_2, $previous_3, $field{"level_4"}, "\n") unless $field{"level_4"} eq "";
    };
    last if eof $brite_hierarchy;
};
  • Related