Rearranging a list to get the 2nd column entries as rows

I have a list of labels, each associated with one or more strings, as follows:

A   string1^description1`string2^description2`string3^description3
B   string4^description4
C   string1^description1`string5^description5`string3^description3
D   .
E   string6^description6`string1^description1
F   string7^description7
G   string1^description1`string4^description4`string5^description5

I would like to swap the first and second columns, so that each string in the 2nd column becomes a row of its own, followed by the labels from the 1st column that it appeared with, as follows:

string1^description1    A   C   E   G
string2^description2    A
string3^description3    A   C
string4^description4    B   G
string5^description5    C   G
string6^description6    E
string7^description7    F

I have struggled with this and can't come up with anything. I am new to scripting.

CodePudding user response：

from collections import defaultdict
data = '''A   string1^description1`string2^description2`string3^description3
B   string4^description4
C   string1^description1`string5^description5`string3^description3
D   .
E   string6^description6`string1^description1
F   string7^description7
G   string1^description1`string4^description4`string5^description5'''

d = defaultdict(list)
for line in data.split('\n'):  # split the input data into lines
    char, info = line.split()  # in each line get the char and info
    for desc in info.split('`'):  # get the categories separated by `
        if desc == '.':        # skip placeholder rows like line D, which has no data
            continue
        d[desc].append(char)

for k, v in d.items():
    print(f"{k} {' '.join(v)}")

Output:

string1^description1 A C E G
string2^description2 A
string3^description3 A C
string4^description4 B G
string5^description5 C G
string6^description6 E
string7^description7 F
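
Note that the order above is simply the order in which each string was first seen (dicts keep insertion order in Python 3.7+), which happens to be sorted for this input. If you want sorted rows and wider spacing like in the question, here is a minimal sketch. The name format_rows and the four-space separator are my own assumptions, not part of the answer above; the original data may well have used tabs.

```python
def format_rows(mapping, sep="    "):
    """Return one output line per key, keys sorted, fields joined by `sep`."""
    return [sep.join([key, *values]) for key, values in sorted(mapping.items())]


# Small hypothetical mapping standing in for the defaultdict `d` built above.
example = {
    "string4^description4": ["B", "G"],
    "string1^description1": ["A", "C", "E", "G"],
}
for row in format_rows(example):
    print(row)
```

Passing sep="\t" instead gives tab-separated output, which is easier to feed into other tools.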

CodePudding user response：

An AWK solution:

#! /usr/bin/env bash

INPUT_FILE="$1"

awk \
'
BEGIN {
    FS=" "
}
{
    key=$1
    $1=""
    gsub(/^ */, "")
    n=split($0, a, /`/)
    for (i=1; i<=n; i++) {
        if (a[i] != ".") {
            hash[a[i]]=hash[a[i]] "   " key
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: iterate keys in sorted order
    for (elem in hash) {
        print elem " " hash[elem]
    }
}
' \
< "${INPUT_FILE}"

Output:

string1^description1    A   C   E   G
string2^description2    A
string3^description3    A   C
string4^description4    B   G
string5^description5    C   G
string6^description6    E
string7^description7    F

CodePudding user response：

Since we seem to be iterating over the tags, here's a Perl solution:

#!/usr/bin/env perl
use v5.10;
my %labels;
while (<>) {
  chomp;
  my ($label, $rest) = split ' ',$_,2;
  foreach my $key (split '`', $rest) {
    push @{$labels{$key}}, $label unless $key eq '.'
  }
}

foreach my $key (sort keys %labels) {
  say "$key\t", join("\t", @{$labels{$key}});
}

But all of these have the same idea. You split the lines on whitespace to separate the initial letter from the string/description pairs, then split those pairs on backtick to extract the individual ones, which become the keys in an associative array (dict/hash) whose values are lists of the letters on whose lines that string/description pair was seen.
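
To make that shared idea concrete, here is a minimal Python sketch of it as a reusable function. The names invert_columns and placeholder are hypothetical (they do not appear in any of the answers above), and it assumes the label never contains whitespace and that "." is the only marker for "no data".

```python
from collections import defaultdict


def invert_columns(lines, placeholder="."):
    """Map each backtick-separated entry to the labels of the lines it appears on."""
    labels_by_entry = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # label, then the rest of the line
        if len(parts) < 2:
            continue  # blank line, or a label with nothing after it
        label, rest = parts
        for entry in rest.strip().split("`"):
            if entry and entry != placeholder:
                labels_by_entry[entry].append(label)
    return dict(labels_by_entry)


sample = [
    "A   string1^description1`string2^description2",
    "B   .",
    "C   string1^description1",
]
for entry, labels in sorted(invert_columns(sample).items()):
    print(entry, *labels, sep="    ")
```

Because the function takes any iterable of lines, you can also pass it an open file object instead of a list.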