How to format bash SED/AWK/Perl output for further processing

I have some text file data that I am parsing with SED, AWK and Perl.

product {
    name { thing1 }
    customers {        
        mary { }
        freddy { }
        bob {
            spouse betty
        }
    }
}

From the "customers" section, I am trying to get output similar to:

mary{ }
freddy{ }
bob{spouse betty}

Using: sed -n -e "/customers {/,/}/{/customers {/d;/}/d;p;}" $file

This is the output:

mary { }
freddy { }
bob {
    spouse betty
}

How can I join the "bob" customer onto one line and remove the extra spaces? The main reason for this specific output is that I am writing a script to grab the "customers" field and other fields from the text file and write them to a CSV file, which will look something like this. I know this would probably be easier in another language, but bash is what I know.

output.csv
product,customers,another_column
thing1,mary{ } freddy{ } bob{spouse betty},something_else

CodePudding user response：

The data happens to have valid Tcl list syntax:

set f [open "input.file"]
set data [dict create {*}[read $f]]
close $f

set name [string trim [dict get $data product name]]
dict for {key val} [dict get $data product customers] {
    lappend customers [format "%s{%s}" $key [string trim $val]]
}

set f [open "output.csv" w]
puts $f "product,customers,another_column"
puts $f [join [list $name [join $customers] "something_else"] ,]
close $f

creates output.csv with

product,customers,another_column
thing1,mary{} freddy{} bob{spouse betty},something_else

CodePudding user response：

With your shown samples only, you could try the following GNU awk code. It can all be done in a single GNU awk program, so there is no need to pipe your sed command's output into another tool; just pass your Input_file to the awk program.

1st solution: To get the output from the customers section up to its closing brace, with no leading spaces before the values, try the following GNU awk solution.

awk -v RS='\n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+}' '
RT{
  sub(/^\n[[:space:]]+[^ ]* {[[:space:]]*\n/,"",RT)
  sub(/\n[[:space:]]+}/,"",RT)
  match(RT,/(.*{)[[:space:]]*([^\n]*)(.*)/,arr)
  sub(/^[[:space:]]+/,"",arr[1])
  sub(/\n/,"",arr[2])
  gsub(/\n|^[[:space:]]+/,"",arr[3])
  gsub(/\n[[:space:]]+/,"\n",arr[1])
  gsub(/ {/,"{",arr[1])
  print arr[1] arr[2] arr[3]
}
'   Input_file

Output will be as follows:

mary{ }
freddy{ }
bob{spouse betty}

2nd solution: To keep the leading spaces before the values, try the following code.

awk -v RS='\n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+}' '
RT{
  sub(/^\n[[:space:]]+[^ ]* {[[:space:]]*\n/,"",RT)
  sub(/\n[[:space:]]+}/,"",RT)
  match(RT,/(.*{)[[:space:]]*([^\n]*)(.*)/,arr)
  sub(/\n/,"",arr[2])
  gsub(/\n|^[[:space:]]+/,"",arr[3])
  print arr[1] arr[2] arr[3]
}
'   Input_file

Output will be as follows:

        mary { }
        freddy { }
        bob {spouse betty}

Explanation: In GNU awk, RS (the record separator) is set to the regex \n[[:space:]]+customers {[[:space:]]*.*\n[[:space:]]+} so that it matches only the customers block; the text matched by RS is then available in RT. In the main block, sub() removes the parts of RT that are not needed (the "customers {" header line and a closing brace). match() is then called with the regex (.*{)[[:space:]]*([^\n]*)(.*), whose 3 capturing groups are stored in an array named arr. Finally, the newlines and extra spaces are removed from those array elements, and the three elements are printed together.
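For comparison, the same two steps (isolate the customers block, then collapse each customer onto one line) can also be sketched without GNU-specific features such as RT, by reading line by line and tracking brace depth. This is not the method used in the answer above, only a minimal sketch under assumptions: the function name customers_oneline is hypothetical, "customers {" is assumed to sit on its own line, and braces are assumed to be balanced as in the sample.

```shell
# Hypothetical helper: print each entry of the "customers" block on one line.
# Assumes the layout shown in the question (balanced braces, one block).
customers_oneline() {
    awk '
    # Header line of the customers block: start collecting after it.
    !inside && /^[ \t]*customers[ \t]*[{][ \t]*$/ { inside = 1; depth = 0; buf = ""; next }
    inside {
        line = $0
        gsub(/^[ \t]+|[ \t]+$/, "", line)          # trim leading/trailing blanks
        if (line == "") next
        # A closing brace at depth 0 ends the customers block itself.
        if (depth == 0 && line ~ /^[}]/) { inside = 0; next }
        gsub(/[ \t]+[{]/, "{", line)               # "bob {" becomes "bob{"
        tmp = line
        depth += gsub(/[{]/, "", tmp)              # count opening braces
        depth -= gsub(/[}]/, "", tmp)              # count closing braces
        # Join continuation lines; no space right after "{" or before "}".
        if (buf == "" || buf ~ /[{]$/ || line ~ /^[}]/) buf = buf line
        else buf = buf " " line
        if (depth == 0) { print buf; buf = "" }    # entry complete
    }
    ' "$1"
}
```

It would be called as customers_oneline Input_file. Because it only uses gsub, regex matching and plain variables, it should not depend on GNU awk, at the cost of being tied to the line layout described above.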

CodePudding user response：

Maybe ed

ed -s file.txt <<-'EOF'
  %s/^[[:space:]]*//
  ?{?;/^}/j
  %s/^\([^\{]*\) \(.*\)$/\1\2 /
  /^customers/+1;/^}/-1j
  s/^/thing1,/
  s/ *$/,something_else/
  p
  Q
EOF

With a temp file, it is a bit easier to write to a new file.

ed -s file.txt <<-'EOF'
  %s/^[[:space:]]*//
  /customers {/+1;/^[[:space:]]*}/w out.txt
  %d
  r out.txt
  ?{?;/^}/j
  %s/^\([^\{]*\) \(.*\)$/\1\2 /
  %j
  s/^/thing1,/
  s/ *$/,something_else/
  0a
product,customers,another_column
.
  w output.csv
  ,p
  Q
EOF

The latter creates two files, out.txt and output.csv
Remove the ,p if stdout output is not required.

CodePudding user response：

Edit: see the end for producing the complete output.

Here is a regex for it, probably in just about any language, run on the whole file in a string. This, as it stands, assumes that there can only be one level of nesting under a customer, in other words bob cannot have { pets { dog } } or some such.

Extract content of customers section

/customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/x;

then collapse newline spaces into a single space

s/\n\s+/ /g;

then trim spaces from strings like bob { spouse }, but not from mary { }

s/{\s+ ([^}]+) \s+}/{$1}/gx;

If bob and the crew can really be only word-characters then instead of [^{}] we can use the far nicer \w.

Altogether, in a Perl command-line program ("one"-liner) as seems to be desired

perl -wE'die"file?\n" if not @ARGV;
    $d = do { local $/; <> };
    ($c) = $d =~ /customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/x;
    $c =~ s/\n\s+/ /g;
    $c =~ s/{\s+ ([^}]+) \s+}/{$1}/gx;
    say $c
' data.txt

Prints, for data given in the question

mary { } freddy { } bob {spouse betty}

To print each customer on a separate line, one can do for example

say for split /(?<=\})\s+/, $c;

(to be the last line in code)

I now realize that there is more to capture and print, as described in the last paragraph of the question. Adding to the beginning of the regex to capture the name, and adding the required printing:

perl -wE'die"file?\n" if not @ARGV;
    $d = do { local $/; <> };
    ($n, $c) = $d =~ /name\s*{\s* ([^}]+) \s*} .*?  customers\s*{\s* ( (?: [^{]+ {[^}]*} )+ )/sx;
    $n =~ s/^\s+|\s+$//g;
    $c =~ s/\n\s+/ /g;
    $c =~ s/{\s+ ([^}]+) \s+}/{$1}/gx;
    say "product,customers,another_column";
    say "$n,$c,something_else"
' data.txt > output.csv

Prints as shown in the question, except that the space before each { is kept (mary { } rather than mary{ }).

CodePudding user response：

The following code sample demonstrates a very primitive parser for the provided sample data.

This code restores the data structure, which can then be used in any imaginable way, for example stored as a CSV, JSON, or YAML file.

In real life the input data can be quite different, and this code probably will not process it correctly.

The code is provided for educational purposes only.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $data = do { local $/; <DATA> };

$data =~ s/\n/ /g;
$data =~ s/ +/ /g;

say Dumper parse($data);

exit 0;

sub parse {
    my $str  = shift;   
    my $ret;

    while( $str =~ /^(\S+) \{ (\S+) \{ \S+/ ) {
        # nested entry, e.g. "product { customers { bob { spouse betty }"
        if( $str =~ /^(\S+) \{ (\S+) \{ ([^}]+?) \{(.+?)\}/ ) {
            $ret->{$1}{$2}{$3} = $4;
            $ret->{$1}{$2}{$3} =~ s/(^\s+|\s+$)//g;
            $str =~ s/^(\S+) \{ (\S+) \{(.+?)\{(.*?)\}/$1 \{ $2 \{/;
        }
        # flat entry, e.g. "product { name { thing1 }"
        if( $str =~ /^(\S+) \{ (\S+) \{\s*([^{]+?)\s*\}/ ) {
            $ret->{$1}{$2} = $3 if length($3) > 1;
            $str =~ s/^(\S+) \{ \S+ \{\s*[^\}]+\s*\}/$1 \{/;
        }
    }
    
    return $ret;
}

__DATA__
product {
    name { thing1 }
    customers {        
        mary { }
        freddy { }
        bob {
            spouse betty
        }
    }
}

Output

$VAR1 = {
          'product' => {
                         'customers' => {
                                          'bob' => 'spouse betty',
                                          'freddy' => '',
                                          'mary' => ''
                                        },
                         'name' => 'thing1'
                       }
        };