A sed script that inserts spaces between fields in a fixed-length record without delimiters?

I have records like this

000000000011111111112222222222333444555666777888999aaabbbcccdddeee

where there are no delimiters, and I want to read this into a bash script array. If there were delimiters I could just say

IFS='|' record=($line)

and that would be the end of it. But when the fields are fully populated there are no delimiters.

So I thought I'd make a quick sed script

IFS=' ' record=( $( echo "$line" | sed 's/\(...\)/\1 /g' ) )

which would -- in this case -- put a space delimiter between equal length fields of 3 characters.

But my field widths vary!

IFS=' ' record=( $( echo "$line" | sed 's/\(.\{10\}\)\(.\{10\}\)\(.\{10\}\)\(.\{3\}\)\(.\{3\}\)\(.\{3\}\)/\1 \2 \3 \4 \5 \6 /g' ) )

easy! But not possible in my case because I have more than 9 fields!

I guess there must be some way with sed to control the s/.../.../g behavior so that you could do 3 x width of 10 and then 10 times width of 3, or whatever the field lengths are. But I just never really remember how this is done with sed and the man page is notoriously uneducative.

I figure I could do it by making a loop and then read each field with read -n $width and build up my own array. This is what I'll do, but I'd prefer some one-shot way of doing it. Would be really easy if there was a scanf(1) command available in the shell environment just like there is a printf(1) command, or if bash read -a had an -n 10,10,10,3,3,3* format string or something like that.
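
A minimal sketch of that loop, assuming the sample record above and the widths it implies (three fields of 10, then twelve of 3). The names widths, field and record are only illustrative:

```bash
line=000000000011111111112222222222333444555666777888999aaabbbcccdddeee
widths=(10 10 10 3 3 3 3 3 3 3 3 3 3 3 3)   # assumed widths for the sample record

record=()
for w in "${widths[@]}"; do
    # IFS= keeps blanks inside a field, -r keeps backslashes,
    # -n reads at most $w characters from the here-string below
    IFS= read -r -n "$w" field
    record+=("$field")
done <<< "$line"

printf '%s fields, fourth is %s\n' "${#record[@]}" "${record[3]}"
```

If a line is shorter than the widths add up to, the later reads return whatever is left (possibly nothing), so the array still gets one entry per width.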

CodePudding user response：

Since you want to create a bash array, using an external tool like sed is suboptimal. You'd have to parse the input twice, first using the external tool, and then bash. It is safer, more efficient, and probably easier to do everything in bash.

Bash's built-in [[ can match regexes using =~. The matched groups are stored in the array BASH_REMATCH:

printf -v regex '(.{0,%s})' 10 10 10 3 3 3 3
line=000000000011111111112222222222333444555666777888999aaabbbcccdddeee
[[ "$line" =~ ^$regex ]]
fields=("${BASH_REMATCH[@]:1}")

The example lists only the first seven widths. Characters beyond the listed fields are ignored, and if the line is shorter than the listed widths, the trailing array entries come out partially or completely empty. But you can adapt this to your needs.
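
One possible adaptation, sketched under the assumption that every record must match the layout exactly: use {N} instead of {0,N}, list all 15 widths, and anchor the end of the pattern so that short or long lines are rejected. The widths array and the error message are illustrative, not part of the original snippet:

```bash
widths=(10 10 10 3 3 3 3 3 3 3 3 3 3 3 3)   # assumed layout of the sample record
printf -v regex '(.{%s})' "${widths[@]}"     # printf repeats the format once per width
regex="^${regex}\$"                          # anchor both ends: exact length only

line=000000000011111111112222222222333444555666777888999aaabbbcccdddeee
if [[ $line =~ $regex ]]; then
    fields=("${BASH_REMATCH[@]:1}")          # element 0 is the whole match, so skip it
else
    echo "record does not match the expected layout" >&2
fi
```

BASH_REMATCH is not limited to nine groups, so the 15 fields of the sample record are not a problem here.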

CodePudding user response：

You could use Perl's unpack function.

For a small number of records, you could do (using your sample line):

IFS='|' record=($(perl -ple '$_=join"|",unpack"(a10)3(a3)12"' <<<"$line"))

Because that starts a new perl process for every line, with many records it is more efficient to run perl once over the whole input and read its output in a loop, along the lines of:

perl -ple '$_=join"|",unpack"(a10)3(a3)12"' inputfile |\
while IFS='|' read -ra record; do
    : process "${record[@]}"
done

(assumes the fixed-width records are delimited by newlines)

CodePudding user response：

One idea using GNU awk and FIELDWIDTHS to introduce delimiters:

x='000000000011111111112222222222333444555666777888999aaabbbcccdddeee'

awk '
BEGIN { OFS="|"                                              # define output delimiter
        FIELDWIDTHS = "10 10 10 3 3 3 3 3 3 3 3 3 3 3 3"     # define width of each field
      }
      { $1=$1                                                # force an evaluation so that fields are parsed
        print 
      }
' <<< "${x}"

This generates:

0000000000|1111111111|2222222222|333|444|555|666|777|888|999|aaa|bbb|ccc|ddd|eee

From here you can do what you want with the | delimited data (eg, read into an array).
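
For instance, a minimal sketch of reading one such line into a bash array. The variable names delimited and record are illustrative, and the string is simply the output shown above:

```bash
delimited='0000000000|1111111111|2222222222|333|444|555|666|777|888|999|aaa|bbb|ccc|ddd|eee'

# IFS is set only for this one read command, so the shell's global IFS is untouched
IFS='|' read -ra record <<< "$delimited"

printf '%s fields, eleventh is %s\n' "${#record[@]}" "${record[10]}"
```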

NOTES:

FIELDWIDTHS is just a variable, so to process inputs with different field layouts you could generalize this awk script and pass in a string that defines FIELDWIDTHS.
This example uses a single variable (x), but we could just as easily feed a file to the awk script to add delimiters to all input lines.
If this is required in several places, it should be easy enough to wrap it in a user-defined function.
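
A rough sketch of such a wrapper, assuming GNU awk is installed under the name gawk. The function name add_delims and its argument order are hypothetical:

```bash
# Hypothetical helper: add_delims "<widths>" [file...]
# Reads the named files (or stdin) and prints each record with | between fields.
add_delims() {
    local widths=$1
    shift
    # -v FIELDWIDTHS switches gawk to fixed-width splitting before any input is read;
    # $1 = $1 forces the record to be rebuilt with OFS
    gawk -v FIELDWIDTHS="$widths" -v OFS='|' '{ $1 = $1; print }' "$@"
}

out=$(add_delims '10 10 10 3 3 3 3 3 3 3 3 3 3 3 3' \
      <<< '000000000011111111112222222222333444555666777888999aaabbbcccdddeee')
IFS='|' read -ra record <<< "$out"
```

Passing the widths as the first argument keeps the layout out of the awk program, so the same function can serve several record formats.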

CodePudding user response：

If you work from right to left, this sed command should produce the expected results for the example data.

$ array=($(sed -E 's/(.{3})/\1 /g10;s/(.{10})/\1 /2;s/(.{10})/\1 /' input_file))

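s/(.{3})/\1 /g10 - This deals with the 12x3 widths first. In GNU sed, combining g with a number replaces the 10th match and every match after it, so a space is inserted after the 30th character and after every 3rd character from there on.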

s/(.{10})/\1 /2;s/(.{10})/\1 / - The remaining 30 characters at the start are then split into 3x10, again from right to left: first after the 20th character, then after the 10th.

Echoing the array created, the result is as follows.

$ echo ${array[@]}
0000000000 1111111111 2222222222 333 444 555 666 777 888 999 aaa bbb ccc ddd eee
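
To see why the order matters, the three substitutions can be run one at a time. This is only an illustrative sketch that assumes GNU sed (the g10 combination is not specified by POSIX); the variable names step1 to step3 are made up for the example:

```bash
line=000000000011111111112222222222333444555666777888999aaabbbcccdddeee

# Step 1: from the 10th three-character match onward, append a space
step1=$(sed -E 's/(.{3})/\1 /g10' <<< "$line")
# Step 2: append a space after the second ten-character block
step2=$(sed -E 's/(.{10})/\1 /2' <<< "$step1")
# Step 3: append a space after the first ten-character block
step3=$(sed -E 's/(.{10})/\1 /' <<< "$step2")

printf '%s\n' "$step1" "$step2" "$step3"
```

Each step leaves everything to its left untouched, so the character offsets used by the later steps are still valid. Inserting spaces from the left first would shift every offset that follows.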