Wrapping delimited lines, retaining first column, with minimum final length

Looking to split up lines of content, retaining a headword.

I do a ton of text processing, and I like to use unix one-liners because they are easy for me to organize over time (vs. tons of scripts), I can easily chain them together, and I like (re)learning how to use classic unix functions. Often I will use a short awk, perl, or ruby one-liner, depending on which is the most elegant.

Here I have lines with X number of comma-delimited items. I want to divide these up, retaining the headword.

INPUT:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab

OUTPUT:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

Algorithm details:

input lines consist of a headword, then equals-sign, then a comma delimited list of at least 1 item.
In this example, most words are singles, but words could contain spaces (e.g. "horseshoe crab" at the end)
Split is at 9 items, UNLESS fewer than 3 items would be left over for the final line, in which case they stay on the previous line (so the final split could yield 12 on a line)
There are multiple lines. e.g. the next line could be planets.

I had an idea to escape spaces, then use unix fold, and then awk to pull down the first column. This works exactly like the above:

echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _'  '=, '

But it only considers character length (not item count), and fails to consider my special case that I don't want <3 items hanging on the final line.

I think this is an elegant little puzzle, got ideas?

CodePudding user response：

You may consider this awk:

awk 'BEGIN {FS=OFS=" = "} {
   s = $2
   while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
      v = substr(s, RSTART, RLENGTH)
      sub(/, $/, "", v)
      print $1, v
      s = substr(s, RLENGTH + 1)
   }
}' file

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab

Pay special attention to the regex used here: /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/

It matches 1 to 9 words, each followed by the ", " delimiter. The optional part matches 3 more words at the very end of the line, so a tail of up to 3 words is kept on the same line instead of being wrapped.

CodePudding user response：

One awk idea:

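awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
      # start a new output line at the first item, and at every max-th item
      # after it as long as more than min items remain
      if ( i == 2 || ((i-1) % max == 1 && (NF-i+1 > min)) ) {
         if ( i > 2 ) print newline
         newline=$1 "="
         pfx=""
      }
      newline=newline pfx $i
      pfx=","
  }
  print newline
}
' raw.dat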

Sample data:

$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

With -v min=3 -v max=9 we get:

animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
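
To see why numbers stays on one line while numbers2 is split, it can help to look at the line sizes alone. The following is a minimal sketch, not part of the answer above: chunk_sizes is a hypothetical helper that applies the same test (start a new line only while more than min items would remain) to a bare item count. It assumes max is at least 1.

```shell
# chunk_sizes N [MAX] [MIN]  (hypothetical helper, for illustration only)
# Prints how many items land on each output line for a list of N items.
chunk_sizes() {
  awk -v n="$1" -v max="${2:-9}" -v min="${3:-3}" 'BEGIN {
    out = ""
    while (n > 0) {
      # take a full line only if more than min items would be left afterwards;
      # otherwise put everything that remains on this line
      take = (n - max > min) ? max : n
      out = out (out == "" ? "" : " ") take
      n -= take
    }
    print out
  }'
}
```

With the defaults, chunk_sizes 50 prints 9 9 9 9 9 5, chunk_sizes 12 prints 12, and chunk_sizes 13 prints 9 4, matching the animals, numbers and numbers2 rows above.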

CodePudding user response：

With your shown samples only, please try the following awk program. It was written and tested in GNU awk and should work in any awk.

Here I have created an awk variable named numberOfFields, which holds the number of fields you want to print per line (lines are separated by a newline, as in the shown samples).

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

OR: the code above splits the printf statement across 2 lines (for readability). If you want it on a single line, try the following:

awk  -v numberOfFields="9" '
BEGIN{
  FS=", ";OFS=", "
}
{
  line=$0
  sub(/ = .*/,"",line)
  sub(/^[^ ]* =[^ ]* /,"")
  for(i=1;i<=NF;i++){
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{
  print ""
}
'  Input_file

Explanation: a detailed, annotated version of the code above.

awk  -v numberOfFields="9" '            ##Starting awk program from here, creating variable named numberOfFields and setting its value to 9 here.
BEGIN{                                  ##Starting BEGIN section of awk here.
  FS=", ";OFS=", "                      ##Setting FS and OFS to comma space here.
}
{
  line=$0                               ##Copying the whole record ($0) into line.
  sub(/ = .*/,"",line)                  ##Removing everything from space = space to the end of line, leaving only the headword.
  sub(/^[^ ]* =[^ ]* /,"")              ##Removing the leading headword, space = space from the current record, leaving only the items.
  for(i=1;i<=NF;i++){                   ##Running a for loop over all fields.
    printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\  ##Using printf; its conditions are explained below the code.
    (i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
  }
}
END{                                    ##Starting END block of this program from here.
  print ""                              ##Printing newline here.
}
'  Input_file                           ##Mentioning Input_file name here.

Explanation of printf condition above:

(
  i%numberOfFields==0                   ##Checking whether the modulus i%numberOfFields is 0; if this is TRUE:
    ?OFS $i ORS line" = "               ##Then printing OFS $i ORS line" = " (comma space, field value, newline, line variable, space = space).
    :(i==1                              ##If the first condition is FALSE, checking whether i==1.
       ?line " = " $i                   ##Then printing the line variable followed by space = space followed by $i.
       :(i%numberOfFields>1?OFS $i:$i)  ##Else, if the modulus i%numberOfFields is greater than 1, printing OFS $i; otherwise printing $i.
     )
)
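
The program above prints field by field, so it does not apply the minimum length for the final line, and when the item count is an exact multiple of numberOfFields it leaves a dangling headword at the end. One possible variation, offered as a sketch under assumptions rather than as the answerer's method, is to build each output line in full before printing it. wrap_items and its max/min arguments are hypothetical names; the rule used (keep the tail on the same line when no more than min items would be left) follows the sample output shown earlier for min=3, max=9.

```shell
# wrap_items [MAX] [MIN]  (hypothetical helper, for illustration only)
# Reads "headword = item, item, ..." lines on stdin and rewraps each one.
wrap_items() {
  awk -v max="${1:-9}" -v min="${2:-3}" '
  {
    eq = index($0, " = ")
    if (eq == 0) { print; next }              # not a "headword = list" line: pass through
    head = substr($0, 1, eq - 1)              # text before " = "
    n = split(substr($0, eq + 3), item, /, /) # items after " = "
    i = 1
    while (i <= n) {
      left = n - i + 1
      # a full line only if more than min items would remain afterwards
      take = (left - max > min) ? max : left
      out = head " = " item[i]
      for (j = i + 1; j < i + take; j++) out = out ", " item[j]
      print out                               # each line is printed whole
      i += take
    }
  }'
}
```

It could be called as wrap_items < Input_file, or as wrap_items 9 3 < Input_file to spell out the defaults. Because every line is assembled before printing, multi-line input and short lists need no special handling in an END block.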