Transform multiline text to CSV using awk-CodePudding

I am looking at some reviews and trying to decided the best company to buy apples (for example). I copied and pasted the text below I want to do some text-processing on it with Linux commands. From what I have read online awk is a good choice but I cannot get it to work.

I tried to take the line that has a rating and append it to the line above with a comma separation. For example: Abes Apples\n 4.1 would become Abes Apples, 4.1 and this would be repeated. My awk command tested was awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text and it give a result below but it is replacing the digit..

Abes Apples, .1,
(138) · apple company,   years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0  years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0  years in business (345) 678-9012

The text file pattern is as follows and repeats for all lines in text file:

Company name
Rating
Number of reviews and company type
Years in business' and phone number

My goal is to convert this text file to csv like format where I have column headers of company name, rating, number of reviews (ignoring the 'apple company' text), years in buisness and phone number. Is this something that can be done with awk and other linux commands?

Current Input:

Abes Apples
4.1,
(138) · apple company
7  years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10  years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10  years in business (345) 678-9012

Desired Output:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012

CodePudding user response：

With paragraph mode of RS in GNU awk you could try following awk code. Written and tested with your shown samples only. Using match function of GNU awk where using regex (^|\n)([^\n]*)\n([0-9] (\.[0-9] )?,)\n($[0-9] $)[^\n]*\n([0-9] )\ ?[^(]*([^\n]*)(explained further down in this answer); this is creating an array named arr whose indexes are 1,2,3 and so on depending upon how many capturing groups are being created.

awk -v RS= -v OFS=", " '
{
  while(match($0,/(^|\n)([^\n]*)\n([0-9] (\.[0-9] )?,)\n(\([0-9] \))[^\n]*\n([0-9] )\ ?[^(]*([^\n]*)/,arr)){
     print arr[2],arr[3]arr[5],arr[6],arr[7]
     $0=substr($0,RSTART RLENGTH)
  }
}
'  Input_file

Output will be as follows:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012

Explanation: Adding detailed explanation for used regex.

(^|\n)         ##Creating 1st capturing group which has either starting of value OR new line.
([^\n]*)       ##Creating 2nd capturing group which contains everything just before next occurrence of new line.
\n             ##Matching a new line here.
([0-9] (\.[0-9] )?,) ##Creating 3rd and 4th capturing group and matchig digits(1 or more occurrences) followed by dot followed by 1 or more digits keeping 4th capturing group as optional.
\n             ##Matching a new line here.
(\([0-9] \))   ##Creating 5th capturing group which has ( followed by digits followed by ).
[^\n]*\n       ##Matching everything just before new line followed by new line.
([0-9] )       ##Creating 6th capturing group which has 1 or more digits in it.
\ ?[^(]*       ##Matching literal   keeping it optional followed by everything just before (
([^\n]*)       ##Creating 7th capturing group and matching everything just before new line here.

CodePudding user response：

An alternative generic approach to parse files where the information for each output record is separated on succesive lines, as in the OP's example, is to set up pattern blocks to process lines depending on their position in the multiline input file.

For example, in the original question, the intended output records need information spanning four lines of the multiline input. The input file has company names in the 1st, 5th, 9th, (...n 4th), while the rating information is in lines 2, 6, 19 (...n 4). etc.

Thus blocks can be established by checking the line position using modulo division of the input line number by the repeat pattern size (4 in this case) of the input records:

(NR-1)%4 == 0 { #code to apply to lines 1, 5, 9, ...n 4 }
(NR-2)%4 == 0 { #code to apply to lines 2, 6, 10, ...n 4 }
#etc.

Formatting the csv output is simplified if commas and line breaks are set manually from within the code blocks so a BEGIN block can be used to suppress the default newline output record separator:

BEGIN {ORS=""}

This allows for fields to pulled out of different positions depending on which line of the repeating input records they sit in.

Subsititutions can be targetted to lines easily too, which may simplify regex constructions. For example, the data in the question requires the removal of the plus sign from the first field in the 4th, 8th, 12th .. n 4th lines so a simple substitution pattern can be added to affect only those lines. The same lines also had intended output in the last and second last fields (which hold the code and phone number). The block required for those lines becomes:

(NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}

Note the insertion of the (output field-separating) comma and (output record-separating) new line.

The overall awk commands can be built line-by-line in the terminal without needing to place them in a separate script file:

awk 'BEGIN {ORS=""} 
(NR-1)%4 == 0 {print $0 ", "} 
(NR-2)%4 == 0 {print $0 " "} 
(NR-3)%4 == 0 {print $1 ", "} 
(NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}
' input.txt

or entered as a single line for simple cases:

awk 'BEGIN {ORS=""} (NR-1)%4 == 0 {print $0 ", "} (NR-2)%4 == 0 {print $0 " "} (NR-3)%4 == 0 {print $1 ", "} (NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}' apples.txt

Output from original question input data:

Abes Apples, 4.1, (138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012

CodePudding user response：

Using any awk:

$ cat tst.awk
BEGIN {
    blockSize = 4
    OFS = ", "
}
{
    lineNr = (NR - 1) % blockSize   1
}
lineNr == 1 {
    rec = $0
}
lineNr == 2 {
    sub(/,.*/,"")
    rec = rec OFS $0
}
lineNr == 3 {
    sub(/ .*/,"")
    rec = rec OFS $0
}
lineNr == 4 {
    years = $1
    sub(/[ ].*/,"",years)

    sub(/.*\(/,"(")
    rec = rec OFS years OFS $0

    print rec
}

$ awk -f tst.awk file
Abes Apples, 4.1, (138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012