Can I do the following with awk? Or is there a better way to do this?

I am looking at some reviews and trying to decide on the best company to buy apples from (for example). I copied and pasted the text below, and I want to do some text processing on it with Linux commands. From what I have read online, awk is a good choice, but I cannot get it to work.

I tried to take each line that has a rating and append it to the line above, separated by a comma. For example, Abes Apples\n 4.1 would become Abes Apples, 4.1, and this would be repeated for every entry. The awk command I tested was awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text and it gives the result below, but it replaces the first digit as well.

Abes Apples, .1,
(138) · apple company,   years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0  years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0  years in business (345) 678-9012
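
The digit disappears because gsub replaces the whole matched text, and /\n[0-9]/ matches the newline together with the digit that follows it. A minimal sketch of one way to keep the digit in any POSIX awk is to locate each match with match() and rebuild the record around the newline only. The temporary sample file and the variable names input and joined are assumptions made for illustration.

```shell
# Hypothetical sample file with the first two entries from the question.
input=$(mktemp)
cat > "$input" <<'EOF'
Abes Apples
4.1,
(138) · apple company
7  years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10  years in business (234) 567-8901
EOF

joined=$(awk 'BEGIN { RS = "" }
{
    # RSTART points at the newline: keep everything before it, insert ", ",
    # and resume right after the newline so the digit is preserved.
    while (match($0, /\n[0-9]/))
        $0 = substr($0, 1, RSTART - 1) ", " substr($0, RSTART + 1)
    print
}' "$input")

printf '%s\n' "$joined"
rm -f "$input"
```

With this sample, the first printed line is Abes Apples, 4.1, and the digit is kept. The line starting with the years is also joined to the line above it, because it begins with a digit too.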

The text file pattern is as follows and repeats for all lines in text file:

Company name
Rating
Number of reviews and company type
Years in business and phone number

My goal is to convert this text file to a CSV-like format with column headers for company name, rating, number of reviews (ignoring the 'apple company' text), years in business and phone number. Is this something that can be done with awk and other Linux commands?

Current Input:

Abes Apples
4.1,
(138) · apple company
7  years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10  years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10  years in business (345) 678-9012

Desired Output:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012

CodePudding user response：

With RS in paragraph mode, you could try the following GNU awk code. It was written and tested with your shown samples only. It uses GNU awk's match function with the regex (^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\ ?[^(]*([^\n]*) (explained further down in this answer); the match fills an array named arr, whose indexes 1, 2, 3 and so on correspond to the capturing groups in the regex.

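awk -v RS= -v OFS=", " '
{
  ## Match one four-line entry at a time; the captured groups are stored in arr (GNU awk only).
  while(match($0,/(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\ ?[^(]*([^\n]*)/,arr)){
     ## arr[3] already ends with a comma, so it is printed joined directly to arr[5].
     print arr[2],arr[3]arr[5],arr[6],arr[7]
     ## Drop everything up to the end of the current match, then look for the next entry.
     $0=substr($0,RSTART+RLENGTH)
  }
}
'  Input_file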

Output will be as follows:

Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012

Explanation: A detailed explanation of the regex used.

(^|\n)         ##1st capturing group: either the start of the value OR a new line.
([^\n]*)       ##2nd capturing group: everything just before the next occurrence of a new line.
\n             ##Matching a new line here.
([0-9]+(\.[0-9]+)?,) ##3rd and 4th capturing groups: digits (1 or more occurrences), optionally followed by a dot and 1 or more digits (the optional 4th group), then a comma.
\n             ##Matching a new line here.
(\([0-9]+\))   ##5th capturing group: ( followed by 1 or more digits followed by ).
[^\n]*\n       ##Matching everything just before the new line, followed by the new line.
([0-9]+)       ##6th capturing group: 1 or more digits.
\ ?[^(]*       ##Matching an optional literal space, followed by everything just before (
([^\n]*)       ##7th capturing group: everything just before the next new line.
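
The three-argument form of match is specific to GNU awk. If the file really does repeat in strict four-line groups with no blank lines in between, as described in the question, a portable sketch can pick each value by line position instead. This is only an illustration under that assumption, not a replacement for the tested command above. The temporary sample file and the variable names input and out are hypothetical.

```shell
# Hypothetical sample file reproducing the input shown in the question.
input=$(mktemp)
cat > "$input" <<'EOF'
Abes Apples
4.1,
(138) · apple company
7  years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10  years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10  years in business (345) 678-9012
EOF

out=$(awk '
NR % 4 == 1 { name = $0 }                                  # company name
NR % 4 == 2 { rating = $0; sub(/,[ \t]*$/, "", rating) }   # rating without its trailing comma
NR % 4 == 3 { reviews = $1 }                               # first field, e.g. (138)
NR % 4 == 0 {
    years = $1                                             # leading number of years
    phone = $0; sub(/^[^(]*/, "", phone)                   # keep text from the first "(" onwards
    print name ", " rating ", " reviews ", " years ", " phone
}' "$input")

printf '%s\n' "$out"
rm -f "$input"
```

For the sample, this prints one line per company, for example Abes Apples, 4.1, (138), 7, (123) 456-7890. Any entry with a missing or extra line would shift every later entry, which is the trade-off compared with the regex-based approach.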