I am looking at some reviews and trying to decide on the best company to buy apples from (for example). I copied and pasted the text below, and I want to do some text processing on it with Linux commands. From what I have read online, awk is a good choice, but I cannot get it to work.
I tried to take the line that has a rating and append it to the line above, separated by a comma. For example: Abes Apples\n 4.1
would become Abes Apples, 4.1
and this would be repeated. The awk command I tested was awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text
and it gives the result below, but it is replacing the digit:
Abes Apples, .1,
(138) · apple company, years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0 years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0 years in business (345) 678-9012
The text file follows the pattern below, which repeats for the whole file:
- Company name
- Rating
- Number of reviews and company type
- Years in business and phone number
My goal is to convert this text file to a CSV-like format with column headers of company name, rating, number of reviews (ignoring the 'apple company' text), years in business and phone number. Is this something that can be done with awk and other Linux commands?
Current Input:
Abes Apples
4.1,
(138) · apple company
7 years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10 years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10 years in business (345) 678-9012
Desired Output:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012
CodePudding user response:
With paragraph mode of RS in GNU awk, you could try the following awk code. Written and tested with your shown samples only. It uses the match function of GNU awk with the regex (^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*) (explained further down in this answer); this creates an array named arr whose indexes are 1, 2, 3 and so on, one per capturing group.
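awk -v RS= -v OFS=", " '
{
  # Consume one 4-line record per iteration; arr[n] holds the n-th capturing group.
  while(match($0,/(^|\n)([^\n]*)\n([0-9]+(\.[0-9]+)?,)\n(\([0-9]+\))[^\n]*\n([0-9]+)\+?[^(]*([^\n]*)/,arr)){
    print arr[2],arr[3]arr[5],arr[6],arr[7]
    # Drop the matched part so the next iteration starts at the following record.
    $0=substr($0,RSTART+RLENGTH)
  }
}
' Input_file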
Output will be as follows:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012
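As a side note on the attempt in the question: gsub(/\n[0-9]/, ", ") replaces the whole match, and the match includes the digit, which is why "4.1" turned into ".1". Below is a minimal sketch of one way to keep the digit using only match and substr; the function name join_digit_lines is hypothetical and only there to make the example callable. In GNU awk, gensub with a capturing group and "\\1" in the replacement is another option.

```shell
# Hypothetical helper: join every line that starts with a digit onto the
# previous line with ", " while keeping that digit.
join_digit_lines() {
  awk 'BEGIN { RS = "" }
  {
    rest = $0; out = ""
    while (match(rest, /\n[0-9]/)) {
      # Text before the newline, then ", ", then the digit that was matched.
      out = out substr(rest, 1, RSTART - 1) ", " substr(rest, RSTART + 1, 1)
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }' "$1"
}
```

Called as join_digit_lines test.text, this only reproduces the joining step the question was attempting; it does not produce the final CSV-like output.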
Explanation: a detailed explanation of the regex used.
(^|\n) ##1st capturing group: matches either the start of the value OR a new line.
([^\n]*) ##2nd capturing group: everything up to the next new line (the company name).
\n ##Matches a new line.
([0-9]+(\.[0-9]+)?,) ##3rd and 4th capturing groups: 1 or more digits, optionally followed by a dot and 1 or more digits (the optional 4th group), then a comma.
\n ##Matches a new line.
(\([0-9]+\)) ##5th capturing group: ( followed by 1 or more digits followed by ).
[^\n]*\n ##Matches the rest of the line, followed by the new line.
([0-9]+) ##6th capturing group: 1 or more digits.
\+?[^(]* ##Matches an optional literal + followed by everything before the next (.
([^\n]*) ##7th capturing group: everything up to the next new line (the phone number).
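If GNU awk is not available, the same conversion can be sketched without the three-argument match by relying on line position alone. This assumes every record is exactly four lines in the order shown in the question, with no blank or missing lines; the function name reviews_to_csv is hypothetical. Unlike the command above, it drops the trailing comma from the rating and puts ", " between every column.

```shell
# Hypothetical helper: turn 4-line review records into CSV-like rows.
# Assumes each record is exactly: name, rating, reviews/type, years/phone.
reviews_to_csv() {
  awk '
    NR % 4 == 1 { name = $0 }                           # company name
    NR % 4 == 2 { rating = $0; sub(/,$/, "", rating) }  # rating, trailing comma removed
    NR % 4 == 3 { reviews = $1 }                        # first field, e.g. (138)
    NR % 4 == 0 {
      years = $1; sub(/[+]$/, "", years)                # tolerate a trailing + if present
      phone = $0; sub(/^[^(]*/, "", phone)              # keep from the first ( onward
      print name ", " rating ", " reviews ", " years ", " phone
    }
  ' "$1"
}
```

It would be called as reviews_to_csv Input_file. Because it counts lines, a single stray or missing line shifts every later record, so the regex-based version is more forgiving when the input is less regular.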