I am looking at some reviews and trying to decided the best company to buy apples (for example). I copied and pasted the text below I want to do some text-processing on it with Linux commands. From what I have read online awk is a good choice but I cannot get it to work.
I tried to take the line that has a rating and append it to the line above with a comma separation. For example: Abes Apples\n 4.1
would become Abes Apples, 4.1
and this would be repeated. My awk command tested was awk 'BEGIN {RS=""}{gsub(/\n[0-9]/, ", ", $0); print $0}' test.text
and it give a result below but it is replacing the digit..
Abes Apples, .1,
(138) · apple company, years in business (123) 456-7890
Adams Apples, .9,
(105) · apple company, 0 years in business (234) 567-8901
Apples are Amazing, .9,
(13) apple company, 0 years in business (345) 678-9012
The text file pattern is as follows and repeats for all lines in text file:
- Company name
- Rating
- Number of reviews and company type
- Years in business' and phone number
My goal is to convert this text file to csv like format where I have column headers of company name, rating, number of reviews (ignoring the 'apple company' text), years in buisness and phone number. Is this something that can be done with awk and other linux commands?
Current Input:
Abes Apples
4.1,
(138) · apple company
7 years in business (123) 456-7890
Adams Apples
4.9,
(105) · apple company
10 years in business (234) 567-8901
Apples are Amazing
3.9,
(13) apple company
10 years in business (345) 678-9012
Desired Output:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012
CodePudding user response:
With paragraph mode of RS
in GNU awk
you could try following awk
code. Written and tested with your shown samples only. Using match
function of GNU awk
where using regex (^|\n)([^\n]*)\n([0-9] (\.[0-9] )?,)\n(\([0-9] \))[^\n]*\n([0-9] )\ ?[^(]*([^\n]*)
(explained further down in this answer); this is creating an array named arr
whose indexes are 1,2,3 and so on depending upon how many capturing groups are being created.
awk -v RS= -v OFS=", " '
{
while(match($0,/(^|\n)([^\n]*)\n([0-9] (\.[0-9] )?,)\n(\([0-9] \))[^\n]*\n([0-9] )\ ?[^(]*([^\n]*)/,arr)){
print arr[2],arr[3]arr[5],arr[6],arr[7]
$0=substr($0,RSTART RLENGTH)
}
}
' Input_file
Output will be as follows:
Abes Apples, 4.1,(138), 7, (123) 456-7890
Adams Apples, 4.9,(105), 10, (234) 567-8901
Apples are Amazing, 3.9,(13), 10, (345) 678-9012
Explanation: Adding detailed explanation for used regex.
(^|\n) ##Creating 1st capturing group which has either starting of value OR new line.
([^\n]*) ##Creating 2nd capturing group which contains everything just before next occurrence of new line.
\n ##Matching a new line here.
([0-9] (\.[0-9] )?,) ##Creating 3rd and 4th capturing group and matchig digits(1 or more occurrences) followed by dot followed by 1 or more digits keeping 4th capturing group as optional.
\n ##Matching a new line here.
(\([0-9] \)) ##Creating 5th capturing group which has ( followed by digits followed by ).
[^\n]*\n ##Matching everything just before new line followed by new line.
([0-9] ) ##Creating 6th capturing group which has 1 or more digits in it.
\ ?[^(]* ##Matching literal keeping it optional followed by everything just before (
([^\n]*) ##Creating 7th capturing group and matching everything just before new line here.
CodePudding user response:
An alternative generic approach to parse files where the information for each output record is separated on succesive lines, as in the OP's example, is to set up pattern blocks to process lines depending on their position in the multiline input file.
For example, in the original question, the intended output records need information spanning four lines of the multiline input. The input file has company names in the 1st, 5th, 9th, (...n 4th), while the rating information is in lines 2, 6, 19 (...n 4). etc.
Thus blocks can be established by checking the line position using modulo division of the input line number by the repeat pattern size (4 in this case) of the input records:
(NR-1)%4 == 0 { #code to apply to lines 1, 5, 9, ...n 4 }
(NR-2)%4 == 0 { #code to apply to lines 2, 6, 10, ...n 4 }
#etc.
Formatting the csv output is simplified if commas and line breaks are set manually from within the code blocks so a BEGIN
block can be used to suppress the default newline output record separator:
BEGIN {ORS=""}
This allows for fields to pulled out of different positions depending on which line of the repeating input records they sit in.
Subsititutions can be targetted to lines easily too, which may simplify regex constructions. For example, the data in the question requires the removal of the plus sign from the first field in the 4th, 8th, 12th .. n 4th lines so a simple substitution pattern can be added to affect only those lines. The same lines also had intended output in the last and second last fields (which hold the code and phone number). The block required for those lines becomes:
(NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}
Note the insertion of the (output field-separating) comma and (output record-separating) new line.
The overall awk commands can be built line-by-line in the terminal without needing to place them in a separate script file:
awk 'BEGIN {ORS=""}
(NR-1)%4 == 0 {print $0 ", "}
(NR-2)%4 == 0 {print $0 " "}
(NR-3)%4 == 0 {print $1 ", "}
(NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}
' input.txt
or entered as a single line for simple cases:
awk 'BEGIN {ORS=""} (NR-1)%4 == 0 {print $0 ", "} (NR-2)%4 == 0 {print $0 " "} (NR-3)%4 == 0 {print $1 ", "} (NR-4)%4 == 0 {sub(/\ /,"");print $1 ", " $(NF-1) " " $NF "\n"}' apples.txt
Output from original question input data:
Abes Apples, 4.1, (138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012
CodePudding user response:
Using any awk:
$ cat tst.awk
BEGIN {
blockSize = 4
OFS = ", "
}
{
lineNr = (NR - 1) % blockSize 1
}
lineNr == 1 {
rec = $0
}
lineNr == 2 {
sub(/,.*/,"")
rec = rec OFS $0
}
lineNr == 3 {
sub(/ .*/,"")
rec = rec OFS $0
}
lineNr == 4 {
years = $1
sub(/[ ].*/,"",years)
sub(/.*\(/,"(")
rec = rec OFS years OFS $0
print rec
}
$ awk -f tst.awk file
Abes Apples, 4.1, (138), 7, (123) 456-7890
Adams Apples, 4.9, (105), 10, (234) 567-8901
Apples are Amazing, 3.9, (13), 10, (345) 678-9012