Combine or group lines that share the same words, using sed, awk (for example with gsub), or grep on macOS (awk preferred):
If a line has a single class, combine such lines using "or", for example
.class1
.class2
.class3
to
(.class1 or .class2 or .class3)
If a line already has two or more classes and its leading classes match those of other lines, the lines are combined as follows:
.class4 .class5
.class4 .class6
.class9 .class10 .class11
.class9 .class10 .class12
to
.class4 and (.class5 or .class6)
.class9 and .class10 and (.class11 or .class12)
Here is an example text file, file.txt:
.class1
.class2
.class3
.class4
.class4 .class5
.class4 .class6
.class7 .class8
.class9
.class9 .class10
.class9 .class10 .class11
.class9 .class10 .class12
Expected output:
(.class1 or .class2 or .class3 or .class4 or .class9)
.class4 and (.class5 or .class6)
.class7 and .class8
.class9 and .class10
.class9 and .class10 and (.class11 or .class12)
Here is what I tried:
awk '/ /{if (x)print x;x="";}{x=(!x)?$0:x" or "$0;}END{print x;}' file.txt > file1.txt
which gave the following result:
.class1 or .class2 or .class3 or .class4
.class4 .class5
.class4 .class6
.class7 .class8 or .class9
.class9 .class10
.class9 .class10 .class11
.class9 .class10 .class12
Then I ran:
awk 'BEGIN{FS=OFS=" "} {c=$1 FS $3; if (c in a) a[c]=a[c] FS $2; else a[c]=$2} END{for (k in a) print k " and", a[k]}' file1.txt > file2.txt
which gives:
.class4 and .class5 .class6
.class9 and .class10
.class7 or and .class8
.class1 .class2 and or
.class9 .class11 and .class10
.class9 .class12 and .class10
CodePudding user response:
This might work for you (GNU sed):
sed -E 's/^((\S+ )*)(\S+)\s*$/\1(\3)/
:a;$!{N;s/^(.*)(\(.*)\)\n\1(\S+)$/\1\2 or \3)/;ta}
h;s/\(.*//;s/ / and /g;x;s/.*\(/(/;H;x;s/\n//
s/\((\S+)\)/\1/;P;D' file
Surround the last class by parens.
Append another line and, if that line's size matches the previous one, reduce the two lines to the size of the first, with the last field of the second line included within the first line's parens, separated by or.
If the size of the appended line is not the same, replace the spaces to the left of the parens by and.
If the parens enclose a single word, remove them.
Print then delete the first line and repeat.
N.B. "Size" here means the number of words on a line together with its leading keys: a line with the same number of words but different leading keys also counts as a change in size. The file may also need sorting by the number of words per line first; use:
sed 'h;s/.*/echo "&"|wc -w/e;G;s/\n/ /' file | sort -ns | sed -E 's/^\S+ //' > newFile
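Since the question is about macOS, note that the sorting step above relies on GNU sed's e flag, which the stock macOS sed does not have. The same ordering by word count can be sketched with awk, sort and cut alone. This is only a sketch: sort_by_words is a made-up helper name, and it assumes your sort accepts -s for a stable sort (both the GNU and the macOS versions do).

```shell
# Hypothetical helper: order lines by word count, keeping the original
# relative order of lines that have the same count.
sort_by_words() {
  awk '{ print NF, $0 }' "$@" |   # prefix each line with its field count
    sort -n -s -k1,1 |            # numeric, stable sort on that count only
    cut -d " " -f 2-              # drop the count again
}

# Demo on inline data instead of a real file
printf '%s\n' '.class9 .class10' '.class1' '.class4 .class5' '.class2' | sort_by_words
```

With a real file it would be called as sort_by_words file > newFile. The stable sort matters because the grouping logic depends on lines with the same prefix staying next to each other.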
CodePudding user response:
Would you please try the following awk solution? The code is POSIX-compliant and should work on macOS:
awk '
# Concatenate all classes except the rightmost one, separated by delim.
# The result always starts with "_"; a line with a single class yields just "_".
# The returned value is used as the index into the array "rights".
function join(str, delim,    n, i, a, x) {
    n = split(str, a)
    x = "_"
    for (i = 1; i < n; i++) {
        x = x delim a[i]
    }
    return x
}
{
    left = join($0, " and ")    # classes except the rightmost, joined by " and "
    right = $NF                 # rightmost class
    if (left in rights) {       # append "right" to the list indexed by "left"
        rights[left] = rights[left] " or " right
    } else {
        rights[left] = right
        lefts[++c] = left       # remember the order in which "left"s first appear
    }
}
END {
    for (i = 1; i <= c; i++) {  # loop over the "left"s in input order
        left = lefts[i]
        right = rights[left]
        # surround multiple classes with parens
        if (right ~ / or /) right = "(" right ")"
        if (left == "_") {      # the line has a single class
            print right
        } else {
            sub(/^_ and /, "", left)    # trim the placeholder prefix
            print left " and " right
        }
    }
}
' file.txt
Result with the provided example file:
(.class1 or .class2 or .class3 or .class4 or .class9)
.class4 and (.class5 or .class6)
.class7 and .class8
.class9 and .class10
.class9 and .class10 and (.class11 or .class12)
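To see why the lines group the way they do, it may help to print the key each line is filed under. The sketch below mirrors the key construction described in the comments of the awk script (a "_" placeholder followed by every class except the last); show_keys is a made-up name used only for illustration.

```shell
# Hypothetical helper: show the grouping key and the rightmost class per line.
show_keys() {
  awk '{
    key = "_"                                        # placeholder for "no left part"
    for (i = 1; i < NF; i++) key = key " and " $i    # every class except the last
    print key " => " $NF                             # grouping key, then rightmost class
  }' "$@"
}

printf '%s\n' '.class9' '.class4 .class5' '.class4 .class6' | show_keys
# _ => .class9
# _ and .class4 => .class5
# _ and .class4 => .class6
```

Lines that print the same key end up in one output line, with their rightmost classes joined by " or ". Because the grouping is done through an array indexed by that key, the input does not have to be sorted for this approach.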