Combine or group lines by same words using macOS sed or awk or grep or gsub

Combine or group lines by same words using macOS sed or awk or grep or gsub (prefer awk):

If a line has a single class, combine such lines using "or", for example

.class1

.class2

.class3

to

(.class1 or .class2 or .class3)

If a line already has two or more classes and some of them match classes on other lines, the lines are combined as follows:

.class4 .class5

.class4 .class6

.class9 .class10 .class11

.class9 .class10 .class12

to

.class4 and (.class5 or .class6)

.class9 and .class10 and (.class11 or .class12)

Here is an example text file:

file.txt

.class1
.class2
.class3
.class4
.class4 .class5
.class4 .class6
.class7 .class8
.class9
.class9 .class10 
.class9 .class10 .class11
.class9 .class10 .class12

Expected output:

(.class1 or .class2 or .class3 or .class4 or .class9)
.class4 and (.class5 or .class6)
.class7 and .class8
.class9 and .class10
.class9 and .class10 and (.class11 or .class12)

Here is what I tried:

awk '/ /{if (x)print x;x="";}{x=(!x)?$0:x" or "$0;}END{print x;}' file.txt > file1.txt

I got the following result:

.class1 or .class2 or .class3 or .class4
.class4 .class5
.class4 .class6
.class7 .class8 or .class9
.class9 .class10 
.class9 .class10 .class11
.class9 .class10 .class12

then

awk 'BEGIN{FS=OFS=" "} {c=$1 FS $3; if (c in a) a[c]=a[c] FS $2; else a[c]=$2} END{for (k in a) print k " and", a[k]}' file1.txt > file2.txt

gives

.class4  and .class5 .class6
.class9  and .class10
.class7 or and .class8
.class1 .class2 and or
.class9 .class11 and .class10
.class9 .class12 and .class10

CodePudding user response：

This might work for you (GNU sed):

sed -E 's/^((\S+ )*)(\S+)\s*$/\1(\3)/
        :a;$!{N;s/^(.*)(\(.*)\)\n\1(\S+)$/\1\2 or \3)/;ta}
        h;s/\(.*//;s/ / and /g;x;s/.*\(/(/;H;x;s/\n//
        s/\((\S+)\)/\1/;P;D' file

Surround the last class by parens.

Append the next line; if its size matches that of the previous line, reduce the two lines to the first, with the last field of the second line added inside the first line's parens, separated by or.

If the size of the appended line is not the same, replace the spaces to the left of the parens with and.

If the parens enclose a single word, remove them.

Print, then delete the first line and repeat.

N.B. By size, understand the number of words as well as duplicate keys, where non-duplicate keys (if present) indicate a change in size. Also, the file may need sorting by the number of words per line; use:

sed 's/.*/echo "&"|wc -c/e;G;s/\n/ /' file | sort -ns | sed -E 's/^\S+ //' > newFile
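
Because the e flag of the s command used above is a GNU sed extension, a sketch that relies only on awk, sort and cut may be easier to run on macOS. It counts words with awk's NF rather than calling wc. The helper name sort_by_word_count is hypothetical, and sort -s (stable sort) is assumed to be available, since POSIX does not require it:

```shell
# Hypothetical helper: order lines by their number of words, fewest first.
# Reads the named files, or standard input when none are given.
# Assumes sort supports -s (stable), which keeps equal-length lines in input order.
sort_by_word_count() {
    awk '{ print NF "\t" $0 }' "$@" |   # prefix each line with its field count
        sort -s -n -k1,1 |              # stable numeric sort on that count only
        cut -f2-                        # drop the count again
}

# Usage: sort_by_word_count file > newFile
```

Lines with the same number of words keep their original relative order, which matters because the grouping depends on neighbouring lines.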

CodePudding user response：

Would you please try an awk solution? The code is POSIX-compliant and should work on macOS:

awk '
# concatenate classes except for the rightmost one
# if str has single class, return "_"
# the returned value is used to index to an array "rights"
function join(str, delim,    n, i, a, x) {
    n = split(str, a)
    x = "_"
    for (i = 1; i < n; i++) {
        x = x delim a[i]
    }
    return x
}
{
    left = join($0, " and ")            # concatenate classes except for the rightmost one
    right = $NF                         # rightmost class
    if (left in rights) {               # append the "right" to the list indexed by "left"
        rights[left] = rights[left] " or " right
    } else {
        rights[left] = right
        lefts[++c] = left               # keep the order of "left"s
    }
}
END {
    for (i = 1; i <= c; i++) {          # loop over the "left"s
        left = lefts[i]
        right = rights[left]
        if (right ~ / or /) right = "(" right ")"
                                        # surround multiple classes with parens
        if (left == "_") {              # the line has single class
            print right
        } else {
            sub(/^_ and /, "", left)    # trim the unnecessary substring off
            print left " and " right
        }
    }
}
' file.txt

Result with the provided example file:

(.class1 or .class2 or .class3 or .class4 or .class9)
.class4 and (.class5 or .class6)
.class7 and .class8
.class9 and .class10
.class9 and .class10 and (.class11 or .class12)
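
To see how the script above groups lines, it can help to print the key and the rightmost class for each line before any merging takes place. The following is a minimal sketch and not part of the answer; the function name show_keys is hypothetical, and the key is built the same way the join function builds it (a leading "_" followed by every class except the last):

```shell
# Hypothetical helper: print "key<TAB>rightmost class" for every input line.
# Lines that share a key are the ones the answer's script merges with " or ".
show_keys() {
    awk '{
        key = "_"                                   # a single-class line keeps the bare "_"
        for (i = 1; i < NF; i++) key = key " and " $i
        print key "\t" $NF
    }' "$@"
}

# Usage: show_keys file.txt
```

For the example file, every single-class line gets the key "_", while .class4 .class5 and .class4 .class6 both get "_ and .class4", which is why they end up on one output line.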