How to query text file version of LibreOffice Thesaurus in bash (joining lines)

Time:07-08

I am trying to write a simple script in bash to query the LibreOffice thesaurus extension as a text file. For each input query string, I want the output to be all the related strings. And I want to do this in bash.

To download and extract the thesaurus, I do

wget "https://extensions.libreoffice.org/assets/downloads/41/1653961771/dict-en-20220601_lo.oxt" # download LO dictionary & thesaurus

unzip -p dict-en-20220601_lo.oxt th_en_US_v2.dat > lo # extract contents of thesaurus to text file

Taking a look at part of the text file:

nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
nine-membered|1
(adj)|9-membered|membered (similar term)
nine-sided|1
(adj)|multilateral (similar term)|many-sided (similar term)
nine-spot|1
(noun)|spot (generic term)

So, for example, I want to be able to input "nine" as a query and have bash return something like

9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

I think this should be fairly easy to do using the right syntax with awk or sed, especially since all of the lines containing query terms do NOT begin with "(" and all of the lines containing related terms DO begin with "(".

But I'm still somewhat of a newbie, and haven't been able to figure it out yet. The crux of the matter for me seems to be getting the query term and all related terms onto a single line. From there, I know how to sed my way to victory. But getting to that point has proven challenging for me.

TIA for your help!

p.s. I'm trying to do something similar to this, but my situation is a little different, and I don't understand the syntax well enough to modify it for my needs: https://www.unix.com/unix-for-dummies-questions-and-answers/184649-sed-join-lines-do-not-match-pattern.html
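The joining step described in the question (header line plus all of its "(" lines on a single line) can be sketched in awk as follows. This is only an illustration of that intermediate step, not one of the answers below; the function name join_entries is hypothetical, and it assumes every line that does not start with "(" begins a new entry.

```shell
# join_entries FILE: print every thesaurus entry on one line.
# Assumption: a line NOT starting with "(" begins a new entry, and every
# line starting with "(" belongs to the entry before it.
join_entries() {
    awk '
        # Meaning line: append it to the entry collected so far.
        /^\(/ { line = line "|" $0; next }
        # Header line: print the previous entry (if any), then start a new one.
        { if (NR > 1) print line; line = $0 }
        # Print the last entry once the input is exhausted.
        END { if (NR > 0) print line }
    ' "$1"
}
```

Calling join_entries lo would then give lines such as "nine-spot|1|(noun)|spot (generic term)", which can be filtered further with grep or sed.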

CodePudding user response:

Using sed

$ cat script.sed
N
{
    /\(/ {
        /9/!s/[^|]*\|//
        s/\n/ /
        {
            /[^|]*\|(9\|)/ { 
                s//\1/
                s/([^|]*)\|/\1\n/g
                s/\([^)]*\)//
                s/\([^)]*\)//g
                p
            }
        }
    }
}
$ sed -Enf script.sed input_file
nine
9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

CodePudding user response:

If I understand your problem correctly, here is an Awk solution.

File search.awk:

#! /usr/bin/awk -f

BEGIN {
    # Field separator
    FS = "|"
}
$1 == KEY {
    # Key found, flag it
    flag  = 1
    # Associated words init
    words = ""
    next
}
flag == 1 && $0 ~ /^\(/ {
    # Association found
    # For all associations (field)
    idx = 2
    while (idx <= NF) {
        # get it
        word = $idx
        # remove term in parenthesis
        gsub(/ \(.*$/, "", word)
        # save it (with a separator)
        words = words "," word
        # next field
        idx += 1
    }
}
flag == 1 && $0 !~ /^\(/ {
    # End of association
    # Print Key and words
    if (words != "") {
        print KEY words
    }
    # Reinit words
    words = ""
    flag = 0
}
END {
    # Special case, last word in thesaurus
    # Print Key and words
    if (words != "") {
        print KEY words
    }
}

Executable with:

chmod 755 ./search.awk

Used like this:

./search.awk -v KEY="nine" lo

Output:

nine,9,ix,cardinal,9,IX,niner,Nina from Carolina,ennead,digit,figure,baseball club,ball club,club,baseball team

CodePudding user response:

Do you mean the number 9? This produces your output. You need to prevent regex characters in the query.

read -p 'query? ' query
[[ $query =~ ^[[:alnum:]_-]+$ ]] &&
sed -n '
/^([^)]*)|\('"$query"'|.*\)/ {s/^([^)]*)|//; s/|/\
/g;p}' lo

Parsing the file properly is more complicated than this though.
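As a rough sketch of what a more careful parse could look like: in the excerpt, each header line has the form word|N and is followed by exactly N lines of meanings (nine|3 is followed by three "(" lines). Assuming that holds for the whole file, the count can drive the parsing, and the query can be compared as a plain string so that regex characters in it do not matter. The function name thes_lookup is made up for this example, and the default file name lo is the one used in the question.

```shell
# thes_lookup WORD [FILE]: print the related terms for WORD, one per line.
# Assumption: each entry is a "word|N" header followed by exactly N lines
# of the form "(pos)|term|term (note)|...", as in the excerpt.
thes_lookup() {
    awk -F'|' -v key="$1" '
        # Still inside the block of meaning lines for the matched header.
        remaining > 0 {
            # Field 1 is the part of speech, so start at field 2.
            for (i = 2; i <= NF; i++) {
                term = $i
                # Drop a trailing note such as " (generic term)".
                sub(/ \([^)]*\)$/, "", term)
                print term
            }
            # Stop once all N meaning lines have been consumed.
            if (--remaining == 0) exit
            next
        }
        # Header line: compare as strings (the "" forces a string compare).
        NF == 2 && ($1 "") == key { remaining = $2 + 0 }
    ' "${2:-lo}"
}
```

With the excerpt from the question, thes_lookup nine lo should print the fourteen terms from 9 through baseball team, one per line.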

CodePudding user response:

This might work for you (GNU sed):

v=nine
sed -n ':a;/^'"${v}"'|/{:b;n;/^[^(]/ba;s/^[^|]*|\| ([^)]*)//g;y/|/\n/;p;bb}' file

Focus on any lines following a match on the input variable.

Fetch the next line; if it does not begin with (, jump back to the start (label a) and check whether that line is itself a match.

Otherwise, remove the first field and any values between parens, replace the field separators | by newlines, print the result and repeat.
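For readability, the same steps can be laid out one command per line. The version below is a sketch rather than a drop-in replacement: the wrapper name thes_sed is hypothetical, the combined substitution is split into two so that the GNU-only \| alternation is not needed, and the query is still pasted into the regex, so it is assumed to contain no characters that are special to sed.

```shell
# thes_sed WORD FILE: print the related terms for WORD, one per line.
# Assumption: WORD contains no regex metacharacters and no "/".
thes_sed() {
    sed -n '
        :a
        # Header line for the query: "WORD|count".
        /^'"$1"'|/ {
            :b
            # Fetch the next line without printing anything.
            n
            # Not a meaning line: go back and test it as a header.
            /^[^(]/ ba
            # Remove the leading "(pos)|" field.
            s/^[^|]*|//
            # Remove notes such as " (generic term)".
            s/ ([^)]*)//g
            # Put each remaining term on its own line and print.
            y/|/\n/
            p
            bb
        }
    ' "$2"
}
```

Usage would be thes_sed nine lo, which on the excerpt from the question should give the same list as the one-liner.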
