How to find list of unique affixes given a list of words?


An affix can be a prefix (before a word), infix (in the middle of a word), or suffix (after a word). I have a list of 200k Latin/Greek names used in biological taxonomy. It turns out there is, unfortunately, no centralized list of all the affixes used in the taxonomy, other than this very basic list.

The question is: how can I take that 200k list of Latin/Greek names and divide it into a list of affixes (ideally using just plain JavaScript)?

I don't really know where to begin on this one. If I construct a trie, I somehow need to test for specific chunks of words instead. Or, if a chunk can be extended, not include the chunk until we reach a final extension of some sort...

const fs = require('fs')
const words = fs.readFileSync(`/Users/lancepollard/Downloads/all.csv`, 'utf-8').trim().split(/\n+/) // one word per line
const trie = { children: {} }

words.forEach(word => addToTrie(trie, word))

function addToTrie(trie, word) {
  let letters = word.trim().split('')
  let node = trie
  let i = 0
  while (i < letters.length) {
    let letter = letters[i++] // advance the index, or this loops forever
    node = node.children[letter] = node.children[letter] || { children: {} }
  }
  node.isWord = true
}

It doesn't need to be exact, as in each affix actually meaning something; it can be dirty (some chunks will mean something, some won't). But it shouldn't just list every permutation of a word's letters. It should include "potential affix candidates": chunks which appear more than once in the list. This will at least get me partway there, and I can then manually go through and look up the definitions for each of these "chunks". Ideally, it should also tell whether each one is a prefix/infix/suffix. Maybe the output is in CSV format: affix,position.

You can get creative in how this is solved, as without knowing a list of possible affixes in advance, we don't know what the exact output should be. This is basically an attempt to find the affixes as best as possible. If it includes things like aa- as a prefix, for example, which is probably a common sequence of letters yet I don't think is an affix, that is fine with me; it can be filtered out manually. But if there are two words (I am making this up), say abrogati and abrowendi, then abro would be a "common prefix", and that should be included in the final list, not abr, ab, and a, even though those are common too. Basically, the longest common prefix. However, if we have the words apistal and ariavi, we could say that a is a common prefix, so our final list would include a and abro.

To go into slightly more detail, say we have these two words: aprineyanilantli and aboneyanomantli. They have the common prefix a-, the common suffix -antli, and the common infix -neyan-, so those should all be in the final list.
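
For what it's worth, here is roughly what I mean by "common prefix" and "common suffix" for a pair of words (just a sketch; the helper names are made up):

// Longest run of letters two words share at the start.
function longestCommonPrefix(a, b) {
  let i = 0
  while (i < a.length && i < b.length && a[i] === b[i]) i++
  return a.slice(0, i)
}

// Same idea, measured from the end of the words.
function longestCommonSuffix(a, b) {
  let i = 0
  while (i < a.length && i < b.length && a[a.length - 1 - i] === b[b.length - 1 - i]) i++
  return a.slice(a.length - i)
}

longestCommonPrefix('aprineyanilantli', 'aboneyanomantli') // 'a'
longestCommonSuffix('aprineyanilantli', 'aboneyanomantli') // 'antli'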

It doesn't necessarily need to be efficient, as this is theoretically only going to run once, on the 200k list. But if it is efficient as well, that would be a bonus. Ideally, though, it shouldn't take hours to run, though I am not sure what's possible :)

Another example is this:

brevidentata
brevidentatum
brevidentatus
crassidentata
crassidentatum
crassidentatus

Here, the first 3 have a common prefix, brevidentat, and words 2-3 have the common prefix brevidentatu. But later (with human knowledge), we find identat is probably the infix we desire, and a/um/us are word-form suffixes. Also, we see that identat is an infix in both the crass... and brev... words. So the end result should be:

brev-
crass-
-identat-
-a
-us
-um

That, in theory, would be the ideal outcome. But you could also have this:

brev-
crass-
-identat-
-identata
-identatus
-identatum

That would also work, and we could do some simple filtering to filter those out later.

CodePudding user response:

Here is a simple approach, but it will probably take on the order of hours. Also, you could do it in JavaScript, but I'll take a generally Unixy approach that you could write in any language, because that is simple to think about.

First, let's take your file, add markers to the start/end of each word, and put spaces between the letters. So your example would become:

^ b r e v i d e n t a t a $
^ b r e v i d e n t a t u m $
^ b r e v i d e n t a t u s $
^ c r a s s i d e n t a t a $
^ c r a s s i d e n t a t u m $
^ c r a s s i d e n t a t u s $

This is our general representation: space-separated possible affixes, with the basic affixes being letters, begin, and end. Here we have, of course, found no affixes yet.
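
In JavaScript, that preprocessing step could be a few lines (a sketch; the file names all.csv and marked.txt are placeholders I'm making up):

const fs = require('fs')

// Turn each word into '^ l e t t e r s $' so that word begin/end
// become ordinary symbols like any letter.
const words = fs.readFileSync('all.csv', 'utf-8').trim().split(/\n+/)
const marked = words.map(word => ['^', ...word.trim().split(''), '$'].join(' '))
fs.writeFileSync('marked.txt', marked.join('\n'))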


Here is what a single affix search pass looks like.

Take our file, and create a tempfile of the distinct possible affix sections, each followed by the line number of the word it came from. (I say distinct so that if line 666 contains a b a b you don't get a b: 666 twice.) So our file starts off:

 ^ b: 1
 ^ b r: 1
 .
 .
 .
 ^ c r a s s i d e n t a t u s $: 6
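
That enumeration step might look something like this in JavaScript (a sketch, reusing the placeholder file names from above):

const fs = require('fs')

// Emit every distinct contiguous run of symbols in each line,
// tagged with the 1-based line number it came from.
const lines = fs.readFileSync('marked.txt', 'utf-8').trim().split('\n')
const out = []
lines.forEach((line, index) => {
  const symbols = line.split(' ')
  const seen = new Set()
  for (let start = 0; start < symbols.length; start++) {
    for (let end = start + 1; end <= symbols.length; end++) {
      const run = symbols.slice(start, end).join(' ')
      if (!seen.has(run)) {
        seen.add(run)
        out.push(`${run}: ${index + 1}`)
      }
    }
  }
})
fs.writeFileSync('tempfile', out.join('\n'))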

Next we sort the file (just use the Unix command LC_ALL=C sort tempfile > sortedtempfile; the LC_ALL forces an asciibetical sort). This generates sortedtempfile, which starts off:

 ^ b: 1
 ^ b: 2
 .
 .
 .
 ^ c r a s s i d e n t a t u s $: 6

Next run a custom command that, for each section appearing at least, say, 2 times, gives how many symbols you save by using it as an affix, followed by the affix, followed by the list of lines where it appears. This generates a file tempsaved that starts off:

 3: ^ b: 1 2 3
 9: ^ b r e: 1 2 3
 .
 .
 .
 16: v i d e n t a t u: 2 3
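
Here is one way that counting pass could look (a sketch; I'm assuming savings are counted as (symbols in the run − 1) × occurrences, i.e. each use collapses the run to a single symbol, which matches the worked numbers here):

const fs = require('fs')

// sortedtempfile has identical runs on adjacent lines, so a single
// linear scan can group them and total the savings.
const rows = fs.readFileSync('sortedtempfile', 'utf-8').trim().split('\n')
const out = []
let run = null
let lineNos = []

function flush() {
  if (run !== null && lineNos.length >= 2) {
    const saved = (run.split(' ').length - 1) * lineNos.length
    out.push(`${saved}: ${run}: ${lineNos.join(' ')}`)
  }
}

rows.forEach(row => {
  const [section, lineNo] = row.trim().split(': ')
  if (section !== run) {
    flush()
    run = section
    lineNos = []
  }
  lineNos.push(lineNo)
})
flush()
fs.writeFileSync('tempsaved', out.join('\n'))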

Now do sort -rn tempsaved > sortedtempsaved to sort from maximum savings down, so we find the biggest savings first. This file now starts off:

 36: ^ c r a s s i d e n t a t: 4 5 6
 33: ^ b r e v i d e n t a t: 1 2 3
 33: ^ c r a s s i d e n t a: 4 5 6

In the next step, we identify affixes until we encounter 2 candidates touching the same line number. Then go back to our original file and apply those. So in this pass we'd identify ^crassidentat and ^brevidentat, then produce a new file which contains:

^brevidentat a $
^brevidentat u m $
^brevidentat u s $
^crassidentat a $
^crassidentat u m $
^crassidentat u s $

Now repeat.
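
A rough sketch of that select-and-apply step (again with my placeholder file names; note the string replacement is naive, and a careful version should match whole symbols only):

const fs = require('fs')

// Greedily pick affixes from sortedtempsaved, stopping at the first
// candidate that claims a line some earlier affix already claimed.
const rows = fs.readFileSync('sortedtempsaved', 'utf-8').trim().split('\n')
const claimed = new Set()
const affixes = []

for (const row of rows) {
  const [, section, lineList] = row.trim().split(': ')
  const lineNos = lineList.split(' ')
  if (lineNos.some(n => claimed.has(n))) break
  lineNos.forEach(n => claimed.add(n))
  affixes.push(section)
}

// Apply the affixes: fuse each run into a single multi-letter symbol.
let text = fs.readFileSync('marked.txt', 'utf-8')
for (const affix of affixes) {
  text = text.split(affix).join(affix.split(' ').join(''))
}
fs.writeFileSync('marked.txt', text)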


In your example you'll wind up with the following set of affixes:

^crassidentat
^brevidentat
um$
us$
a$

If you added the words identata, identatum, and identatus to the original list, the same algorithm would generate the following list of affixes instead:

identat
^crass
^brev
um$
us$
a$

which is your stated ideal outcome.


My back-of-the-envelope estimate says that you should expect each pass to take several minutes. But we try to find lots of affixes per pass, so I wouldn't expect this to take more than a few dozen passes. Also, the list will need human review afterwards; I don't think there is much avoiding that.

CodePudding user response:

Prefixes and suffixes are easy with a trie. However, a trie won't help you with infixes.

Sample code for a trie (in Java, untested, incomplete):

import java.util.HashMap;
import java.util.Map;

class Node {
    private int cnt;
    private Map<Character, Node> children;

    Node() {
        cnt = 0;
        this.children = new HashMap<>();
    }

    Node(String s, int pos) {
        this();
        addChild(s, pos);
    }

    boolean isLeaf() {
        return this.children.size() == 0;
    }

    void addChild(String s, int pos) {
        if (pos == s.length()) {
            return;
        }

        char c = s.charAt(pos);
        if (children.containsKey(c)) {
            children.get(c).addChild(s, pos + 1);
        } else {
            children.put(c, new Node(s, pos + 1));
        }
        cnt++; // one more word passes through this node
    }

    void removeChild(char c) {
        int ccnt = 0;
        Node child = children.remove(c);
        if (child != null) {
            ccnt = child.cnt;
        }
        cnt -= ccnt;
    }

    // other methods as necessary for traversal/value lookup...
}

class Solution {
    private Node preroot = new Node();
    private Node sufroot = new Node();

    void addWord(String s) {
        preroot.addChild(s, 0);
        sufroot.addChild(new StringBuilder(s).reverse().toString(), 0);
    }

    void findPrefixes(int minOccur) {
        // standard tree traversal on preroot,
        // starting at the left-most leaf.
        // when it finds a non-leaf with cnt >= minOccur
        // output all permutations and remove the child.
    }
}
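
Translated to plain JavaScript (since the question asks for it), the traversal sketched in findPrefixes might look roughly like this, assuming trie nodes shaped like { cnt, children } where cnt counts the words passing through the node:

// Collect every maximal prefix shared by at least minOccur words.
function findPrefixes(node, minOccur, prefix = '', found = []) {
  for (const [letter, child] of Object.entries(node.children)) {
    if (child.cnt < minOccur) continue
    // Keep extending while enough words still share the run; emit
    // the prefix once no child can continue it.
    const extendable = Object.values(child.children)
      .some(grandchild => grandchild.cnt >= minOccur)
    if (extendable) {
      findPrefixes(child, minOccur, prefix + letter, found)
    } else {
      found.push(prefix + letter)
    }
  }
  return found
}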

Infixes

The problem with infixes is that you don't know where to start. I.e., take the strings abcdefgh and pppbcdefgzzzz, which have the common infix bcdefg. Furthermore, how about abcdefgh and pppabcdefgzzz?

To solve this, you'll basically need to chop each word into all of its possible constituents, each pointing back to the word. Then iterate through the list of chops, sorted by descending length, and remove all entries associated with "used" words.

I.e., abc would become the lookup entries: abc, ab, bc, a, b, c. Then a lookup table would look like:

Word association to symbols:

{abc -> {abc, ab, bc, a, b, c}}

Map:

{abc -> { abc }}
{ab -> { abc }}
{bc -> { abc }}
{a -> { abc }}
{b -> { abc }}
{c -> { abc }}

When we add bcd, which adds the symbols bcd, bc, cd, b, c, d, the word association is added and the lookup table gets updated:

{abc -> { abc }}
{bcd -> { bcd }}
{ab -> { abc }}
{bc -> { abc, bcd }}
{cd -> { bcd }}
{a -> { abc }}
{b -> { abc, bcd }}
{c -> { abc, bcd }}
{d -> { bcd }}

Then use the length of the key to dictate the sort order of the map. Starting from the top (longest keys first), navigate down until a key with the minimum number of occurrences is reached, then take the words in that key's list and remove those words from the construct. Removing a word from the map uses the word association saved earlier to look up the keys in the symbols map.
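
A condensed JavaScript sketch of that construct (the function names and the minOccur parameter are mine, just for illustration):

// Build the word -> chops and chop -> words maps described above.
function buildIndex(words) {
  const wordToChops = new Map()
  const chopToWords = new Map()
  for (const word of words) {
    const chops = new Set()
    for (let start = 0; start < word.length; start++) {
      for (let end = start + 1; end <= word.length; end++) {
        chops.add(word.slice(start, end))
      }
    }
    wordToChops.set(word, chops)
    for (const chop of chops) {
      if (!chopToWords.has(chop)) chopToWords.set(chop, new Set())
      chopToWords.get(chop).add(word)
    }
  }
  return { wordToChops, chopToWords }
}

// Walk chops longest-first; when a chop is shared by at least
// minOccur words, keep it and retire those words from the index.
function findInfixes(words, minOccur = 2) {
  const { wordToChops, chopToWords } = buildIndex(words)
  const sorted = [...chopToWords.keys()].sort((a, b) => b.length - a.length)
  const found = []
  for (const chop of sorted) {
    const owners = chopToWords.get(chop)
    if (owners.size < minOccur) continue
    found.push(chop)
    for (const word of [...owners]) {
      for (const c of wordToChops.get(word)) {
        chopToWords.get(c).delete(word)
      }
    }
  }
  return found
}

findInfixes(['abcdefgh', 'pppbcdefgzzzz']) // [ 'bcdefg' ]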
