How can color names be more accurately recognised and extracted from strings?

The approach I use to recognise and extract colour names despite slight variations or misspellings in text may be naive. At first glance it also works better in English than in German, but the challenges seem to be roughly the same in both languages.

Different spellings such as grey/gray or weiß/weiss: to a human the variants look nearly identical, but according to word2vec, grey and green are more similar than grey and gray.
Colours not yet known or not available in colors, in the following case brown. It may not be the best example, but perhaps the colour can be deduced from the context of the sentence, just as a human reader gets the idea that it could be a colour.

Both cases could presumably be covered by extending the vocabulary with many more colour names. But anticipating such variants without knowing about them in the first place seems difficult.

Does anyone see another parameter to tune, or even a completely different procedure, that could achieve better results?

from collections import Counter
from math import sqrt
import pandas as pd

#list of known colors
colors = ['red','green','yellow','black','gray']

#dict or dataframe of sentences that contains color/s or not
df = pd.DataFrame({
    'id':[1,2,3,4],
    'text':['grey donkey with black mane',
    'brown dog with sharp teeth',
    'red cat with yellowish / seagreen glowing eyes',
    'proud rooster with red comb']
    }
)

#create a character-count vector of the word (a bag of characters, not a trained word2vec embedding)
def word2vec(word):
    cw = Counter(word)
    sw = set(cw)
    lw = sqrt(sum(c*c for c in cw.values()))
    return cw, sw, lw

#cosine similarity between word and color (1.0 means identical character counts)
def cosdis(v1, v2):
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]


df['color_matches'] = [[(w,round(cd, 2),c) for w in s.split() for c in colors if (cd:=cosdis(word2vec(c), word2vec(w))) >= 0.85] for s in df.text]

	id	text	color_matches
0	1	grey donkey with black mane	[('black', 1.0, 'black')]
1	2	brown dog with sharp teeth	[]
2	3	red cat with yellowish / seagreen glowing eyes	[('red', 1.0, 'red'), ('yellowish', 0.85, 'yellow'), ('seagreen', 0.91, 'green')]
3	4	proud rooster with red comb	[('red', 1.0, 'red')]

CodePudding user response：

Your best strategy is to begin with a list of colors prepared ahead of time. Colors include adjectives whose definition contains the word "color". I say "include" because this doesn't cover all cases. The kind of edge case that can trip you up would be something like

"The yellowish jacket matched his yellow shoes".

This has the problem that "yellowish" is defined in the Oxford dictionary as:

adjective: having a yellow tinge; slightly yellow.

Now you can do a little bit of recursion here:

First colors are adjectives which contain the word "color" in their definition

Second colors are adjectives which contain a <first color> in their definition

Third colors are adjectives which contain a <second color> in their definition.

etc...

Mining this from a dictionary data set lets you scoop up as many colors as possible. You might need to be a little careful here, though, and only select adjectives whose definitions include a phrase of the form: adverb color_of_lower_rank.
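The ranking idea above can be sketched roughly as follows. This is only an illustration: TOY_DICTIONARY and mine_colors are hypothetical names, the definitions are paraphrased stand-ins for a real dictionary data set, and the stricter "adverb color_of_lower_rank" check is left out for brevity.

```python
import re

# Hypothetical toy dictionary: word -> (part of speech, definition).
# A real run would load a proper dictionary data set instead.
TOY_DICTIONARY = {
    "yellow": ("adjective", "of the color between green and orange in the spectrum"),
    "blue": ("adjective", "of a color intermediate between green and violet"),
    "yellowish": ("adjective", "having a yellow tinge; slightly yellow"),
    "bluish": ("adjective", "having a blue tinge; somewhat blue"),
    "sharp": ("adjective", "having an edge or point that is able to cut"),
    "jacket": ("noun", "an outer garment, often in a bright color"),
}


def mine_colors(dictionary, seed="color", max_rank=5):
    """Return {word: rank}. Rank 1 adjectives mention the seed word in
    their definition; rank n+1 adjectives mention a rank n word."""
    # Tokenize every adjective definition once.
    adjectives = {
        word: set(re.findall(r"[a-z]+", definition.lower()))
        for word, (pos, definition) in dictionary.items()
        if pos == "adjective"
    }
    ranks = {}
    frontier = {seed}
    for rank in range(1, max_rank + 1):
        found = {
            word
            for word, tokens in adjectives.items()
            if word not in ranks and tokens & frontier
        }
        if not found:
            break
        for word in found:
            ranks[word] = rank
        frontier = found  # the next rank must mention one of these words
    return ranks


color_ranks = mine_colors(TOY_DICTIONARY)
```

With this toy data, "yellow" and "blue" end up as first colors and "yellowish" and "bluish" as second colors, while "sharp" and the noun "jacket" are skipped.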

Once you have a set of colors, compound colors such as "blue-green" become tractable. There are also ill-defined colors such as "royal blue". Parsing these is more difficult because you need to know whether "royal" refers to the blue OR to the object, for example:

"The prince's royal blue cloak was beautiful"

The royal property here has to do with the fact that it's a prince's cloak.

"The shirt was a beautiful royal blue".

Here you can just imagine a shirt that is colored a beautiful shade of blue that you consider "royal blue".

So in general parsing adverb-adjective phrases can get a bit complex.
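As a rough illustration of the tractable part, here is a minimal sketch that matches hyphenated compounds and known two-word color names against a lexicon. BASE_COLORS, MULTIWORD_COLORS and extract_color_phrases are hypothetical names, and the lexicon contents are made up for the example. It does not attempt to resolve the "royal blue" ambiguity described above: it always reads the pair as a color name.

```python
import re

# Hypothetical lexicons; in practice these would come from a curated list.
BASE_COLORS = {"blue", "green", "red", "yellow", "black"}
MULTIWORD_COLORS = {("royal", "blue"), ("sea", "green")}


def extract_color_phrases(text):
    """Return color phrases found in text, preferring two-word names
    over single tokens (a greedy longest match)."""
    # Keep hyphenated compounds such as "blue-green" as one token.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD_COLORS:
            matches.append(" ".join(pair))
            i += 2
            continue
        # A compound counts only if every hyphen-separated part is a color.
        parts = tokens[i].split("-")
        if all(part in BASE_COLORS for part in parts):
            matches.append(tokens[i])
        i += 1
    return matches


shirt = extract_color_phrases("The shirt was a beautiful royal blue")
scarf = extract_color_phrases("A blue-green scarf and a red hat")
cloak = extract_color_phrases("The prince's royal blue cloak was beautiful")
```

Note that the cloak sentence also yields "royal blue", which shows why deciding what "royal" modifies needs more than a lexicon lookup.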

CodePudding user response：

The space of color names is so small, compared to all words, that I'd suggest you mainly focus on hand-curated lists of colors.

You could initially extract such a lexicon from reference materials, whether general references (like WordNet) or domain-specific documentation (like, say, the HTML Color Names, even if you limit yourself to the subset of one-word names).
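A minimal sketch of such a lexicon lookup, assuming a tiny hand-written list in place of a real reference source. CANONICAL, VARIANTS and normalize_color are hypothetical names, and the fuzzy fallback via difflib from the standard library is an addition for catching misspellings, not something prescribed here.

```python
from difflib import get_close_matches

# Hypothetical hand-curated lexicon; a real one would be much longer.
CANONICAL = sorted({"red", "green", "yellow", "black", "gray", "brown", "white"})
# Known spelling variants mapped onto one canonical form.
VARIANTS = {"grey": "gray"}


def normalize_color(word, cutoff=0.8):
    """Map a word to a canonical color name, or return None."""
    w = word.lower()
    if w in CANONICAL:
        return w
    if w in VARIANTS:
        return VARIANTS[w]
    # Order-aware fuzzy match for misspellings; the cutoff is a guess
    # that would need tuning on real data.
    close = get_close_matches(w, CANONICAL, n=1, cutoff=cutoff)
    return close[0] if close else None


grey_result = normalize_color("grey")
typo_result = normalize_color("yelow")
other_result = normalize_color("donkey")
```

Explicit variant entries handle pairs like grey/gray deterministically, so the fuzzy step is only a fallback for unseen misspellings.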

If you then need to expand your fixed-list to include other novel colors, I'd expect that the words very-close to known-colors, in a suitably well trained word2vec model, are likely to be other, and related, colors. You'd probably need some manual review, but again: the space of colors isn't that large, so a manual process seems reasonable.
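The expansion step might look like the sketch below. To keep it runnable without a model file, the neighbor lookup is passed in as a function and a stub with made-up similarity scores stands in for a real model. With gensim, one would presumably pass something like `lambda w, n: kv.most_similar(w, topn=n)` for a loaded `KeyedVectors` object `kv`; that wiring is an assumption and is not shown. propose_new_colors and fake_neighbors are hypothetical names.

```python
def propose_new_colors(known_colors, neighbors_fn, topn=10, min_similarity=0.6):
    """Collect words close to known colors as candidates for manual review.

    neighbors_fn(word, topn) must return a list of (word, similarity) pairs.
    """
    candidates = {}
    for color in known_colors:
        for word, score in neighbors_fn(color, topn):
            if word not in known_colors and score >= min_similarity:
                # Keep the best score seen for each candidate.
                candidates[word] = max(score, candidates.get(word, 0.0))
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)


# Stub neighbor data with invented scores, for illustration only.
FAKE_NEIGHBORS = {
    "gray": [("grey", 0.9), ("brown", 0.7), ("dull", 0.5)],
    "red": [("crimson", 0.8), ("brown", 0.65), ("gray", 0.6)],
}


def fake_neighbors(word, topn):
    return FAKE_NEIGHBORS.get(word, [])[:topn]


proposals = propose_new_colors({"red", "gray"}, fake_neighbors)
```

The output is a ranked candidate list, not a final lexicon, which matches the manual review step suggested above.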

Your comment that word2vec placed green closer to grey than gray surprises me. Are you sure you weren't using some severely-impoverished word2vec model, far undertrained on insufficient data, or poorly-parameterized? Placing variant-spellings of the same concept near each other is something a word2vec model, trained on sufficient data, usually does very very well. Have you tried looking at the nearest-neighbors of known color-names in a large, sufficiently-trained word2vec model, like the large 3M word GoogleNews model released by Google circa 2013?

(Note: no toy-sized example, like the 4-text, ~25-word dataset shown in your code, will show much that is useful from word2vec-style algorithms, which require many subtly-contrasting realistic uses of words in context to generate good vectors for those words.)