ElasticSearch keyword match vector

Time:11-07

First of all, sorry if this is a basic question; I'm very new to Elasticsearch. Here's what I need to do: I have an array of keywords, and I need to search for them in every document of an index. Here's the mapping:

{
  "resumes": {
    "mappings": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Given this mapping, I need to check every document in the resumes index against every word in the keyword array. For each document, the result should be a vector with one entry per keyword: 1 if the word was found in the document and 0 if it was not.

For example:

keywords = ["javascript", "html", "python"]
doc1 = "Hello there, I've only programmed in python."
doc2 = "Hello there, I've only programmed in python and javascript."
doc3 = "Hello there, I've only programmed in python and javascript. Im now learning html"

Search results would be something like:

{
  "doc1": [0, 0, 1], // because it contains the word python
  "doc2": [1, 0, 1], // because it contains both python and javascript
  "doc3": [1, 1, 1]  // because it contains all words in the keyword vector
}
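
To pin down the expected output, here is a minimal sketch in plain Python that computes the same vectors for the three example documents. The function name keyword_vector and the regex-based tokenization are assumptions made for illustration; the tokenization only roughly approximates what Elasticsearch's standard analyzer does.

```python
import re

def keyword_vector(text, keywords):
    """Return a 0/1 list: 1 if the keyword occurs as a whole word in text."""
    # Lowercase and split on non-word characters. This is a rough stand-in
    # for Elasticsearch's standard analyzer, not an exact replica.
    tokens = set(re.findall(r"\w+", text.lower()))
    return [1 if kw.lower() in tokens else 0 for kw in keywords]

keywords = ["javascript", "html", "python"]
docs = {
    "doc1": "Hello there, I've only programmed in python.",
    "doc2": "Hello there, I've only programmed in python and javascript.",
    "doc3": "Hello there, I've only programmed in python and javascript. Im now learning html",
}

# One vector per document, in the same order as the keywords list.
vectors = {name: keyword_vector(text, keywords) for name, text in docs.items()}
print(vectors)
```

This is only a reference for what the result should look like; it says nothing about how efficiently Elasticsearch could produce it.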

Is it even possible to do this with Elasticsearch alone? I'm coding all this in Python, but I suspect that building these vectors in Python itself would be far less efficient than having Elasticsearch do it.

I haven't tried much yet, since I don't know Elasticsearch's capabilities well. I've searched a lot, but I don't even know where to start.

User response:

Using scripts in Elasticsearch is generally discouraged because they perform poorly. I managed to do what you want, but be warned that this approach may cause performance issues.

In the response, the script field "vector_field" holds the vector for each document.

POST idx_teste/_doc 
{
  "description": "Hello there, I've only programmed in python."
}

POST idx_teste/_doc 
{
  "description": "Hello there, I've only programmed in python and javascript."
}

POST idx_teste/_doc 
{
  "description": "Hello there, I've only programmed in python and javascript. Im now learning html"
}

GET idx_teste/_search
{
  "_source": "*", 
  "query": {
    "terms": {
      "description": [
        "javascript","html","python"
      ]
    }
  }, 
   "script_fields": {
     "vector_field": {
       "script": {
         "source":  """
              def vector = new ArrayList();
              for (int i = 0; i < params.keywords.size(); i++) {
                String text = doc['description.keyword'].value;
                if(text.contains(params.keywords[i])) {
                  vector.add(1);
                } else {
                  vector.add(0);
                }
              }
              return vector;
          """,
          "params": {
            "keywords" :[
                "javascript","html","python"
              ]
          }
       }
     }
   }
}
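
Since the question mentions Python, the request body above could be assembled from a keyword list roughly as follows. This is a sketch, not part of the original answer: the helper name build_vector_search_body and its parameters are hypothetical, and the field names default to those used in the example index. Sending the body to Elasticsearch is left to whichever client is in use.

```python
import json

# Painless script: one 0/1 entry per keyword, based on a substring check
# against the raw (un-analyzed) keyword sub-field.
PAINLESS_SOURCE = """
String text = doc[params.field].value;
def vector = new ArrayList();
for (int i = 0; i < params.keywords.size(); i++) {
  vector.add(text.contains(params.keywords[i]) ? 1 : 0);
}
return vector;
"""

def build_vector_search_body(keywords, text_field="description",
                             keyword_field="description.keyword",
                             script_field="vector_field"):
    """Build a search body with a terms query and a 0/1 vector script field.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    keywords = list(keywords)
    return {
        "_source": "*",
        # Only documents containing at least one keyword are returned.
        "query": {"terms": {text_field: keywords}},
        "script_fields": {
            script_field: {
                "script": {
                    "source": PAINLESS_SOURCE,
                    "params": {"field": keyword_field, "keywords": keywords},
                }
            }
        },
    }

body = build_vector_search_body(["javascript", "html", "python"])
print(json.dumps(body, indent=2))
```

Two caveats apply to the script as written. The terms query does not analyze its input, so the keywords should be lowercase to match the tokens of a default text field. The script's contains call is a case-sensitive substring check on the raw string, so a document containing "Python" would match the query but get a 0, and a keyword like "java" would also match inside "javascript".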

Response:

"hits": [
      {
        "_index": "idx_teste",
        "_id": "oIyiQ4QBgXg8h_rc0Ny3",
        "_score": 1,
        "_source": {
          "description": "Hello there, I've only programmed in python."
        },
        "fields": {
          "vector_field": [
            0,
            0,
            1
          ]
        }
      },
      {
        "_index": "idx_teste",
        "_id": "oYypQ4QBgXg8h_rcH9wU",
        "_score": 1,
        "_source": {
          "description": "Hello there, I've only programmed in python and javascript."
        },
        "fields": {
          "vector_field": [
            1,
            0,
            1
          ]
        }
      },
      {
        "_index": "idx_teste",
        "_id": "ooypQ4QBgXg8h_rcJ9y2",
        "_score": 1,
        "_source": {
          "description": "Hello there, I've only programmed in python and javascript. Im now learning html"
        },
        "fields": {
          "vector_field": [
            1,
            1,
            1
          ]
        }
      }
    ]
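
To get from this response to the per-document mapping asked for in the question, the hits can be reduced to a dictionary on the Python side. A minimal sketch, assuming the full search response nests the array shown above under response["hits"]["hits"] (the snippet shows only the inner array); the function name vectors_by_id is hypothetical.

```python
def vectors_by_id(response, script_field="vector_field"):
    """Map each hit's _id to the vector stored in its script field."""
    return {
        # Hits without the script field fall back to an empty list.
        hit["_id"]: hit.get("fields", {}).get(script_field, [])
        for hit in response["hits"]["hits"]
    }

# Trimmed-down copy of the response shown above (only the keys used here).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "oIyiQ4QBgXg8h_rc0Ny3", "fields": {"vector_field": [0, 0, 1]}},
            {"_id": "oYypQ4QBgXg8h_rcH9wU", "fields": {"vector_field": [1, 0, 1]}},
            {"_id": "ooypQ4QBgXg8h_rcJ9y2", "fields": {"vector_field": [1, 1, 1]}},
        ]
    }
}

result = vectors_by_id(sample_response)
print(result)
```

Note that documents matching none of the keywords are filtered out by the terms query, so they will not appear in the mapping at all rather than appearing with an all-zero vector.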