Significant text aggregation on a text field with stop-word filtering


I'm trying to find the most frequent words in the text field of my index (the field is called "text"). I've managed to use the significant_text aggregation to do this, but some of the buckets returned contain words like "the", "a", "they", etc. How can I filter those out? I tried using a stop word analyzer, but it didn't help. I also tried the "gnd" significance heuristic, which is said to help with this problem, but I still got about the same results.

My query:

GET feed/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "by_sentiment": {
        "terms": {
            "field": "sentiment.Sentiment.keyword",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true,
                }
            }
        }
    },
    "by_level": {
        "terms": {
            "field": "level",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true,
                }
            }
        }
    },
    "by_asset": {
        "terms": {
            "field": "asset_id",
            "size": 50
        },
        "aggs": {
            "trending_topics": {
                "significant_text": {
                    "field": "text",
                    "filter_duplicate_text": true,
                }
            }
        }
    }
  }
}
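For reference, the "gnd" attempt mentioned above just attaches the heuristic inside one of the significant_text blocks. A rough sketch of a single sub-aggregation (same field names as in the query; the empty object uses the heuristic's defaults):

"trending_topics": {
    "significant_text": {
        "field": "text",
        "filter_duplicate_text": true,
        "gnd": {}
    }
}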

CodePudding user response:

I managed to do this by adding an

"exclude": ["list","of","stop","words"]

to each "significant_text" aggregation. This is the exact list I used, for anyone who is interested:

"exclude": ["t.co", "https", "rt", "l", "they", "i", "I", "you", "this", "that", "but", "its", "s", "for", "there", "going", "try", "into", "me", "don’t", "every", "because", "got", "thank", "thanks", "looks", "cha", "been", "would", "my", "from", "now", "and", "im", "mine", "u", "the", "to", "can't", "than", "cant", "in", "self", "of", "with", "your", "is", "do", "not", "ii", "despite", "however", "there's", "isn't", "seems", "though", "a", "via", "will", "also", "that's", "even", "we", "anymore", "anyone", "all", "have", "on", "if", "sure", "as", "at", "are", "it", "so", "be", "are", "everyone", "just", "can", "by", "what", "does", "please", "an", "these", "de", "how", "he", "haha", "were", "us", "should", "when", "or", "o", "another", "those", "am", "yourselves", "don't", "without", "then", "gotta", "myself", "we'll", "our", "we've", "www.reddit.com", "know", "number", "which", "while", "name", "comments", "up", "you're", "seem", "isn't", "being", "them", "ha", "perhaps", "about", "has", "each", "something", "haven't", "their", "t.me", "r", "est", "la", "le", "vous", "et", "à", "les", "pour", "avec", "el", "en", "que", "para", "no"]