Home > database >  ElasticSearch: How to query by multiple conditions in different locations?
ElasticSearch: How to query by multiple conditions in different locations?

Time:07-14

I've been trying to build this ElasticSearch Query on the Danish CVR database API so far without success. Basically I'm trying to find companies where

  1. The company has a relationship with "deltager" (participant) with "enhedsNummer" (ID) equal NUMBER
  2. The relationship is still active, i.e. the "end of period" field is null

How do I construct a query that has multiple conditions like this?

    'query': {
        'bool': {
            'must': [
                {
                    'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
        AND            
                    'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
                },
            ],
        },
    },
}

FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping

CodePudding user response:

You can try:

GET /cvr-permanent/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": {
              "value": "your_value_here"
        }
      }
    }
  ],
  "must_not": [
    {
      "exists": {
        "field": "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
       }
     }
   ]
  }
 }
}

Trick here is to use must_not/exist for nil values.

P.S. I cannot check it because it requires authorisation.

CodePudding user response:

It doesn't appear like ElasticSearch Queries are as dynamic as I had wanted (or I don't know how use them). Instead, it appears that the Python code below is the best choice for generating the desired outcome:

import requests
import pandas as pd

# creation of empty lists:
virksomhedsnavne = []
virksomhedscvr = []
relation_fra = []
relation_til = []

# Pulling data (apparently limited to 3000 elements at a time):
for i in range(20):

    if i == 0:
        highestcvrnummer = 0
    else:
        highestcvrnummer = max(virksomhedscvr)
    
    headers = {
        'Content-Type': 'application/json',
    }

    json_data = {

        "_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],

        "sort" : [{"Vrvirksomhed.cvrNummer" : {"order":"asc"}}],
      
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": "some_value"
                        }
                    },
                    {
                        "range":{
                            "Vrvirksomhed.cvrNummer": {
                                "gt": highestcvrnummer
                            }
                        
                        }
                    }
                ]
            }
        },
        'size': 3000
    }

    response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search', headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
    json_data = response.json()['hits']['hits']

    # Aggregate and format data neatly
    for data in json_data:
        virksomhed_data = data['_source']['Vrvirksomhed']
        virksomhedscvr.append(virksomhed_data['cvrNummer'])
        try:
            virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
        except:
            virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
        
        # Loop through all "deltagere" and find match with value
        for relation in virksomhed_data['deltagerRelation']:

            # If match found
            if relation['deltager']['enhedsNummer'] == some_value:
                
                # Make sure most recent period is chosen
                antalopdateringer = len(relation['organisationer'])-1
                relation_gyldig = relation['organisationer'][antalopdateringer]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
                relation_fra.append(relation_gyldig['gyldigFra'])
                relation_til.append(relation_gyldig['gyldigTil'])
                break

#export to excel
dict = {'CVR nummer':virksomhedscvr, 'navn':virksomhedsnavne, 'Relation fra':relation_fra, 'Relation til':relation_til}
df = pd.DataFrame(dict)
df.to_excel("output.xlsx")

If anyone else is working with the Danish CVR register's API, I hope this helps!

Also, if you find a better solution, please let me know :)

  • Related