I've been trying to build this ElasticSearch Query on the Danish CVR database API so far without success. Basically I'm trying to find companies where
- The company has a relationship with "deltager" (participant) with "enhedsNummer" (ID) equal NUMBER
- The relationship is still active, i.e. the "end of period" field is null
How do I construct a query that has multiple conditions like this?
'query': {
'bool': {
'must': [
{
'term': {'Vrvirksomhed.deltagerRelation.deltager.enhedsNummer': NUMBER},
AND
'term': {'Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil': null}
},
],
},
},
}
FYI: database mapping may be found at http://distribution.virk.dk/cvr-permanent/_mapping
CodePudding user response:
You can try:
GET /cvr-permanent/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": {
"value": "your_value_here"
}
}
}
],
"must_not": [
{
"exists": {
"field": "Vrvirksomhed.deltagerRelation.organisationer.attributter.vaerdier.periode.gyldigTil"
}
}
]
}
}
}
Trick here is to use must_not/exist
for nil values.
P.S. I cannot check it because it requires authorisation.
CodePudding user response:
It doesn't appear like ElasticSearch Queries are as dynamic as I had wanted (or I don't know how use them). Instead, it appears that the Python code below is the best choice for generating the desired outcome:
import requests
import pandas as pd
# creation of empty lists:
virksomhedsnavne = []
virksomhedscvr = []
relation_fra = []
relation_til = []
# Pulling data (apparently limited to 3000 elements at a time):
for i in range(20):
if i == 0:
highestcvrnummer = 0
else:
highestcvrnummer = max(virksomhedscvr)
headers = {
'Content-Type': 'application/json',
}
json_data = {
"_source": ["Vrvirksomhed.cvrNummer", "Vrvirksomhed.navne", "Vrvirksomhed.virksomhedMetadata.nyesteNavn.navn", "Vrvirksomhed.deltagerRelation"],
"sort" : [{"Vrvirksomhed.cvrNummer" : {"order":"asc"}}],
"query": {
"bool": {
"must": [
{
"term": {
"Vrvirksomhed.deltagerRelation.deltager.enhedsNummer": "some_value"
}
},
{
"range":{
"Vrvirksomhed.cvrNummer": {
"gt": highestcvrnummer
}
}
}
]
}
},
'size': 3000
}
response = requests.post('http://distribution.virk.dk/cvr-permanent/virksomhed/_search', headers=headers, json=json_data, auth=('USERNAME', 'PASSWORD'))
json_data = response.json()['hits']['hits']
# Aggregate and format data neatly
for data in json_data:
virksomhed_data = data['_source']['Vrvirksomhed']
virksomhedscvr.append(virksomhed_data['cvrNummer'])
try:
virksomhedsnavne.append(virksomhed_data['virksomhedMetadata']['nyesteNavn']['navn'])
except:
virksomhedsnavne.append(virksomhed_data['navne'][0]['navn'])
# Loop through all "deltagere" and find match with value
for relation in virksomhed_data['deltagerRelation']:
# If match found
if relation['deltager']['enhedsNummer'] == some_value:
# Make sure most recent period is chosen
antalopdateringer = len(relation['organisationer'])-1
relation_gyldig = relation['organisationer'][antalopdateringer]['medlemsData'][0]['attributter'][0]['vaerdier'][0]['periode']
relation_fra.append(relation_gyldig['gyldigFra'])
relation_til.append(relation_gyldig['gyldigTil'])
break
#export to excel
dict = {'CVR nummer':virksomhedscvr, 'navn':virksomhedsnavne, 'Relation fra':relation_fra, 'Relation til':relation_til}
df = pd.DataFrame(dict)
df.to_excel("output.xlsx")
If anyone else is working with the Danish CVR register's API, I hope this helps!
Also, if you find a better solution, please let me know :)