I'm trying to update a string field in Elasticsearch using a Painless script that extracts a value from another field with a regex. The script is invoked from Python, e.g.:
es.update_by_query(index='testrss', query=qry, script=scr)
In my example the qry filter returns only 1 record with the following value:
{'body_text': "Purpose prong invitations Homely wine pocketses\nSOURCE: THE NY TIMES, NEW YORK\nReaches stealing jambags Azog pull ask" }
I want to extract THE NY TIMES, NEW YORK into a new field testxy.
To test with a known-good scr input, the following example works fine:
scr = {
"lang": "painless",
"source": "ctx._source.testxy = /[aeiou]/.matcher(ctx._source.body_text).replaceAll('')"
}
...updating testxy to this:
{
...
'_source': {'testxy': 'Prps prng nvttns Hmly wn pcktss\nSOURCE: THE NY TIMES, NEW YORK\nRchs stlng jmbgs Azg pll sk',
...
}
However, regex string extraction is failing:
scr = {
"lang": "painless",
"source": "ctx._source.testxy = /SOURCE.*?\n/.matcher(ctx._source.body_text).group(1)"
}
errors with:
---------------------------------------------------------------------------
BadRequestError Traceback (most recent call last)
/var/folders/8l/d9m87qtx2yn1bc86txmr30wh0000gn/T/ipykernel_57473/2559631365.py in <module>
----> 1 es.update_by_query(index='testrss', query=qry, script=scr)
/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/_sync/client/utils.py in wrapped(*args, **kwargs)
412 pass
413
--> 414 return api(*args, **kwargs)
415
416 return wrapped # type: ignore[return-value]
/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/_sync/client/__init__.py in update_by_query(self, index, allow_no_indices, analyze_wildcard, analyzer, conflicts, default_operator, df, error_trace, expand_wildcards, filter_path, from_, human, ignore_unavailable, lenient, max_docs, pipeline, preference, pretty, query, refresh, request_cache, requests_per_second, routing, script, scroll, scroll_size, search_timeout, search_type, slice, slices, sort, stats, terminate_after, timeout, version, version_type, wait_for_active_shards, wait_for_completion)
4715 if __body is not None:
4716 __headers["content-type"] = "application/json"
-> 4717 return self.perform_request( # type: ignore[return-value]
4718 "POST", __path, params=__query, headers=__headers, body=__body
4719 )
/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/_sync/client/_base.py in perform_request(self, method, path, params, headers, body)
319 pass
320
--> 321 raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
322 message=message, meta=meta, body=resp_body
323 )
BadRequestError: BadRequestError(400, 'script_exception', 'compile error')
I've also tried:
scr = {
"lang": "painless",
"source": "Pattern p = Pattern.compile(\"SOURCE\"); Matcher m = p.matcher(ctx._source.body_text); ctx._source.testxy = m.group(1)"
}
..which also fails. Any idea what I'm doing wrong?
Edit. Error from running this in Dev Tools console:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"java.base/java.util.regex.Matcher.group(Matcher.java:644)",
"ctx._source.testxy = /SOURCE/.matcher(ctx._source.body_text).group(1)",
" ^---- HERE"
],
"script" : "ctx._source.testxy = /SOURCE/.matcher(ctx._source.body_text).group(1)",
"lang" : "painless",
"position" : {
"offset" : 60,
"start" : 0,
"end" : 69
}
}
],
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"java.base/java.util.regex.Matcher.group(Matcher.java:644)",
"ctx._source.testxy = /SOURCE/.matcher(ctx._source.body_text).group(1)",
" ^---- HERE"
],
"script" : "ctx._source.testxy = /SOURCE/.matcher(ctx._source.body_text).group(1)",
"lang" : "painless",
"position" : {
"offset" : 60,
"start" : 0,
"end" : 69
},
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "No match found"
}
},
"status" : 400
}
Confusing: "No match found", yet I can remove the target text with /SOURCE.*?\\n/.matcher(ctx._source.body_text).replaceAll('').
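One possible reason the Python call reports a compile error while the Dev Tools version gets as far as a runtime error: in a normal Python string literal, "\n" is a real newline character, so the script sent to Elasticsearch would contain a line break inside the /.../ regex literal. That is an assumption about the cause, not something the error output confirms. The snippet below only shows how Python treats the three spellings; the variable names are made up for illustration.

```python
# Hypothetical variable names; this only demonstrates Python string escaping.
single = "/SOURCE.*?\n/"    # Python converts \n into a real newline character
double = "/SOURCE.*?\\n/"   # Python keeps a backslash followed by the letter n
raw = r"/SOURCE.*?\n/"      # raw string: same characters as `double`

# repr() makes the difference visible
print(repr(single))
print(repr(double))
print(double == raw)
```

If this is the cause, writing "\\n" (or using a raw string) in the Python source should pass the two-character regex escape through to Painless, as in the replaceAll example above.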
CodePudding user response:
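Found the solution here. You have to call matcher.find() or matcher.matches() before you can invoke .group(). A Matcher does not attempt a match when it is created, so group() has nothing to return until one of those calls has succeeded, hence the "No match found" error. replaceAll() works without this because it runs the search itself. Note also that group(1) needs a capturing group in the pattern, whereas group(0) returns the whole match.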
scr = {
"lang": "painless",
"source": "Matcher m = /(?<=SOURCE:).*?(?=\\n)/.matcher(ctx._source.body_text); boolean b = m.find(); ctx._source.testxy = m.group(0)"
}
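As a rough sanity check, the same pattern can be tried locally with Python's re module before running update_by_query. This is only a sketch: Python's regex engine is not Java's, although this particular pattern uses syntax both support. Two things here are additions rather than part of the script above: a space after "SOURCE:" in the lookbehind, so the extracted value has no leading space, and an if/else around m.find() that marks the document as a no-op when nothing matches instead of throwing.

```python
import re

# Sample document value from the question
body_text = (
    "Purpose prong invitations Homely wine pocketses\n"
    "SOURCE: THE NY TIMES, NEW YORK\n"
    "Reaches stealing jambags Azog pull ask"
)

# Lookbehind includes the trailing space (an addition) to avoid a leading space
pattern = re.compile(r"(?<=SOURCE: ).*?(?=\n)")

# re.search plays the role of Matcher.find(): it must succeed before group()
match = pattern.search(body_text)
extracted = match.group(0) if match else None
print(extracted)  # THE NY TIMES, NEW YORK

# Guarded variant of the Painless script; "\\n" keeps the regex escape intact
scr = {
    "lang": "painless",
    "source": (
        "Matcher m = /(?<=SOURCE: ).*?(?=\\n)/.matcher(ctx._source.body_text); "
        "if (m.find()) { ctx._source.testxy = m.group(0); } "
        "else { ctx.op = 'noop'; }"
    ),
}
```

The resulting scr can be passed to es.update_by_query(index='testrss', query=qry, script=scr) exactly as before.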