OpenSearch - best practice for indexing


I have ~1 TB of old Apache log data that I would like to index in OpenSearch. The logs are per day and structured like: s3://bucket/logdata/year/year_month_day.json.gz

I plan to use Logstash for the ingest and wonder how best to set up the index(es) for performance. I would like one index per day, but how do I extract the date from the logfile name above to get it right in the Logstash conf file?

index = > "%{ YYYY.MM.dd}" will solve the future logfiles but how do I solve it for the old ones?

CodePudding user response:

You can do it like this using the dissect filter, which can parse the date components from the bucket key and reconstruct the date into a new field called log_date:

dissect {
    mapping => {
        "[@metadata][s3][key]" => "%{ignore}/logdata/%{ ignore}/%{year}_%{ month}_%{day}.json.gz"
    }
    add_field => {
       "log_date" => "%{year}-%{month}-%{day}"
    }
    remove_field => ["ignore"]
}
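
For context, here is a minimal sketch of the S3 input that would feed this filter. The bucket, prefix and region values are placeholders, and it assumes a reasonably recent logstash-input-s3 plugin, which is what exposes the object key at [@metadata][s3][key] that the dissect mapping reads:

input {
    s3 {
        bucket => "bucket"       # placeholder: your bucket name
        prefix => "logdata/"     # only pick up the old log files
        region => "eu-west-1"    # placeholder: your bucket's region
        codec  => "json"         # assuming one JSON object per line; .gz objects are decompressed by the plugin
    }
}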

Then in your output section you can reference that new field in order to build your index name:

index = > "your-index-%{log_date}"

PS: another way is to parse the year_month_day part as a single token and replace the _ characters with - using mutate/gsub.
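
A rough sketch of that variant, assuming the same [@metadata][s3][key] field is available; the %{?...} keys are dissect skip fields, so there is nothing to clean up afterwards:

dissect {
    mapping => {
        "[@metadata][s3][key]" => "%{?prefix}/logdata/%{?year}/%{log_date}.json.gz"
    }
}
mutate {
    # turn 2015_01_31 into 2015-01-31
    gsub => ["log_date", "_", "-"]
}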

CodePudding user response:

In my experience, daily indices can quickly get out of control: they vary greatly in size, and with a decent retention period the cluster can easily end up oversharded. I would recommend setting up rollover with an ILM-style policy (ISM in OpenSearch) based on both index age (7 or 30 days, depending on logging volume) and primary shard size (a common threshold is 50 GB). You can also add a delete phase to the same policy, based on your retention period.

This way you'll get optimal indexing and search performance, as well as uniform load distribution and resource usage.
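
OpenSearch's equivalent of ILM is ISM (Index State Management), but the idea is the same. Here is a rough sketch of such a policy, with placeholder names, ages and sizes to adapt to your retention needs (min_primary_shard_size requires a fairly recent OpenSearch version; older ones only support min_size). Rollover also requires writing through an alias, with plugins.index_state_management.rollover_alias set on the index template:

PUT _plugins/_ism/policies/apache-logs-policy
{
  "policy": {
    "description": "Roll over hot indices, delete after retention",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "30d",
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "365d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["apache-logs-*"], "priority": 100 }
    ]
  }
}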
