
We are considering moving our log analytics solution from ElasticSearch/Kibana to Splunk.

We currently use the "document id" in ElasticSearch to deduplicate records when indexing:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

We generate the id using a hash of the content of each log record.
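
Roughly what our indexer does today (a simplified sketch; the SHA-256 choice and the field names are just illustrative):

    import hashlib
    import json

    import requests

    ES_URL = "http://localhost:9200"   # placeholder
    INDEX = "logs"                     # placeholder

    record = {"host": "web-01", "message": "user login failed"}  # illustrative log record

    # Hash the full record content; identical records produce the same document id,
    # so re-sending them overwrites the existing document instead of creating a duplicate.
    doc_id = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()

    # PUT /<index>/_doc/<id> indexes the document under that id.
    resp = requests.put(f"{ES_URL}/{INDEX}/_doc/{doc_id}", json=record)
    print(resp.json()["result"])  # "created" the first time, "updated" on a re-send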

In Splunk, I found the internal field "_cd", which is unique to each record in a Splunk index: https://docs.splunk.com/Documentation/Splunk/8.1.0/Knowledge/Usedefaultfields

However, when using the HTTP Event Collector to ingest records, I couldn't find any way to embed this "_cd" field in the request: https://docs.splunk.com/Documentation/Splunk/8.1.0/Data/HECExamples
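
For reference, this is roughly what our HEC request looks like (host, token, and sourcetype are placeholders); there is no obvious key in the event payload that would map onto "_cd":

    import requests

    HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder

    payload = {
        "time": 1609459200,                                   # event timestamp (epoch seconds)
        "sourcetype": "my:logs",                              # illustrative sourcetype
        "event": {"host": "web-01", "message": "user login failed"},
    }

    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json=payload,
        verify=False,  # test instance with a self-signed certificate
    )
    print(resp.json())  # {"text": "Success", "code": 0} on success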

Any tips on how to achieve this in Splunk?

Ashika Umanga Umagiliya

2 Answers


HEC inputs don't go through the usual ingestion pipeline, so not all internal fields are present.

Not that it matters, really, because Splunk doesn't deduplicate at index time. There is no provision for searching data to see if a given record is already present. Any deduplication must be done at search time.

One cannot use the _cd field to deduplicate at search time because two identical records will have different _cd values.

Consider using a tool such as Cribl to add a hash to each ingested record and use that hash in Splunk to deduplicate in your searches.
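
Whether the hash is added by Cribl or by your own sender, the idea is the same. Here is a minimal sketch of the sender-side variant, assuming you control the HEC client (the hash algorithm, the sourcetype, and the "content_hash" field name are just examples); the optional "fields" object in the HEC event payload defines custom index-time fields:

    import hashlib
    import json

    import requests

    HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder

    def send_with_hash(event: dict) -> None:
        # Stamp each record with a hash of its content so duplicates can be
        # collapsed later at search time.
        content_hash = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode("utf-8")
        ).hexdigest()
        payload = {
            "sourcetype": "my:logs",                    # illustrative
            "event": event,
            "fields": {"content_hash": content_hash},   # becomes an index-time field
        }
        requests.post(
            HEC_URL,
            headers={"Authorization": f"Splunk {HEC_TOKEN}"},
            json=payload,
            verify=False,  # test instance only
        )

    send_with_hash({"host": "web-01", "message": "user login failed"})

At search time you can then collapse duplicates on that field with stats (or dedup) in your searches.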

RichG

What are you trying to achieve?

If you're sending "unique" events to the HEC, or you're running Universal Forwarders (UFs) on "unique" logs, you'll never get duplicate "records when indexing".

It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.

Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement

We currently use the "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using a hash of the content of each log record.

then you need to evaluate what is going "wrong" in your sending process such that you feel you need to pre-clean the data before ingesting it.

It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming in to be 'correct' from whatever is submitting it.

How are you getting duplicate data in the first place?

Fields in Splunk which begin with an underscore (e.g. _time, _cd, etc.) are not editable/sendable - they're generated by Splunk when it receives data. In other words, they're all internal fields. Searchable. Usable. But not overridable.

If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats and, when absolutely necessary/unavoidable, dedup).
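
As a rough sketch of what that can look like in practice (the host, credentials, index, and sourcetype below are all assumptions), you can collapse identical raw events with stats and run the search over Splunk's REST search API:

    import requests

    SPLUNK_API = "https://splunk.example.com:8089"  # management port; placeholder host

    # Keep one event per distinct raw record; "| dedup _raw" is the simpler
    # (but often costlier) alternative to the stats-based version below.
    SEARCH = (
        "search index=main sourcetype=my:logs "
        "| stats latest(_time) as _time count as copies by _raw"
    )

    resp = requests.post(
        f"{SPLUNK_API}/services/search/jobs/export",
        data={"search": SEARCH, "output_mode": "json"},
        auth=("admin", "changeme"),  # placeholder credentials
        verify=False,                # test instance with a self-signed certificate
        stream=True,
    )

    # The export endpoint streams results back as JSON objects, one per line.
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))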

warren