I am trying to dedupe an array of JSON objects where the content and author are the same but the timestamps differ slightly (i.e. by less than 1 second). I'd like to keep a count of the removed duplicates in a new field called duplicates. For example, in the following array, entries 2, 3 and 5 are the same message and should be collapsed into one:
myObject = [
{content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1'},
{content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2'},
{content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2'},
{content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1'},
{content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2'},
{content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2'},
]
should be transformed to:
deduped = [
{content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1', duplicates: 0},
{content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2', duplicates: 2},
{content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1', duplicates: 0},
{content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2', duplicates: 0},
]
The part that I am having trouble with is the datetime. Sorting by datetime and then reducing is prone to errors if a non-duplicate message occurs between the duplicates. Comparing the string values of the datetimes is also error-prone, because two messages may be very close together and still fall on either side of a second boundary (12:12:40.900 and 12:12:41.100 are only 200 ms apart, yet they show different seconds).
Using lodash _.uniqWith, I can dedupe on the combination of an actual time delta with identical content and author, but I lack the duplicates field...
const dedupedButNoCount = _.uniqWith(myObject, (item1, item2) => {
  // Math.abs makes the comparison independent of argument order
  return (item1.content == item2.content) && (item1.author == item2.author)
    && (Math.abs(new Date(item1.date).getTime() - new Date(item2.date).getTime()) < 500)
})
Any pointers on how to dedupe an array of objects with similar but not identical datetimes?
Answer:
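I've done this before, using a sort followed by a reduce. Sorting by content, then author, then date puts potential duplicates next to each other, so an unrelated message can no longer fall between them.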
const
  getTimeMs = YMDhmsx => // date string conversion to UTC (time zone = 0)
    {
    let [Y,M,D,h,m,s,x] = YMDhmsx.split(/\-|\.|\s|\:/).map(Number)
    return Date.UTC(Y, M - 1, D, h, m, s, x) // UTC time value in ms (months are 0-based)
    }
, myObject = [
    {content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1'},
    {content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2'},
    {content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2'},
    {content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1'},
    {content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2'},
    {content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2'},
  ]
let result =
  [...myObject] // copy first: sort() works in place
  .sort( (a,b) =>
    a.content.localeCompare(b.content) ||
    a.author.localeCompare(b.author) ||
    a.date.localeCompare(b.date)
    )
  .reduce( (r,el,i,{[i-1]:prev}) => // prev = previous element of the sorted array
    {
    let msTime = getTimeMs(el.date)
    if (el.content === prev?.content
    && el.author === prev?.author
    && (msTime - r.msTime) <= 1000 ) // within 1 second of the previous message
      r.current.duplicates++;
    else
      {
      r.current = {...el, duplicates:0 }
      r.result.push( r.current )
      }
    r.msTime = msTime
    return r
    }
    , {msTime:0, current:null, result:[] })
  .result;
console.log ( 'result:\n' + JSON.stringify( result ).replaceAll('},{','}\n,{') )
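If the original message order matters, the sort can be avoided. What follows is only a sketch of one way to do it: a single pass that remembers, per content and author pair, the last message seen. The names dedupeWithCount, toMs and windowMs are hypothetical, and the sketch assumes the messages arrive in roughly chronological order. Like the reduce above, it measures each message against the previous one with the same content and author, so a run of close messages chains together.

```javascript
// Parse 'YYYY-MM-DD hh:mm:ss.SSS' into UTC milliseconds.
const toMs = (stamp) => {
  const [Y, M, D, h, m, s, ms] = stamp.split(/[-.\s:]/).map(Number);
  return Date.UTC(Y, M - 1, D, h, m, s, ms); // months are 0-based
};

// windowMs is an assumed tolerance (1 second, as in the question).
function dedupeWithCount(messages, windowMs = 1000) {
  const lastSeen = new Map(); // "content+author" key -> { entry, ms }
  const result = [];
  for (const msg of messages) {
    const key = JSON.stringify([msg.content, msg.author]);
    const ms = toMs(msg.date);
    const seen = lastSeen.get(key);
    if (seen && Math.abs(ms - seen.ms) <= windowMs) {
      seen.entry.duplicates++; // count it on the entry that was kept
      seen.ms = ms;            // chain: next message is measured from this one
    } else {
      const entry = { ...msg, duplicates: 0 }; // copy, input stays untouched
      result.push(entry);
      lastSeen.set(key, { entry, ms });
    }
  }
  return result;
}

const sample = [
  { content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2' },
  { content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2' },
  { content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2' },
];
console.log(dedupeWithCount(sample)); // one entry, with duplicates: 2
```

Because the Map lookup is keyed on content and author, a non-duplicate message sitting between two duplicates does not break the run, and the output keeps the order in which each message first appeared.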