Deduping an array of objects with similar datetimes

Time:07-05

I am trying to dedupe an array of JSON objects where the content and author are the same but the timestamps differ slightly (i.e. by no more than 1 second). I'd like to record the number of dropped duplicates in a new field called duplicates. For example, in the following array, entries 2, 3 and 5 are duplicates of one another and should be collapsed into a single entry:

myObject = [
{content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1'}, 
{content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2'},
{content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2'},
{content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1'},
{content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2'},
{content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2'},
]

This should be transformed to:

deduped = [
{content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1', duplicates: 0}, 
{content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2', duplicates: 2},
{content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1', duplicates: 0},
{content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2', duplicates: 0},
]

The part that I am having trouble with is the datetime. Sorting by datetime and then reducing is error-prone if a non-duplicate message occurs between the duplicates. Comparing the string values of the datetimes is also error-prone, because two messages may be only a few milliseconds apart and still fall in different seconds.
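
As a minimal sketch of that second pitfall, assuming the timestamps are meant to be read as UTC (toMs is a hypothetical helper, not part of my actual code), entries 2 and 3 differ in their seconds field but are only 200 ms apart:

```javascript
// Hypothetical helper: parse 'YYYY-MM-DD hh:mm:ss.SSS' as UTC milliseconds.
function toMs(stamp) {
  const [Y, M, D, h, m, s, ms] = stamp.split(/[-:.\s]/).map(Number);
  return Date.UTC(Y, M - 1, D, h, m, s, ms);
}

const a = '1980-08-01 12:12:40.900';
const b = '1980-08-01 12:12:41.100';

// Comparing the strings truncated to whole seconds says "different"...
const sameSecond = a.slice(0, 19) === b.slice(0, 19);
// ...while the numeric difference shows how close the two really are.
const deltaMs = Math.abs(toMs(b) - toMs(a));

console.log(sameSecond, deltaMs);
```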

Using lodash _.uniqWith, I can dedupe on the combination of an actual time delta with identical content and author, but then I lack the duplicates field...

const dedupedButNoCount = _.uniqWith(myObject, (item1, item2) =>
  item1.content === item2.content &&
  item1.author === item2.author &&
  // Math.abs so the check does not depend on the order of the arguments
  Math.abs(new Date(item1.date).getTime() - new Date(item2.date).getTime()) < 500
)

Any pointers on how to dedupe an array of objects with similar but not identical datetimes?

CodePudding user response:

I've done something similar, but using a sort:

const
  getTimeMs = YMDhmsx =>     // date string conversion to UTC (time zone = 0)
    {
    let [Y,M,D,h,m,s,x] = YMDhmsx.split(/\-|\.|\s|\:/).map(Number)
    return (new Date(Date.UTC(Y,--M,D,h,m,s,x))).getTime() // time UTC value in ms
    }
, myObject = [
    {content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1'}, 
    {content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2'},
    {content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2'},
    {content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1'},
    {content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2'},
    {content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2'},
    ]
    
let result = 
  myObject
  .sort( (a,b) =>
    a.content.localeCompare(b.content) || 
    a.author.localeCompare(b.author) || 
    a.date.localeCompare(b.date) 
    )
  .reduce( (r,el,i,{[i-1]:prev}) =>
    {
    let msTime = getTimeMs(el.date)

    if (el.content === prev?.content 
     && el.author === prev?.author
     && (msTime - r.msTime) <= 1000 )  // at most 1 second after the previous entry
      r.current.duplicates++;
    else
      {
      r.current = {...el, duplicates:0 }
      r.result.push( r.current )
      }
    r.msTime = msTime
    return r
    }
    , {msTime:0, current:null, result:[] })
  .result;
  
console.log ( 'result:\n' + JSON.stringify( result ).replaceAll('},{','}\n,{') ) 
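
If the original order has to be kept, or unrelated messages can sit between the duplicates, a single pass with a Map may be an alternative to sorting. The following is only a sketch, not the approach above: dedupeWithCounts, toMs and windowMs are hypothetical names, and it assumes a message counts as a duplicate when it is within windowMs of the most recent message with the same content and author.

```javascript
// Hypothetical helper: parse 'YYYY-MM-DD hh:mm:ss.SSS' as UTC milliseconds.
function toMs(stamp) {
  const [Y, M, D, h, m, s, ms] = stamp.split(/[-:.\s]/).map(Number);
  return Date.UTC(Y, M - 1, D, h, m, s, ms);
}

// Assumption: a duplicate is a message within windowMs of the most recent
// message that has the same content and author.
function dedupeWithCounts(messages, windowMs = 1000) {
  const lastSeen = new Map(); // key -> { kept: output entry, ms: last timestamp }
  const result = [];
  for (const msg of messages) {
    const key = JSON.stringify([msg.content, msg.author]);
    const ms = toMs(msg.date);
    const prev = lastSeen.get(key);
    if (prev && Math.abs(ms - prev.ms) <= windowMs) {
      prev.kept.duplicates += 1; // count it on the entry that was kept
      prev.ms = ms;              // slide the window to the latest duplicate
    } else {
      const kept = { ...msg, duplicates: 0 };
      result.push(kept);
      lastSeen.set(key, { kept, ms });
    }
  }
  return result;
}

const messages = [
  {content: 'content1', date: '1980-08-01 12:12:40.000', author: 'Person1'},
  {content: 'content2', date: '1980-08-01 12:12:40.900', author: 'Person2'},
  {content: 'content2', date: '1980-08-01 12:12:41.100', author: 'Person2'},
  {content: 'content3', date: '1980-08-01 12:12:41.000', author: 'Person1'},
  {content: 'content2', date: '1980-08-01 12:12:41.400', author: 'Person2'},
  {content: 'content4', date: '1980-08-01 12:12:45.100', author: 'Person2'},
];

const deduped = dedupeWithCounts(messages);
console.log(deduped);
```

Because the input array is neither sorted nor mutated, the kept entries come out in their original order, and the content3 message sitting between the content2 duplicates does not break the count.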
