How can I compare records row by row and remove one row that does not meet my condition?


I have a collection that looks like this:

username  timestamp             roomkey    roomdoor
amy123    30 Jan 2022 18:30:52  retrieved  closed
amy123    30 Jan 2022 18:31:30  retrieved  opened
bob1      30 Jan 2022 18:32:30  retrieved  opened
bob1      30 Jan 2022 20:45:30  retrieved  closed
bob1      30 Jan 2022 20:46:15  returned   closed

What I want is to return all the data, but merge records that belong to the same user and fall within one minute of each other. For example, amy123 first retrieved the key and then opened the door, and these two records were added within one minute, so I want to return only the latest of amy123's records within that minute. bob1 also has a record within that same minute, but since it is a different user, I want to show it too.

Final output should look like this:

username  timestamp             roomkey    roomdoor
amy123    30 Jan 2022 18:31:30  retrieved  opened
bob1      30 Jan 2022 18:32:30  retrieved  opened
bob1      30 Jan 2022 20:46:15  returned   closed

How can I code the above, either as a MongoDB aggregation pipeline or in Python over the JSON records?

My idea is to first check whether the previous record has the same username. If it does, I check the time difference: if it is <= 1 minute, I keep only the latest record; otherwise I keep both. However, I don't know how to code this.
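In rough Python terms, the rule I have in mind looks something like this sketch (just the intended logic, assuming the records are a list of dicts sorted by timestamp, with timestamp already parsed to a datetime):

from datetime import timedelta

def merge_within_minute(records):
    out = []
    for rec in records:
        prev = out[-1] if out else None
        same_user_close = (
            prev is not None
            and prev["username"] == rec["username"]
            and rec["timestamp"] - prev["timestamp"] <= timedelta(minutes=1)
        )
        if same_user_close:
            out[-1] = rec   # same user within a minute: keep only the latest
        else:
            out.append(rec)
    return out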

Thanks in advance!

CodePudding user response:

Here is a solution where I group the timestamps into 2-minute windows with groupby and a pd.Grouper:

import pandas as pd

df = pd.DataFrame(
    [
        {"username": "amy123", "timestamp": "30 Jan 2022 18:30:52", "roomkey": "retrieved", "roomdoor": "closed"},
        {"username": "amy123", "timestamp": "30 Jan 2022 18:31:30", "roomkey": "retrieved", "roomdoor": "opened"},
        {"username": "bob1", "timestamp": "30 Jan 2022 18:32:30", "roomkey": "retrieved", "roomdoor": "opened"},
        {"username": "bob1", "timestamp": "30 Jan 2022 20:45:30", "roomkey": "retrieved", "roomdoor": "closed"},
        {"username": "bob1", "timestamp": "30 Jan 2022 20:46:15", "roomkey": "returned", "roomdoor": "closed"},
    ]
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# pd.Grouper consumes "timestamp" as the bin key, so keep a copy of the real
# event time; otherwise the result shows bin starts, not actual timestamps.
df["event_time"] = df["timestamp"]

df = (
    df.groupby(["username", pd.Grouper(key="timestamp", freq="2Min")])
    .last()                         # latest record in each (user, 2-min bin)
    .dropna(subset=["event_time"])  # drop any empty bins the Grouper generates
    .reset_index()
    .drop(columns="timestamp")
    .rename(columns={"event_time": "timestamp"})
)
df

Here is the output:

username  timestamp             roomkey    roomdoor
amy123    30 Jan 2022 18:31:30  retrieved  opened
bob1      30 Jan 2022 18:32:30  retrieved  opened
bob1      30 Jan 2022 20:46:15  returned   closed

One caveat: fixed 2-minute bins only approximate "within one minute of the previous record". Two records less than a minute apart can straddle a bin boundary (bob1's 20:45:30 and 20:46:15 rows fall into the bins starting at 20:44 and 20:46), in which case both rows survive instead of being merged.
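If the bin-boundary behavior matters, here is a sketch (not part of the original answer) that follows the asker's rule exactly: re-create df from the raw records as above, start a new group whenever the gap to the same user's previous record exceeds one minute, then keep the last row of each group. The burst column name is arbitrary.

df = df.sort_values(["username", "timestamp"])

# True where the gap to the same user's previous record exceeds one minute;
# the cumulative sum then labels each run of close-together records.
new_group = df.groupby("username")["timestamp"].diff() > pd.Timedelta(minutes=1)
df["burst"] = new_group.groupby(df["username"]).cumsum()

out = (
    df.groupby(["username", "burst"], as_index=False)
    .last()
    .drop(columns="burst")
)
out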

CodePudding user response:

The question is well-solved by using $reduce as a "for loop with state." Consider this input set:

    {name: "amy", ts: new ISODate("2020-01-01T00:00:20Z"), o1:1, o2:"X1"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:00:30"), o1:2, o2:"X2"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:00:58"), o1:3, o2:"X3"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:01:15"), o1:31, o2:"X31"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:01:30"), o1:32, o2:"X32"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:02:00"), o1:4, o2:"X4"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:02:40"), o1:5, o2:"X5"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:04:00"), o1:65, o2:"X65"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:04:10"), o1:75, o2:"X75"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:20:35"), o1:86, o2:"X86"}
    ,{name: "amy", ts: new ISODate("2020-01-01T00:20:36"), o1:96, o2:"X96"}


    ,{name: "bob", ts: new ISODate("2020-01-01T00:00:30"), o1:7, o2:"X7"}
    ,{name: "bob", ts: new ISODate("2020-01-01T00:01:30"), o1:8, o2:"X8"}
    ,{name: "bob", ts: new ISODate("2020-01-01T00:01:35"), o1:9, o2:"X9"}

o1 and o2 are placeholders for other data and are not directly part of the one-minute bucketing scheme. They are carried along automatically by the solution; any number of other fields of any type can be carried without changing the query.

db.foo.aggregate([
    // If appropriate, start with a $match stage here to cut down the amount
    // of material, especially dates.  You probably do not want 1 min buckets
    // of everything from day 1.  For now, no $match.

    // Ensure everything is going in date order....
    {$sort: {ts:1}},

    // Group by name and push the whole doc (which is in date order) onto                             
    // array 'a':                                                                                     
    {$group: {_id:"$name", a: {$push: "$$CURRENT"}}},                                               
    
    {$addFields: {a: {$let: {
        // Get first element of a in prep for setting init value...                                   
        vars: {sd: {$arrayElemAt:["$a",0]}},
        in: {$reduce: {
            input: "$a",

            initialValue: {
                prev: "$$sd",    // the whole doc
                last: "$$sd.ts", // last anchor date, i.e. start of the 60-second window
                accum: []
            },

            in: {$cond: [
                // If the next ts < 60000 millis beyond anchor..                                      
                {$lt:[{$subtract:["$$this.ts", "$$value.last"]}, 60000]},

                // then capture it as prev but let last anchor                                        
                // and accumulated hits carry forward unchanged                                       
                {prev: "$$this",
                 last: "$$value.last", // carry                                                       
                 accum: "$$value.accum" // carry                                                      
                },

                // else capture it AND reset the anchor AND append the
                // previous value to the accum array (because at this
                // point we "overran" the 60-second window). Note that
                // $concatArrays wants arrays, not docs, as inputs, so we
                // must wrap $$value.prev (which is a whole doc) with []:
                {prev: "$$this",
                 last: "$$this.ts", // reset last to this one                                         
                 accum: {$concatArrays: [ "$$value.accum", ["$$value.prev"] ] }
                }
                ]
            }
        }}
      }}
    }},

    // The algorithm always leaves the very last value (which is ALWAYS one
    // we take) "dangling" in prev, so here we simply do one more concat.
    // We also take the opportunity to "lift" 'a.accum' up to just 'a',
    // which in turn gets rid of 'prev' and 'last':
    {$addFields: {a: {$concatArrays: [ "$a.accum", [ "$a.prev" ]]} }}
]);

This yields:

{
    "_id" : "amy",
    "a" : [
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2cb"),
            "name" : "amy",
            "ts" : ISODate("2020-01-01T00:01:15Z"),
            "o1" : 31,
            "o2" : "X31"
        },
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2cd"),
            "name" : "amy",
            "ts" : ISODate("2020-01-01T00:02:00Z"),
            "o1" : 4,
            "o2" : "X4"
        },
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2ce"),
            "name" : "amy",
            "ts" : ISODate("2020-01-01T00:02:40Z"),
            "o1" : 5,
            "o2" : "X5"
        },
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2d0"),
            "name" : "amy",
            "ts" : ISODate("2020-01-01T00:04:10Z"),
            "o1" : 75,
            "o2" : "X75"
        },
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2d2"),
            "name" : "amy",
            "ts" : ISODate("2020-01-01T00:20:36Z"),
            "o1" : 96,
            "o2" : "X96"
        }
    ]
}
{
    "_id" : "bob",
    "a" : [
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2d3"),
            "name" : "bob",
            "ts" : ISODate("2020-01-01T00:00:30Z"),
            "o1" : 7,
            "o2" : "X7"
        },
        {
            "_id" : ObjectId("61f7e34ba565bb368b38e2d5"),
            "name" : "bob",
            "ts" : ISODate("2020-01-01T00:01:35Z"),
            "o1" : 9,
            "o2" : "X9"
        }
    ]
}
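The result above is one document per user, with the surviving records in the a array. If you would rather end up with flat documents again, one per surviving record, you can append two more stages to the same pipeline (an addition to the original answer, using the standard $unwind and $replaceRoot stages):

db.foo.aggregate([
    // ... the same stages as above ...
    {$unwind: "$a"},
    {$replaceRoot: {newRoot: "$a"}}
]);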

If you do not want or need a generic solution, then instead of pushing $$CURRENT you can push a subset doc, e.g.

    {$group: {_id:"$name", a: {$push: {ts: "$ts", o1: "$o1"}} }},

The rest of the pipeline is unchanged, but you must always include the ts field in the pushed document, since it drives the bucketing logic.
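As the comment at the top of the pipeline notes, on a large collection you would normally bound the input first. A sketch of such a $match stage (the date window here is an assumed example), placed before the $sort:

    {$match: {ts: {$gte: new ISODate("2020-01-01T00:00:00Z"),
                   $lt:  new ISODate("2020-01-02T00:00:00Z")}}},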
