I have an array element in my DB and I want to group By and calculate the number of repetitions of different elements in this array based on different datetime. assume following collection:
{
_id: ObjectId(7df78ad8902c)
title: 'MongoDB Overview',
tags: ['SQL', 'database', 'NoSQL'],
created_at: 2021-10-03 10:05:51.755Z
},
{
_id: ObjectId(7df78ad8902d)
title: 'NoSQL Overview',
tags: ['mongodb', 'database', 'PHP'],
created_at: 2021-10-03 14:05:51.755Z
},
{
_id: ObjectId(7df78ad8902d)
title: 'Developing',
tags: ['java', 'btc/usdt', 'PHP'],
created_at: 2021-10-03 14:05:51.755Z
}
,
{
_id: ObjectId(7df78ad8902d)
title: 'databases for search',
tags: ['elasticsearch', 'database', 'PHP'],
created_at: 2021-10-03 12:05:51.755Z
}
I want to calculate the number of repetitions of different elements in tags field such as mongodb, database, noSQL based on datetime (for example hot hashtags in last hour, today or this month) in this collection. How can I solve this problem in mongo?
expected answer like .
1 - hot hashtags in last hour ['a' , 'b' , 'c']
2 - hot hastags in last 5 hours : ...
3 - today : ...
4 - this month : ...
CodePudding user response:
Query
- not calendar based, based on difference on milliseconds
- keep only last 30 days data
- facet and 4 group by
- its exactly the same code 4x
The only difference is the multiply
- last 30days :
(now_date-created_at) <= (* 30 24 60 60 1000)
- last 24h :
(now_date-created_at) <= (* 24 60 60 1000)
- last 120 hours(5days) :
(now_date-created_at) <= (* 5 60 60 1000)
- last 60 min :
(now_date-created_at) <= (* 60 60 1000)
*subtraction works on dates also, and returns milliseconds
- last 30days :
- filter the date depending on what we want its 4 different filters
- unwind
- group by tag and count occurences
- sort by count
- limit 2 to keep like the 2 top hotttest tags, you can change it to any value, like limit 1 to keep only the hottest tag
*if you want calendar based like 3 October(3 days data only), query must be changed, the same is for day(query is for 24 hours) etc (in those cases we should use $hour
$month
etc)
Its not big change in query.
//same month
{"$eq" : [ {"$month" : "$$NOW"}, {"$month" : "$created_at"} ]
//same day(assuming same month from previous filter)
{"$eq" : [ {"$dayOfMonth" : "$$NOW"}, {"$dayOfMonth" : "$created_at"} ]
//same hour
*we could also use the new $dateTrunc
to check for same month etc.
(Query is big but its the same thing 4x)
db.collection.aggregate([
{
"$set": {
"created_at": {
"$dateFromString": {
"dateString": "$created_at"
}
}
}
},
{
"$unwind": {
"path": "$tags"
}
},
{
"$match": {
"$expr": {
"$lte": [
{
"$subtract": [
"$$NOW",
"$created_at"
]
},
{
"$multiply": [
30,
24,
60,
60,
1000
]
}
]
}
}
},
{
"$facet": {
"month-tag": [
{
"$match": {
"$expr": {
"$lte": [
{
"$subtract": [
"$$NOW",
"$created_at"
]
},
{
"$multiply": [
30,
24,
60,
60,
1000
]
}
]
}
}
},
{
"$group": {
"_id": "$tags",
"count": {
"$sum": 1
}
}
},
{
"$sort": {
"count": -1
}
},
{
"$limit": 2
},
{
"$project": {
"_id": 0,
"tag": "$_id"
}
}
],
"day-tag": [
{
"$match": {
"$expr": {
"$lte": [
{
"$subtract": [
"$$NOW",
"$created_at"
]
},
{
"$multiply": [
24,
60,
60,
1000
]
}
]
}
}
},
{
"$group": {
"_id": "$tags",
"count": {
"$sum": 1
}
}
},
{
"$sort": {
"count": -1
}
},
{
"$limit": 2
},
{
"$project": {
"_id": 0,
"tag": "$_id"
}
}
],
"5hour-tag": [
{
"$match": {
"$expr": {
"$lte": [
{
"$subtract": [
"$$NOW",
"$created_at"
]
},
{
"$multiply": [
5,
60,
60,
1000
]
}
]
}
}
},
{
"$group": {
"_id": "$tags",
"count": {
"$sum": 1
}
}
},
{
"$sort": {
"count": -1
}
},
{
"$limit": 2
},
{
"$project": {
"_id": 0,
"tag": "$_id"
}
}
],
"hour-tag": [
{
"$match": {
"$expr": {
"$lte": [
{
"$subtract": [
"$$NOW",
"$created_at"
]
},
{
"$multiply": [
60,
60,
1000
]
}
]
}
}
},
{
"$group": {
"_id": "$tags",
"count": {
"$sum": 1
}
}
},
{
"$sort": {
"count": -1
}
},
{
"$limit": 2
},
{
"$project": {
"_id": 0,
"tag": "$_id"
}
}
]
}
}
])