There are some tweet data like:
{
"id": 1432568863334539264,
"created_at": "2021-08-31 05:00:07",
"text": "Most of New Orleans went dark on Sunday after Hurricane Ida took out eight transmission lines. A new natural gas po… ",
"user_id": 807095
},
{
"id": 1432578972844531714,
"created_at": "2021-08-31 05:40:17",
"text": "RT @nytimes: Most of New Orleans went dark on Sunday after Hurricane Ida took out eight transmission lines. A new natural gas power plant,…",
"user_id": 1414848022849245186,
"retweet_id": 1432568863334539264,
"retweet_user_id": 807095,
"user_mentions": [
{
"id": 807095,
"indices": [
3,
11
]
}
]
},
{
"id": 1432578589107625990,
"created_at": "2021-08-31 05:38:46",
"text": "RT @nytimes: Most of New Orleans went dark on Sunday after Hurricane Ida took out eight transmission lines. A new natural gas power plant,…",
"user_id": 1345847262010875915,
"retweet_id": 1432568863334539264,
"retweet_user_id": 807095,
"user_mentions": [
{
"id": 807095,
"indices": [
3,
11
]
}
]
}
I want to use a single find or aggregate to resolve:
Find the general tweet that receives most retweets in the first hour after it is published. Print out the tweet Id and the number of retweets it received within the first hour.
my did not work script:
db.tweets_hurricane.aggregate([
{$match: {retweet_id:{$exists:true}}},
{$project: {_id: 0, id: '$id', retweet_id: '$retweet_id', format_date: { $toDate: '$created_at'}, format_date_end: { $add: [{$toDate: '$created_at'}, 1*60*60*1000]}}},
{$group: {_id: '$retweet_id', count: {$sum: {$switch: {branches: [{case: {$lt: ['$format_date', db.tweets_hurricane.findOne({id: '$retweet_id'}, {format_date_end:1})]}, then: 1}], default: 0}}}}},
{$sort: {count: -1}}
])
CodePudding user response:
The sample data makes it difficult to understand the solution, but it should work:
db.tweets_hurricane.aggregate([
// Join tweets and re-tweets
{ $match: { retweet_id: { $exists: false } } },
{
$lookup:
{
from: "tweets_hurricane",
localField: "id",
foreignField: "retweet_id",
as: "retweets"
}
},
// filter retweets created in the first hour after it is published
{
$set: {
retweets: {
$filter: {
input: "$retweets",
as: "tweet",
cond: { $lt: ["$$tweet.created_at", { $add: ["$created_at", 1000 * 60 * 60] }] }
}
}
}
},
// count number of retweets
{ $set: { retweet_count: { $size: "$retweets" } } },
// limit to most retweeted tweet
{ $sort: { retweet_count: -1 } },
{ $limit: 1 },
// finalize output
{
$project: {
id: 1,
retweet_count: 1
}
}
])
CodePudding user response:
Sorry it took so long. I had to generate dummy data to for testing the query. The query is very long but it's the most efficient one I that could come up with.
Note: I used the $lookup
to find all the re-tweets. But the version of mongodb installed on my machine is v4.4
. So I had to use an older syntax. Check out the docs for $lookup
to use the concise syntax for mongodb v => 5.0
.
Algorithm
1 - Find all the general tweets 2 - For each general tweet find all it's retweets by checking if - general_tweet_id === retweet_id and the difference of milliseconds between - the retweet and general tweet's - creation time is < 60*60*1000 (1 hour ) 3 - Count the retweets array size in the `$project` statement as num_of_retweets 4 - Sort num_of_retweets in descending order. 5 - use `$limit` to pick the first result.
Dummy Data
const data = [
{ _id: "200", created_at: "2021-01-02 11:25:00" },
{ _id: "201", created_at: "2021-01-02 12:19:00", retweet_id: "200" },
{ _id: "202", created_at: "2021-01-02 11:59:00", retweet_id: "200" },
{ _id: "203", created_at: "2021-01-02 12:19:00", retweet_id: "200" },
{ _id: "204", created_at: "2021-01-02 12:17:00", retweet_id: "200" },
{ _id: "205", created_at: "2021-01-02 11:57:00", retweet_id: "200" },
{ _id: "206", created_at: "2021-01-02 12:17:00", retweet_id: "200" },
{ _id: "207", created_at: "2021-01-02 12:16:00", retweet_id: "200" },
{ _id: "208", created_at: "2021-01-02 13:45:00", retweet_id: "200" },
{ _id: "209", created_at: "2021-01-02 13:29:00", retweet_id: "200" },
{ _id: "210", created_at: "2021-01-02 13:22:00", retweet_id: "200" },
{ _id: "211", created_at: "2021-01-01 05:28:00" },
{ _id: "212", created_at: "2021-01-01 05:59:00", retweet_id: "211" },
{ _id: "213", created_at: "2021-01-01 06:05:00", retweet_id: "211" },
{ _id: "214", created_at: "2021-01-01 06:11:00", retweet_id: "211" },
{ _id: "215", created_at: "2021-01-01 06:16:00", retweet_id: "211" },
{ _id: "216", created_at: "2021-01-01 06:00:00", retweet_id: "211" },
{ _id: "217", created_at: "2021-01-01 06:22:00", retweet_id: "211" },
{ _id: "218", created_at: "2021-01-01 07:09:00", retweet_id: "211" },
{ _id: "219", created_at: "2021-01-01 07:41:00", retweet_id: "211" },
{ _id: "220", created_at: "2021-01-01 06:54:00", retweet_id: "211" },
{ _id: "221", created_at: "2021-01-01 07:09:00", retweet_id: "211" },
{ _id: "222", created_at: "2021-01-02 14:22:00" },
{ _id: "223", created_at: "2021-01-02 14:56:00", retweet_id: "222" },
{ _id: "224", created_at: "2021-01-02 15:00:00", retweet_id: "222" },
{ _id: "225", created_at: "2021-01-02 15:00:00", retweet_id: "222" },
{ _id: "226", created_at: "2021-01-02 15:01:00", retweet_id: "222" },
{ _id: "227", created_at: "2021-01-02 14:56:00", retweet_id: "222" },
{ _id: "228", created_at: "2021-01-02 15:44:00", retweet_id: "222" },
{ _id: "229", created_at: "2021-01-02 16:51:00", retweet_id: "222" },
{ _id: "230", created_at: "2021-01-02 16:17:00", retweet_id: "222" },
{ _id: "231", created_at: "2021-01-02 16:42:00", retweet_id: "222" },
{ _id: "232", created_at: "2021-01-02 16:29:00", retweet_id: "222" },
{ _id: "233", created_at: "2021-01-01 03:40:00" },
{ _id: "234", created_at: "2021-01-01 04:22:00", retweet_id: "233" },
{ _id: "235", created_at: "2021-01-01 04:31:00", retweet_id: "233" },
{ _id: "236", created_at: "2021-01-01 04:24:00", retweet_id: "233" },
{ _id: "237", created_at: "2021-01-01 04:33:00", retweet_id: "233" },
{ _id: "238", created_at: "2021-01-01 06:06:00", retweet_id: "233" },
{ _id: "239", created_at: "2021-01-01 04:56:00", retweet_id: "233" },
{ _id: "240", created_at: "2021-01-01 04:53:00", retweet_id: "233" },
{ _id: "241", created_at: "2021-01-01 04:52:00", retweet_id: "233" },
{ _id: "242", created_at: "2021-01-01 05:25:00", retweet_id: "233" },
{ _id: "243", created_at: "2021-01-01 05:17:00", retweet_id: "233" },
]
Aggregation Pipeline
Note: I put all this data into a collection called test
that's why you see "test"
in the from
field of the $lookup
statement. Change it to your own collection name.
pipeline = [
{
$match: {
retweet_id: { $exists: false },
},
},
{
$lookup: {
from: "test",
let: { general_tweet_date: "$created_at", general_tweet_id: "$_id" },
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ["$retweet_id", "$$general_tweet_id"] },
{
$lte: [
{
$subtract: [
{ $toDate: "$created_at" },
{ $toDate: "$$general_tweet_date" },
],
},
3600000,
],
},
],
},
},
},
{
$project: {
_id: 1,
},
},
],
as: "retweets",
},
},
{
$project: {
num_of_retweets: { $size: "$retweets" },
},
},
{
$sort: { num_of_retweets: -1 },
},
{
$limit: 1,
},
];