Home > Back-end >  Tracking observations across different datasets (e.g. sub-reddits)
Tracking observations across different datasets (e.g. sub-reddits)

Time:05-08

I have three datasets from three different sub-reddits, and my goal is 1-to check how many users are active in df1 (i.e. a sub-reddit), active in df2, and/or df3 (i.e. another subreddit). Another goal is that once I merge all datasets, I am able to know in which sub-reddit was each user's post written at. For example, I would be interested in knowing if user X is active in sub-reddits 2 and 3, but not 1. And user Y is active in sub-reddits 1 and 3 but not 2.

In each dataset, I have 3 variables shown below:

post   date        username

Here is a sample of df1

post   date          username
xyz    1-03-2016     crashbash
mnz   1-03-2016      crashbash
mnc   1-03-2016      crashbash

Here is a sample of df2

post   date          username
yzh  1-05-2016      crashbash
wzh  1-05-2016      costanza89
zya  1-05-2016      costanza89

Here is a sample of df3

post   date             username
Fleabag is bad          1-05-2016      costanza89
southpark is the bestt!  1-08-2016      crashbash
fleabag is ok           1-08-2016      skunk49

Here is my code:

#Clearing out environment
rm(list = ls())
#Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stringr)
library(textclean)
library(lubridate)
library(zoo)
## importing 3 datasets
df1 <- read_excel("df1.xlsx")
df2 <- read_excel("df2.xlsx")
df3 <- read_excel("df3.xlsx")

I currently wrote the code below, which works well but it only tells me if a given user has more than one post in a given sub-reddit, but it does not make a distinction between users who have multiple posts within one sub-reddit, versus those who are active in more than one sub-reddit. I am mainly interested in learning the latter group.

all_subreddits <-
  bind_rows(df1,df2,df3,.id = "origin") %>% 
  group_by(username) %>% 
  mutate(active =  (n_distinct(origin) == 2), .keep = "unused")

After the code above, the data looks as follows, where active= 1 if a user appears more than once and 0 otherwise.

sapply(all_subreddits, class)
       post        date    username      active 
"character" "character" "character"   "integer" 

Ideally, however, I would like to have the following outcome with a variable indicating the sub-reddit where each user has been active in:

post              date           username               active
xyz               1-03-2016     crashbash         in df1 & df2
zya               1-05-2016      costanza89       in df1 and df3
fleabag is ok     1-08-2016      skunk49          in df3

After running the great solution proposed below, I get the following output:

sapply(all_subreddits, class)
 origin        post        date    username 
"character" "character"      "Date" "character"
print(all_subreddits)
A tibble: 1,037 x 4
   origin post                                                               date   username    
   <chr>  <chr>                                                              <date> <chr>       
748
df2
الشكوى لله ذلونا صراحه
27-09-2012
هتلر المخاريم
678
df2
اقتباس: المشاركة الأصلية كتبت بواسطة حظها العاثر (المشاركة 6775851) ^ والله صادقه يا اختي حسبي الله ونعم الوكيل انا واختي الشئ نفسه غير مؤهلين عشان راتب بابا التقاعدي الله يرحمه والله ظلم :( حسبي الله عليهم انا وخواتي مثلك يارب ياخذ حقنا منهم بالدنيا قبل الآخرة            
  • Related