I have three datasets from three different subreddits. My first goal is to check how many users are active in df1 (one subreddit), in df2, and/or in df3 (the other subreddits). My second goal is that, once I merge all the datasets, I can tell which subreddit each user's post was written in. For example, I would like to know that user X is active in subreddits 2 and 3 but not 1, and that user Y is active in subreddits 1 and 3 but not 2.
In each dataset, I have the three variables shown below:
post date username
Here is a sample of df1
post date username
xyz 1-03-2016 crashbash
mnz 1-03-2016 crashbash
mnc 1-03-2016 crashbash
Here is a sample of df2
post date username
yzh 1-05-2016 crashbash
wzh 1-05-2016 costanza89
zya 1-05-2016 costanza89
Here is a sample of df3
post date username
Fleabag is bad 1-05-2016 costanza89
southpark is the bestt! 1-08-2016 crashbash
fleabag is ok 1-08-2016 skunk49
Here is my code:
#Clearing out environment
rm(list = ls())
#Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stringr)
library(textclean)
library(lubridate)
library(zoo)
## importing 3 datasets
df1 <- read_excel("df1.xlsx")
df2 <- read_excel("df2.xlsx")
df3 <- read_excel("df3.xlsx")
I currently have the code below. It runs, but it only tells me whether a given user has more than one post. It does not distinguish between users who have multiple posts within a single subreddit and users who are active in more than one subreddit. I am mainly interested in identifying the latter group.
all_subreddits <-
bind_rows(df1,df2,df3,.id = "origin") %>%
group_by(username) %>%
mutate(active = (n_distinct(origin) == 2), .keep = "unused")
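The distinction described above (many posts in one subreddit versus posts in several subreddits) can be sketched by computing both counts per user. This is only a minimal sketch on toy data copied from the samples in the question; the column names `n_posts`, `n_subreddits`, and `multi_subreddit` are made up for illustration. It uses `> 1` rather than `== 2`, so a user who posts in all three subreddits is still flagged.

```r
library(dplyr)

# Toy data copied from the samples in the question
df1 <- data.frame(post = c("xyz", "mnz", "mnc"),
                  date = "1-03-2016",
                  username = "crashbash")
df2 <- data.frame(post = c("yzh", "wzh", "zya"),
                  date = "1-05-2016",
                  username = c("crashbash", "costanza89", "costanza89"))
df3 <- data.frame(post = c("Fleabag is bad", "southpark is the bestt!", "fleabag is ok"),
                  date = c("1-05-2016", "1-08-2016", "1-08-2016"),
                  username = c("costanza89", "crashbash", "skunk49"))

# Naming the arguments makes `origin` hold "df1"/"df2"/"df3"
# instead of the default "1"/"2"/"3"
flags <- bind_rows(df1 = df1, df2 = df2, df3 = df3, .id = "origin") %>%
  group_by(username) %>%
  mutate(
    n_posts = n(),                                  # posts across all subreddits
    n_subreddits = n_distinct(origin),              # number of different subreddits
    multi_subreddit = as.integer(n_subreddits > 1)  # 1 if active in more than one
  ) %>%
  ungroup()

# One row per user, to inspect the flags
flags %>% distinct(username, n_posts, n_subreddits, multi_subreddit)
```

A user with several posts in a single subreddit gets `n_posts > 1` but `multi_subreddit == 0`, which is the case the original `active` column could not separate.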
After running the code above, the data looks as follows, where active = 1 if a user appears more than once and 0 otherwise:
sapply(all_subreddits, class)
post date username active
"character" "character" "character" "integer"
Ideally, however, I would like the following outcome, with a variable indicating the subreddits in which each user has been active:
post date username active
xyz 1-03-2016 crashbash in df1 & df2
zya 1-05-2016 costanza89 in df1 and df3
fleabag is ok 1-08-2016 skunk49 in df3
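One possible way to build a label column like this is to paste together the distinct `origin` values within each user. The following is a hedged sketch, not a definitive solution: it reuses toy data copied from the samples above, and the helper names `user_summary`, `combo_counts`, and `n_users` are assumptions introduced for illustration.

```r
library(dplyr)

# Toy data copied from the samples in the question
df1 <- data.frame(post = c("xyz", "mnz", "mnc"),
                  date = "1-03-2016",
                  username = "crashbash")
df2 <- data.frame(post = c("yzh", "wzh", "zya"),
                  date = "1-05-2016",
                  username = c("crashbash", "costanza89", "costanza89"))
df3 <- data.frame(post = c("Fleabag is bad", "southpark is the bestt!", "fleabag is ok"),
                  date = c("1-05-2016", "1-08-2016", "1-08-2016"),
                  username = c("costanza89", "crashbash", "skunk49"))

all_subreddits <- bind_rows(df1 = df1, df2 = df2, df3 = df3, .id = "origin") %>%
  group_by(username) %>%
  # Collapse the user's distinct subreddits into one label, e.g. "in df2 & df3"
  mutate(active = paste("in", paste(sort(unique(origin)), collapse = " & "))) %>%
  ungroup()

# One row per user with the label
user_summary <- all_subreddits %>% distinct(username, active)

# How many users fall into each combination of subreddits
combo_counts <- user_summary %>% count(active, name = "n_users")
```

Every post keeps its own `origin`, so the subreddit of each post stays available after the merge, while `active` describes the user. On these three-row samples the labels come out as "in df1 & df2 & df3" for crashbash, "in df2 & df3" for costanza89, and "in df3" for skunk49; the real data may of course differ.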
After running the great solution proposed below, I get the following output:
sapply(all_subreddits, class)
origin post date username
"character" "character" "Date" "character"
print(all_subreddits)
A tibble: 1,037 x 4
origin post date username
<chr> <chr> <date> <chr>
748 df2 الشكوى لله ذلونا صراحه 27-09-2012 هتلر المخاريم
678 df2 اقتباس: المشاركة الأصلية كتبت بواسطة حظها العاثر (المشاركة 6775851) ^ والله صادقه يا اختي حسبي الله ونعم الوكيل انا واختي الشئ نفسه غير مؤهلين عشان راتب بابا التقاعدي الله يرحمه والله ظلم :( حسبي الله عليهم انا وخواتي مثلك يارب ياخذ حقنا منهم بالدنيا قبل الآخرة