I have a dataset of chat messages from chatrooms. I need to filter out all chatrooms in which only one person wrote something in the chat (even if that person wrote multiple things). So in the example dataset below, I need to eliminate Chatrooms 1, 6, and 8.
data.table(Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8), Person = c("A","A", "B","C","D","E","F","G","H","I","J","J","J","K","L","M", "M"), Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey", "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi", "Hello?", "Helllooooooo?"))
    Chatroom Person       Message
 1:        1      A            Hi
 2:        1      A    You there?
 3:        2      B         Hello
 4:        2      C            Hi
 5:        3      D           Hey
 6:        3      E         Howdy
 7:        4      F            Hi
 8:        4      G           Hey
 9:        5      H     Greetings
10:        5      I            Hi
11:        6      J            Hi
12:        6      J        Hello?
13:        6      J Anyone there?
14:        7      K           Hey
15:        7      L            Hi
16:        8      M        Hello?
17:        8      M Helllooooooo?
Obviously, this can be done manually, but I have far too much data to filter by hand.
Is there a way to do this with one or more scripts in R?
I am imagining needing a script that can identify and save the list of chatrooms that contain only one person and then another script to remove the Chatrooms from that list, but I don't know which functions can accomplish this.
Help?
CodePudding user response:
This can be done with dplyr's group_by and filter. First, assign your data frame a name. From there, you can pipe (%>%) it into a group_by on Chatroom and a filter that keeps only the chatrooms with more than one distinct Person. Grouping by Person instead would test how many messages each person wrote, which is not what the question asks for.
library(dplyr)
df <- data.frame(Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8), Person = c("A","A", "B","C","D","E","F","G","H","I","J","J","J","K","L","M", "M"), Message = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey", "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi", "Hello?", "Helllooooooo?"))
# Keep every row of a chatroom that has more than one distinct person
final <- df %>% group_by(Chatroom) %>% filter(n_distinct(Person) > 1) %>% ungroup()
final
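The two-step idea described in the question (first list the single-person chatrooms, then remove them) can also be sketched in base R with no packages. This is only an illustration on a shortened sample; the names `n_people` and `solo_rooms` are made up here and do not come from the original post.

```r
# Shortened sample in the same shape as the question's data
df <- data.frame(
  Chatroom = c(1, 1, 2, 2, 6, 6, 6, 7, 7),
  Person   = c("A", "A", "B", "C", "J", "J", "J", "K", "L"),
  Message  = c("Hi", "You there?", "Hello", "Hi",
               "Hi", "Hello?", "Anyone there?", "Hey", "Hi")
)

# Step 1: count the distinct people in each chatroom
n_people <- tapply(df$Person, df$Chatroom, function(p) length(unique(p)))

# Chatrooms where only one person wrote anything (stored as character labels)
solo_rooms <- names(n_people)[n_people == 1]

# Step 2: drop every row that belongs to one of those chatrooms
final <- df[!(as.character(df$Chatroom) %in% solo_rooms), ]
final
```

Keeping `solo_rooms` as a separate object is handy if the list of removed chatrooms needs to be saved or inspected before filtering.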
CodePudding user response:
There are a number of options, all assuming df is a data.table as in the question. My first try was to use uniqueN(Person)>1 in .SD, by Chatroom:
df[, .SD[uniqueN(Person)>1], Chatroom]
Some possibly slightly faster options:
df[, ct:=uniqueN(Person), Chatroom][ct>1][,ct:=NULL]
OR
df[, ct:=length(unique(Person)), Chatroom][ct>1][,ct:=NULL]
OR
df[, ct:=max(rleid(Person)), Chatroom][ct>1][,ct:=NULL]
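For anyone who wants to try this end to end, here is a minimal self-contained sketch, assuming the data.table package is installed. It uses the `if (...) .SD` grouping idiom as one more variant, and it also shows that a helper column created with `:=` is added to `df` by reference, so it stays on the original table until it is removed there. The name `kept` is introduced only for this example.

```r
library(data.table)

df <- data.table(
  Chatroom = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8),
  Person   = c("A", "A", "B", "C", "D", "E", "F", "G", "H", "I",
               "J", "J", "J", "K", "L", "M", "M"),
  Message  = c("Hi", "You there?", "Hello", "Hi", "Hey", "Howdy", "Hi", "Hey",
               "Greetings", "Hi", "Hi", "Hello?", "Anyone there?", "Hey", "Hi",
               "Hello?", "Helllooooooo?")
)

# Return a group's rows only when it has more than one distinct person;
# groups that fail the test return NULL and drop out of the result
kept <- df[, if (uniqueN(Person) > 1) .SD, by = Chatroom]

# `:=` modifies df in place, so the counter column lives on df itself
df[, ct := uniqueN(Person), by = Chatroom]
has_ct <- "ct" %in% names(df)

# Remove the helper column from the original table when it is no longer needed
df[, ct := NULL]
```

The first variant leaves `df` untouched, which may be preferable when the original table is reused later in the script.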