How to compare rows by multiple columns in R?

Time:07-01

I have a data set that I am trying to clean without accidentally losing information. Each record is identified by two columns: a name and an ID number. There are duplicates, but I can't just remove them based on name or ID alone. Some ID numbers have variations of the name associated with them, and some names show up under different ID numbers (those are NOT the same record).

Here is an example of the data that I have:

data <- data.frame(stringsAsFactors = FALSE,
                   A = c("dog", "dog", "cat", "cat", "bird", "big bird", "bird", "dog", "red dog"),
                   B = c(111, 111, 222, 222, 333, 333, 333, 444, 444),
                   C = c("1", "1", "2", "2", "3", "3", "3", "4", "4"))
############################################################################################
         A   B C
1      dog 111 1
2      dog 111 1 #this row is redundant
3      cat 222 2
4      cat 222 2 #this row is redundant
5     bird 333 3
6 big bird 333 3
7     bird 333 3 #this row is redundant w/row 5
8      dog 444 4
9  red dog 444 4

But here is how I want it to appear:

ideal <- data.frame(stringsAsFactors = FALSE,
                    A = c("dog", "cat", "bird", "big bird", "dog", "red dog"),
                    B = c(111, 222, 333, 333, 444, 444),  # numeric, to match column B in data
                    C = c("1", "2", "3", "3", "4", "4"))

         A   B C
1      dog 111 1
2      cat 222 2
3     bird 333 3
4 big bird 333 3
5      dog 444 4
6  red dog 444 4

Two cases are confusing me. First, as exemplified by ID 333, there are cases where I want to keep some rows for manual checking ("bird" and "big bird"), but I don't need the rows for 333 that are exact duplicates. Second, as exemplified by ID 444, there are cases where the name matches one used by another ID (111), but those are NOT the same record, so I can't simply drop every row with a duplicated name. I will still have to check the data set manually once I reach my ideal outcome, but the task would be substantially easier if I could first get rid of the redundant rows in clean cases like IDs 111 and 222.

Answer:

Are you simply looking for unique rows? I'm not sure I understand, but if you simply do this:

unique(data)

or

dplyr::distinct(data)

you will get the result you show above (unique() keeps the original row numbers):

         A   B C
1      dog 111 1
3      cat 222 2
5     bird 333 3
6 big bird 333 3
8      dog 444 4
9  red dog 444 4
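
If it helps with the manual check afterwards, here is a minimal base R sketch that builds on unique(). It assumes "redundant" means a row that repeats an earlier row in every column, and the object names deduped, deduped_ab, n_names and needs_review are made up for illustration; they are not from the question.

```r
# Example data copied from the question.
data <- data.frame(stringsAsFactors = FALSE,
                   A = c("dog", "dog", "cat", "cat", "bird", "big bird", "bird", "dog", "red dog"),
                   B = c(111, 111, 222, 222, 333, 333, 333, 444, 444),
                   C = c("1", "1", "2", "2", "3", "3", "3", "4", "4"))

# Drop rows that repeat an earlier row in every column.
deduped <- unique(data)

# Same idea restricted to chosen columns: keep the first row of each name/ID pair.
deduped_ab <- data[!duplicated(data[, c("A", "B")]), ]

# For each remaining row, count how many rows share its ID.
n_names <- ave(seq_along(deduped$B), deduped$B, FUN = length)

# IDs that still carry more than one name are the ones left to check by hand.
needs_review <- deduped[n_names > 1, ]
print(needs_review)
```

With the example data, needs_review should contain only the rows for IDs 333 and 444, while 111 and 222 are already clean. The duplicated() form is only worth using if other columns in the real data differ between rows you consider the same.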