I am trying to converge my data by organization ids. To do so, I need to find how many different domains there are for each id (excluding NAs). I can't figure out how to do it in R using dplyr package. Any help is appreciated.
This is something similar to what my data looks like.
id | domain |
---|---|
1 | ca |
1 | ca |
1 | com |
1 | na |
1 | com |
2 | org |
2 | na |
2 | ca |
3 | org |
3 | ca |
3 | com |
3 | net |
What I want is to get the following output:
id | count_domain |
---|---|
1 | 2 |
2 | 2 |
3 | 4 |
I don't want to do this in excel, but if that's the more efficient approach, I am also open to answers in excel.
PS: My actual data contains much more rows and types of domain, so writing 'if statements' is not something that would be feasible.
CodePudding user response:
I assume that your data has NA
and not literal string "na"
, then you could try
library(tidyverse)
df <- read_table("id domain
1 ca
1 ca
1 com
1 NA
1 com
2 org
2 NA
2 ca
3 org
3 ca
3 com
3 net")
df %>%
group_by(id) %>%
summarise(count_domain = sum(!is.na(unique(domain))))
# A tibble: 3 x 2
id count_domain
<dbl> <int>
1 1 2
2 2 2
3 3 4
CodePudding user response:
For completeness, this can be achieved with the following solution in Excel (assuming that your data is stored in the range A:B
; and na should not count according to your sample data).
First, enter the following formula in, e.g., cell D2
, to create the unique set of IDs (this could also be specified more dynamically, if needed for e.g., varying lengths of the data):
=UNIQUE(A2:A13)
To compute the number of domains, enter the following formula in cell E2
:
=SUM(--(LEN(UNIQUE(FILTER(B:B,(A:A=D2)*(B:B<>"na"),"")))>0))