How to find intersection of values that meet multiple conditions in R?

Time:12-04

Assume I have the following data frame:

df <- data.frame(year=c(2010,2011,2012,2010,2011,2010,2011,2012), company = c("a","a","a","b","b","c","c","c"))

  year company
1 2010       a
2 2011       a
3 2012       a
4 2010       b
5 2011       b
6 2010       c
7 2011       c
8 2012       c

I want to find the companies that are present in all three years. One cumbersome approach would be:

library(dplyr)

companies_2010 <- df %>% filter(year==2010) %>% select(company)
companies_2011 <- df %>% filter(year==2011) %>% select(company)
companies_2012 <- df %>% filter(year==2012) %>% select(company)

companies <- intersect(companies_2010, companies_2011) %>% intersect(., companies_2012)

  company
1       a
2       c

Is there any more elegant way to do this?

CodePudding user response:

Since each company appears at most once per year and every year in the data belongs to the desired set, we only have to count the rows for each company. (If that does not hold in general, apply the solutions below to df2 <- unique(merge(df, data.frame(year = 2010:2012))) in place of df.) Also, if the value 3 were not known in advance and should equal the number of unique years in the data, it could be computed with length(unique(df$year)).

Using that idea, here are several alternatives. We can use table to get the frequencies and keep the companies with frequency 3, or, as in the last example, use dplyr's count and then filter for a count of 3.

tab <- table(df$company)
names(tab)[tab == 3]
## [1] "a" "c"

names(Filter(function(x) x == 3, table(df$company)))
## [1] "a" "c"

library(dplyr)   
df %>%
  count(company) %>%
  filter(n == 3) %>%
  select(company)
##   company
## 1       a
## 2       c

To use the intersect idea from the question, split company by year and then use Reduce to apply intersect repeatedly:

 with(df, Reduce(intersect, split(company, year)))
 ## [1] "a" "c"
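
The caveats in the first paragraph of this answer (duplicate rows, years outside the desired set, an unknown number of years) can be sketched as follows. This is only an illustration: df_messy and target_years are made-up names, and the filtering uses %in% rather than the merge call mentioned above.

```r
df <- data.frame(
  year    = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012),
  company = c("a", "a", "a", "b", "b", "c", "c", "c")
)

# Hypothetical messy input: "b" gains a year outside the target set,
# and "a" gains a duplicated (year, company) row.
df_messy <- rbind(df, data.frame(year = c(2013, 2010), company = c("b", "a")))

target_years <- 2010:2012

# Keep only the target years and drop duplicated (year, company) pairs
df2 <- unique(df_messy[df_messy$year %in% target_years, c("year", "company")])

# Count rows per company and compare against the number of target years
tab <- table(df2$company)
result <- names(tab)[tab == length(target_years)]
result
## [1] "a" "c"
```

Counting the rows of df_messy directly would pick "b" and "c" here, which is why the cleaning step matters once the assumptions no longer hold.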

CodePudding user response:

Here is a base R solution with ave and unique.

n <- with(df, ave(year, company, FUN = length))
unique(df$company[n == 3])
#[1] "a" "c"

CodePudding user response:

Expanding on the answers given, this works more generally, for every consecutive year between the minimum and maximum year in the data:

df %>%
  mutate(rn = list(seq(min(year), max(year))))%>%
  group_by(company) %>%
  summarise(rn = all(unlist(rn) %in% year)) %>%
  filter(rn) %>%
  select(company)

# A tibble: 2 x 1
  company
  <chr>  
1 a      
2 c  

CodePudding user response:

This won't work in general to compute arbitrary intersections, but I think it does what you specified above:

(df 
   %>% group_by(company)
   %>% filter(all(2010:2012 %in% year))
   %>% select(company)
   %>% distinct()
)
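
If the set of years should not be hard-coded, the same filter can be wrapped in a small function. A minimal sketch, assuming the column names year and company from the question; companies_in_all is a hypothetical helper name, not part of dplyr:

```r
library(dplyr)

df <- data.frame(
  year    = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012),
  company = c("a", "a", "a", "b", "b", "c", "c", "c")
)

# Hypothetical helper: companies that appear in every year listed in `years`
companies_in_all <- function(data, years) {
  data %>%
    group_by(company) %>%
    filter(all(years %in% year)) %>%  # TRUE only if the group covers all years
    ungroup() %>%
    distinct(company) %>%
    pull(company)
}

all_three <- companies_in_all(df, 2010:2012)
all_three
## [1] "a" "c"

first_two <- companies_in_all(df, c(2010, 2011))
first_two
## [1] "a" "b" "c"
```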

CodePudding user response:

Just nest and reduce:

df <- data.frame(year=c(2010,2011,2012,2010,2011,2010,2011,2012), company = c("a","a","a","b","b","c","c","c"))
df %>% 
    tidyr::nest(data = -year) %>% 
    magrittr::use_series(data) %>% 
    purrr::reduce(dplyr::intersect)
# A tibble: 2 x 1
  company
  <chr>  
1 a      
2 c 

Or split-map-reduce:

split.data.frame(df, df$year) %>% 
    purrr::map(magrittr::use_series, company) %>% 
    purrr::reduce(dplyr::intersect)
## [1] "a" "c"

CodePudding user response:

Since you have received some excellent answers from a number of inspirational contributors so far, I thought an unusual approach with a little bit of imagination might not do harm now that my friend Thomas is not here.

I visualized your problem as a bipartite graph with two distinct sets of nodes, one for the company names and one for the years, with no connections among companies or among years.

library(igraph)

# Create a graph object, swapping the columns of your data set first
df[, c(2, 1)] |>
  graph_from_data_frame() -> g

# Then we create a type object to distinguish between 2 sets of nodes, Type FALSE
# refers to company name and type TRUE refers to years
V(g)$type <- bipartite.mapping(g)$type

# Then we extract the nodes of type FALSE whose degree equals 3
V(g)[degree(g, V(g)) == 3 & V(g)$type == FALSE]
## 2/6 vertices, named, from c172916:
## [1] a c

In case you would like to know what the graph looks like:

plot(g,
     vertex.color = ifelse(V(g)$type, "lightblue", "salmon"),
     vertex.shape = ifelse(V(g)$type, "circle", "square"),
     vertex.size = 25,
     edge.color = "grey",
     layout = layout.bipartite)
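
The same bipartite picture can also be written as a company-by-year incidence matrix in base R, where a company's degree is the number of TRUE entries in its row. This is a sketch of an equivalent formulation, not part of the igraph approach above; inc and full are illustrative names:

```r
df <- data.frame(
  year    = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012),
  company = c("a", "a", "a", "b", "b", "c", "c", "c")
)

# Incidence matrix: rows are companies, columns are years,
# TRUE where the (company, year) pair occurs at least once
inc <- unclass(table(df$company, df$year)) > 0

# A company present in every year has as many TRUE entries as there are columns
full <- rownames(inc)[rowSums(inc) == ncol(inc)]
full
## [1] "a" "c"
```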

