Home > other >  Adding count of children per parent in single data frame and count of all descendants
Adding count of children per parent in single data frame and count of all descendants

Time:12-03

Have a table of employees (key emp id) with one column being the boss (same key format). I was able to build out the hierarchy by doing repeated joins (took 10 for my real organization), but now I need to add a column for:

  • employee_count (count of rows with the boss_id = emp_id of current row)
  • empire_count (count of rows with the hierarchy starting the same way)

I'm most comfortable in the Tidyverse, here's some simplified fake data of what I have so far as example:

library(tidyverse)

employees = tibble(
  emp_id = c(1,2,3,4,5,6,7),
  emp_name = c('BigBoss','MedBoss','MedBoss2','Emp1','Emp2','Emp3','Emp4'),
  boss_name = c('','BigBoss','BigBoss','MedBoss','MedBoss','MedBoss2','MedBoss2'),
  hierarchy = c('','BigBoss','BigBoss','BigBoss>MedBoss','BigBoss>MedBoss','BigBoss>MedBoss2','BigBoss>MedBoss2')
)

Which looks like this:

# A tibble: 7 × 4
  emp_id emp_name boss_name  hierarchy         
   <dbl> <chr>    <chr>      <chr>             
1      1 BigBoss  ""         ""                
2      2 MedBoss  "Bigboss"  "BigBoss"         
3      3 MedBoss2 "BigBoss"  "BigBoss"         
4      4 Emp1     "MedBoss"  "BigBoss>MedBoss" 
5      5 Emp2     "MedBoss"  "BigBoss>MedBoss" 
6      6 Emp3     "MedBoss2" "BigBoss>MedBoss2"
7      7 Emp4     "MedBoss2" "BigBoss>MedBoss2"

As far as what I'm looking for, employee_count should be 2 for each of the MedBosses and BigBoss, and then the empire_count for BigBoss would be 6.

For the employee_count piece, I could separately do a:

data %>% group_by(boss_name) %>% summarize(employee_count=n(emp_id))

and then join it back, but then the hierarchy wouldn't work the same way... I think the answer is some map function from purrr or creating a function, and Vectorize()'ing it and calling within a mutate, but that hasn't worked for me.

This is as close as I can get...

# Function to get the count of employees for a boss_name
get_employee_count = function(table,bossname) table %>% filter(boss_name==bossname) %>% nrow()

# This call works (returns 2)
get_employee_count(employees,'BigBoss')

# Try to add the count in via a mutate (returns 7 for each)
employees %>% 
  mutate(employee_count=get_employee_count(.,boss_name))

If I can get that to work, I think I could figure out the harder piece as I could do it also as a function.

CodePudding user response:

After a lot of trial and error, I can do it if I hard-code the dataframe name, which is close enough for me.

# This is the solution
employees %>% 
  mutate(employee_count = map_int(emp_name,function(name) employees %>% filter(boss_name==name) %>% nrow())) %>% 
  mutate(empire_count = map_int(emp_name,function(name) employees %>% filter(str_detect(hierarchy,name)) %>% nrow()))

CodePudding user response:

Definitely a graph/tree problem. For employee count, just find all the adjacent nodes. For empire_count, enumerate all the simple paths and count the unique nodes. All easy enough with igraph

library(igraph)

empTree <- graph_from_data_frame(
  employees |> 
    filter(boss_name != "") |> 
    select(from = emp_name, to = boss_name)
)

Count_Empire <- function(node) {
  
  paths <- all_simple_paths(empTree, node, mode = "in") 
  
  if(length(paths) == 0) return(0)
  
  ct <- paths |> 
    map(~as.vector(.x)) |> 
    reduce(c) |> 
    unique() |> 
    length()
  
  # minus 1 for self 
  ct - 1
}

Count_Employees <- function(node) {
  
  adjacent_vertices(empTree, node, mode = "in")[[1]] |> 
    as.vector() |> 
    length()
}

employees |> 
  mutate(
    employee_count = map_dbl(emp_name, Count_Employees),
    empire_count = map_dbl(emp_name, Count_Empire)
  )
  • Related