Loop over df and retrieve unique values linked to unique values in another column

I have subcategorized and categorized labels in Excel, but I want to make the process reproducible, so I want to convert it into R code.

I have a df containing 631 rows, of which the first 15 look like this:

   IV_label               Subcategory            Category                         
   <chr>                  <chr>                  <chr>                            
 1 light conditions       time of day            exogenous                        
 2 vital status           victim characteristics human involvement 
 3 road type              road type              exogenous                        
 4 reserve density        workload               police discretion                
 5 road type              road type              exogenous                        
 6 surface type           road type              exogenous                        
 7 surface characteristic road type              exogenous                        
 8 light conditions       time of day            exogenous                        
 9 light conditions       time of day            exogenous                        
10 weather                weather type           exogenous                        
11 weather                weather type           exogenous                        
12 weather                weather type           exogenous                        
13 day of the week        day of the week        exogenous                        
14 amount of lanes        road type              exogenous                        
15 amount of lanes        road type              exogenous

I want to be able to add the following to my R code without having to construct the lists myself:

time of day                 <- list(light conditions, ...)
victim characteristics      <- list(vital status, ...)
road type                   <- list(road type, surface type, surface characteristic, amount of lanes, ...) (# notice road type is included only once!)
workload                    <- list(reserve density, ...)
weather type                <- list(weather, ...)
day of the week             <- list(day of the week, ...)
exogenous                   <- list(time of day, road type, weather type, day of the week)
human involvement           <- list(victim characteristics)
police discretion           <- list(workload)

I understand that I will need to write this boilerplate myself:

time of day                 <- list(
victim characteristics      <- list(
road type                   <- list(
workload                    <- list(
weather type                <- list(
day of the week             <- list(
exogenous                   <- list(
human involvement           <- list(
police discretion           <- list(

But I hope to be able to copy the unique values from the console and just paste them into the boilerplate above.

CodePudding user response:

Here I treat any pair of terms that appear in the same row, in two consecutive columns, as an edge. The adjacency matrix adj keeps track of the edges, and the graph is then reconstructed as a named list:

library(purrr)

df <- data.frame(IV_label = c(
                   "light conditions","vital status","road type",
                   "reserve density","road type","surface type",
                   "surface characteristic","light conditions","light conditions",
                   "weather","weather","weather",
                   "day of the week","amount of lanes","amount of lanes"),
                 Subcategory = c(
                   "time of day","victim characteristics","road type",
                   "workload","road type","road type",
                   "road type","time of day","time of day",
                   "weather type","weather type","weather type",
                   "day of the week","road type","road type"),
                 Category = c(
                   "exogenous","human involvement","exogenous",
                   "police discretion","exogenous","exogenous",
                   "exogenous","exogenous","exogenous",
                   "exogenous","exogenous","exogenous",
                   "exogenous","exogenous","exogenous"))



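## all distinct terms across the three columns
## (df[[.x]] is used instead of pull(), which would need dplyr attached)
names <- c("IV_label", "Subcategory", "Category") |>
  purrr::map(~ df[[.x]]) |>
  purrr::reduce(union)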

## adjacency matrix
adj <- matrix(0,
              nrow = length(names),
              ncol = length(names),
              dimnames = list(names, names))

## edges: Subcategory -> IV_label, then Category -> Subcategory
adj[cbind(df$Subcategory, df$IV_label)] <- 1
adj[cbind(df$Category, df$Subcategory)] <- 1

setNames(asplit(adj, 1),names) |>
  purrr::map(~names[which(.x == 1)]) |>
  purrr::keep(~length(.x) > 0)

Output:

$`road type`
[1] "road type"              "surface type"           "surface characteristic"
[4] "amount of lanes"       

$`day of the week`
[1] "day of the week"

$`time of day`
[1] "light conditions"

$`victim characteristics`
[1] "vital status"

$workload
[1] "reserve density"

$`weather type`
[1] "weather"

$exogenous
[1] "road type"       "day of the week" "time of day"     "weather type"   

$`human involvement`
[1] "victim characteristics"

$`police discretion`
[1] "workload"
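The two assignments into adj rely on R's support for indexing a matrix with a two-column character matrix, where each row names one (row, column) cell. A minimal sketch of that mechanism on a toy matrix follows; the node names a, b, c and the variables m and idx are made up for illustration and are not part of the data above:

```r
# Toy adjacency matrix with named rows and columns (names are illustrative).
nodes <- c("a", "b", "c")
m <- matrix(0, nrow = 3, ncol = 3, dimnames = list(nodes, nodes))

# Each row of idx is one (row name, column name) pair, i.e. one edge.
idx <- cbind(c("a", "a", "c"),
             c("b", "c", "c"))

# Sets m["a", "b"], m["a", "c"] and m["c", "c"] in a single assignment.
m[idx] <- 1

m
```

This only works when cbind() receives plain character vectors. If df is a tibble, as the printout in the question suggests, selecting the columns by name or with [[ ]] should yield plain vectors for both data frames and tibbles, so that spelling is probably the safer one.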

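You may want to zero out the diagonal of adj to avoid self-referencing edges. Note that an entry whose only member is itself, such as day of the week, then drops out of the list entirely: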

adj[row(adj) == col(adj)] <- 0

setNames(asplit(adj, 1),names) |>
  purrr::map(~names[which(.x == 1)]) |>
  purrr::keep(~length(.x) > 0)

Output:

$`road type`
[1] "surface type"           "surface characteristic" "amount of lanes"       

$`time of day`
[1] "light conditions"

$`victim characteristics`
[1] "vital status"

$workload
[1] "reserve density"

$`weather type`
[1] "weather"

$exogenous
[1] "road type"       "day of the week" "time of day"     "weather type"   

$`human involvement`
[1] "victim characteristics"

$`police discretion`
[1] "workload"
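As a possible alternative to the adjacency matrix, the same grouping can be sketched in base R with split() and unique(), and the result can be printed as assignment lines that are ready to paste. This is only a sketch under a few assumptions: the helper name children and the variables groups and lines are hypothetical, self-references are dropped (as in the second output above), and no Subcategory name is also used as a Category name.

```r
# Same 15 example rows as above, rebuilt so the sketch runs on its own.
df <- data.frame(
  IV_label = c("light conditions", "vital status", "road type",
               "reserve density", "road type", "surface type",
               "surface characteristic", "light conditions", "light conditions",
               "weather", "weather", "weather",
               "day of the week", "amount of lanes", "amount of lanes"),
  Subcategory = c("time of day", "victim characteristics", "road type",
                  "workload", "road type", "road type",
                  "road type", "time of day", "time of day",
                  "weather type", "weather type", "weather type",
                  "day of the week", "road type", "road type"),
  Category = c("exogenous", "human involvement", "exogenous",
               "police discretion", "exogenous", "exogenous",
               "exogenous", "exogenous", "exogenous",
               "exogenous", "exogenous", "exogenous",
               "exogenous", "exogenous", "exogenous")
)

# Hypothetical helper: unique children per parent, self-references dropped.
children <- function(child, parent) {
  keep <- child != parent
  lapply(split(child[keep], parent[keep]), unique)
}

# Subcategory -> IV_label groups, followed by Category -> Subcategory groups.
groups <- c(children(df$IV_label, df$Subcategory),
            children(df$Subcategory, df$Category))

# Build one assignment per group; backticks allow names containing spaces.
lines <- sprintf(
  "`%s` <- list(%s)",
  names(groups),
  vapply(groups,
         function(x) paste(sprintf('"%s"', x), collapse = ", "),
         character(1))
)
cat(lines, sep = "\n")
```

The printed lines can be copied from the console into a script. To keep self-references instead (so that road type lists itself once), the keep filter inside children could be removed.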