Cross-tab in R with data.tables-CodePudding

Sorry if this question has been asked, I played with my toy data to learn to manipulate data.tables. My goal was from this data:

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                        to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

to arrive at this result:

final_matrix
     L    A    B    C    D    E    F
1:   A    3    1    2 <NA>    1 <NA>
2:   B    1    0 <NA> <NA> <NA> <NA>
3:   C    2 <NA>    0    1    1 <NA>
4:   D <NA> <NA>    1    0 <NA> <NA>
5:   E    1 <NA>    1 <NA>    1    1
6:   F <NA> <NA> <NA> <NA>    1    0
7: tot    7    1    4    1    4    1

(eventually also with zeros instead of NAs, but got bored). I suppose in STATA this would be an easy cross-tab, I have built a function then looped over the unique values in the cols (sigh :/) merged the tables and then added a final line with the totals. Now although I've learned a lot, I wonder what would the clean R way to obtain such cross-tabs be? since the following doesn't work:

table(toy_data$from,toy_data$to)
   
    A B C D E F
  A 3 1 1 0 1 0
  C 1 0 0 1 0 0
  E 0 0 1 0 1 1

Thanks. My function if you have general improvements or best practices I am super happy:

create_edge_cols<- function(dt,column){
  #this function takes a df and a column, 
  #computes the number of edges among this column and all the other in dt
  #returns a column (list) with the cross-tabulation of columns
  tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
  print(tot_edges_i)
  # now tabulate links of column
  tab = data.table(table(unlist(dt[(from==column&to!=column)|
                                           (from!=column&to==column)])))
  setnames(tab, "V1", "L")
  setnames(tab, "N", column)
  setorder(tab,"L")
  tab[L==column,column] = length(dt[to==column & to == from,from])
  #tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
  return(tab)
}

#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))

#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
  info_column = copy(create_edge_cols(toy_data,col))
  first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}

## function to set first row as name
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1,]))
  df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row

# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix

CodePudding user response：

Here is a way. What is missing in the question's table statement are factor levels, table is only processing what is in the data. Coerce the columns to factors with the same levels and assign NA to counts equal to zero.

There is also a print issue, see the final two instructions. The default for S# class "table" method print is not to print NA's. This can be changed manually.

library(data.table)

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                      to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
is.na(tbl) <- tbl == 0
tbl
#>     to
#> from  A  B  C  D  E  F
#>    A  3  1  1     1   
#>    B                  
#>    C  1        1      
#>    D                  
#>    E        1     1  1
#>    F

print(tbl, na.print = NA)
#>     to
#> from    A    B    C    D    E    F
#>    A    3    1    1 <NA>    1 <NA>
#>    B <NA> <NA> <NA> <NA> <NA> <NA>
#>    C    1 <NA> <NA>    1 <NA> <NA>
#>    D <NA> <NA> <NA> <NA> <NA> <NA>
#>    E <NA> <NA>    1 <NA>    1    1
#>    F <NA> <NA> <NA> <NA> <NA> <NA>

^{Created on 2022-03-28 by the reprex package (v2.0.1)}

CodePudding user response：

You could try using igraph which looks like it may give a similar result.

This uses na.omit to remove rows containing NA. The graph_from_data_frame will generate a graph (non-directed).

as_adjacency_matrix will create an adjacency matrix, using "both" for the type argument for symmetric matrix end result.

The final statement uses order to sort the rows and columns, and addmargins to include the sums for each column.

library(igraph)

g <- graph_from_data_frame(na.omit(toy_data), directed = F)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)

Output

    A B C D E F
A   3 1 2 0 1 0
B   1 0 0 0 0 0
C   2 0 0 1 1 0
D   0 0 1 0 0 0
E   1 0 1 0 1 1
F   0 0 0 0 1 0
Sum 7 1 4 1 4 1