Sorry if this question has already been asked. I have been playing with toy data to learn how to manipulate data.tables. My goal was to get from this data:
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
to arrive at this result:
final_matrix
L A B C D E F
1: A 3 1 2 <NA> 1 <NA>
2: B 1 0 <NA> <NA> <NA> <NA>
3: C 2 <NA> 0 1 1 <NA>
4: D <NA> <NA> 1 0 <NA> <NA>
5: E 1 <NA> 1 <NA> 1 1
6: F <NA> <NA> <NA> <NA> 1 0
7: tot 7 1 4 1 4 1
(Eventually I also wanted zeros instead of NAs, but I got bored.) I suppose in Stata this would be an easy cross-tab. I built a function, looped over the unique values in the columns (sigh :/), merged the tables, and then added a final row with the totals. Although I've learned a lot, I wonder what the clean R way to obtain such cross-tabs would be, since the following doesn't give the result I want:
table(toy_data$from,toy_data$to)
A B C D E F
A 3 1 1 0 1 0
C 1 0 0 1 0 0
E 0 0 1 0 1 1
Thanks. Here is my function; if you have general improvements or best practices to suggest, I would be super happy:
create_edge_cols<- function(dt,column){
#this function takes a df and a column,
#computes the number of edges among this column and all the other in dt
#returns a column (list) with the cross-tabulation of columns
tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
print(tot_edges_i)
# now tabulate links of column
tab = data.table(table(unlist(dt[(from==column&to!=column)|
(from!=column&to==column)])))
setnames(tab, "V1", "L")
setnames(tab, "N", column)
setorder(tab,"L")
tab[L==column,column] = length(dt[to==column & to == from,from])
#tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
return(tab)
}
#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))
#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
info_column = copy(create_edge_cols(toy_data,col))
first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}
## function to set first row as name
header.true <- function(df) {
names(df) <- as.character(unlist(df[1,]))
df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row
# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix
Answer:
Here is a way. What is missing in the question's table statement are factor levels; table only processes what is in the data. Coerce the columns to factors with the same levels and assign NA to counts equal to zero.
There is also a print issue, see the final two instructions. The default for the print method of S3 class "table" is not to print NAs. This can be changed manually.
library(data.table)
toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
levels <- sort(unique(unlist(toy_data)))
levels <- levels[!is.na(levels)]
toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
tbl <- table(toy_data)
is.na(tbl) <- tbl == 0
tbl
#>     to
#> from A B C D E F
#>    A 3 1 1   1
#>    B
#>    C 1     1
#>    D
#>    E     1   1 1
#>    F
print(tbl, na.print = NA)
#> to
#> from A B C D E F
#> A 3 1 1 <NA> 1 <NA>
#> B <NA> <NA> <NA> <NA> <NA> <NA>
#> C 1 <NA> <NA> 1 <NA> <NA>
#> D <NA> <NA> <NA> <NA> <NA> <NA>
#> E <NA> <NA> 1 <NA> 1 1
#> F <NA> <NA> <NA> <NA> <NA> <NA>
Created on 2022-03-28 by the reprex package (v2.0.1)
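The table above is directed (from in the rows, to in the columns), whereas the target in the question counts a link between two values regardless of direction and adds a row of totals. A minimal base R sketch of that extra step, assuming that self-links such as A to A should be counted once; the names lv, m, sym and res are only illustrative:

```r
# Toy data from the question
from <- c("A","A","A","C","E","E","A","A","A","C","E","E")
to   <- c("B","C","A","D","F","E","E","A","A","A","C",NA)

# Shared factor levels so that the table is square; sort() drops the NA
lv <- sort(unique(c(from, to)))

# Directed counts as a plain integer matrix; the pair with NA in `to` is dropped
m <- unclass(table(from = factor(from, levels = lv),
                   to   = factor(to,   levels = lv)))

# Symmetrize: add the counts of both directions ...
sym <- m + t(m)
# ... but keep self-links (the diagonal) counted once (assumption)
diag(sym) <- diag(m)

# Append a row with the column sums
res <- addmargins(sym, 1)
res
```

Zeros can then be turned into NA with is.na(res) <- res == 0 if that layout is preferred.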
Answer:
You could try using igraph, which looks like it may give a similar result.
This uses na.omit to remove rows containing NA. graph_from_data_frame will generate an undirected graph (because of directed = FALSE). as_adjacency_matrix will create an adjacency matrix, using "both" for the type argument so that the end result is a symmetric matrix.
The final statement uses order to sort the rows and columns, and addmargins to include the sums for each column.
library(igraph)
g <- graph_from_data_frame(na.omit(toy_data), directed = FALSE)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)
Output
A B C D E F
A 3 1 2 0 1 0
B 1 0 0 0 0 0
C 2 0 0 1 1 0
D 0 0 1 0 0 0
E 1 0 1 0 1 1
F 0 0 0 0 1 0
Sum 7 1 4 1 4 1