Coefficients for large dimension matrices/data frames

Time:02-20

I have a data frame of 70 (rows) x 64000 (cols).

I want to find the correlations between columns and rows for my data frame and sort them based on their absolute value. But when I use the coef() function I get NULL:

> coef(expressions70)
NULL

Is there any way to get coefficients similar to the pairs.panels() output from the psych package? Or another way to show the coefficients?

CodePudding user response:

No, pairs.panels() does not create an object. If you assign its result to a variable, you'll see that the value is NULL. coef() returns NULL for a different reason: it extracts coefficients from fitted model objects, and a data frame is not one. However, there are still several ways to get at the correlations. (Although, considering Anscombe's quartet, I would suggest that you don't take the rho value at face value.)
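To illustrate the caveat about rho, here is a small sketch using the anscombe data set that ships with base R (the datasets package). The variable name rhos is only illustrative. The four x/y pairs look very different when plotted, yet their Pearson correlations are nearly identical.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 in the built-in `anscombe` data
rhos <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(rhos) <- paste0("set", 1:4)
round(rhos, 3)   # all four are about 0.816

# Plotting shows how different the four relationships really are
# (uncomment to draw):
# op <- par(mfrow = c(2, 2))
# for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
# par(op)
```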

Both options below return a named vector: each name is the pair of correlated fields, and each value is the correlation coefficient (rho) for that pair.

If all your fields are numeric or if you know exactly what columns are numeric, use this first option. If you have dates, character, and factor fields mixed in the columns, then use the second option.

First option:

library(funModeling)
library(tidyverse)
library(RcppAlgos)

# create all pairwise combinations of the four numeric columns
# (repetition = FALSE, so no column is paired with itself)
tellMe <- comboGeneral(names(iris[, 1:4]),
                       2, repetition = FALSE) %>%
  as.data.frame()
# correlate each pair and name the result "colA-colB"
showMe <- map(1:nrow(tellMe),
              ~setNames(
                cor(iris[, tellMe[.x, 1]],
                    iris[, tellMe[.x, 2]],
                    use = "everything", method = "pearson"),
                paste0(tellMe[.x, ], collapse = "-"))
              ) %>%
  unlist()
# sort by absolute value, largest first
showMe <- showMe[order(abs(showMe), decreasing = TRUE)]
showMe
#  Petal.Length-Petal.Width Sepal.Length-Petal.Length 
#                 0.9628654                 0.8717538 
#  Sepal.Length-Petal.Width  Sepal.Width-Petal.Length 
#                 0.8179411                -0.4284401 
#   Sepal.Width-Petal.Width  Sepal.Length-Sepal.Width 
#                -0.3661259                -0.1175698 
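For a wide table like the 70 x 64000 one in the question, looping over every pair gets slow. A possible alternative, sketched here under the assumption that all selected columns are numeric, is to call cor() once on the whole table and then flatten the upper triangle of the resulting matrix. The names num, cm, idx, rho, rho_sorted and row_cm are illustrative, and iris stands in for the real data.

```r
# Sketch: correlate every pair of numeric columns in one call, then
# rank the pairs by absolute correlation.
num <- iris[, 1:4]

# Full correlation matrix (p x p); rows of `num` are the observations
cm <- cor(num, use = "everything", method = "pearson")

# Keep each unordered pair once: the strict upper triangle drops the
# diagonal (self-correlations) and the mirrored lower half
idx <- which(upper.tri(cm), arr.ind = TRUE)
rho <- setNames(cm[idx],
                paste(rownames(cm)[idx[, "row"]],
                      colnames(cm)[idx[, "col"]], sep = "-"))

# Sort by magnitude, keeping the sign visible
rho_sorted <- rho[order(abs(rho), decreasing = TRUE)]
rho_sorted

# Correlations between rows instead of columns: transpose first
row_cm <- cor(t(as.matrix(num)))
```

Note that with 64000 columns the full column-wise matrix has 64000^2 entries, roughly 30 GB as doubles, so in practice this would probably have to be done on blocks of columns. The row-wise matrix for 70 rows is only 70 x 70.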

Second option:

This starts with identifying which fields are either integer or numeric, then follows the same path as the first.

I have to say, I started with select(where()), but where() is not exported from tidyselect now (it is only reachable with :::), so I went with an alternative method. If this doesn't mean anything to you, just ignore this comment.

# if some variables are not numeric...
# ('where' is apparently no longer exported from tidyselect)
fields <- df_status(iris) %>% 
  filter(type == "integer" | type == "numeric") %>% 
  select(variable) %>% 
  unlist(use.names = F)
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  

# find all possible combinations (with no repeats)
giveMe <- comboGeneral(fields, 2, repetition = FALSE) %>%
  as.data.frame()

# correlate each pair and name the result "colA-colB"
itsShown <- map(1:nrow(giveMe),
                ~setNames(
                  cor(iris[, giveMe[.x, 1]],
                      iris[, giveMe[.x, 2]],
                      use = "everything", method = "pearson"),
                  paste0(giveMe[.x, ], collapse = "-"))
                ) %>%
  unlist()
# sort by absolute value, largest first
itsShown <- itsShown[order(abs(itsShown), decreasing = TRUE)]
itsShown
#  Petal.Length-Petal.Width Sepal.Length-Petal.Length 
#                 0.9628654                 0.8717538 
#  Sepal.Length-Petal.Width  Sepal.Width-Petal.Length 
#                 0.8179411                -0.4284401 
#   Sepal.Width-Petal.Width  Sepal.Length-Sepal.Width 
#                -0.3661259                -0.1175698 
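
The numeric-column selection in the second option can also be sketched with base R alone, which avoids the where() question entirely. This is only an alternative sketch, not what the code above does; is_num and fields_base are illustrative names.

```r
# Sketch: pick out numeric (integer or double) columns with base R only.
# is.numeric() is FALSE for factor, character and Date columns.
is_num <- vapply(iris, is.numeric, logical(1))
fields_base <- names(iris)[is_num]
fields_base
```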