Home > Net >  Should I be using unnest_wider and rowMeans to get the average of a list column?
Should I be using unnest_wider and rowMeans to get the average of a list column?

Time:11-30

I have a simple data set. The row names are a meaningful index and column 1 has a list of values. What I eventually want is the average of that list for each row name.

What it looks like now:

row name years
108457 [1200, 1200, 1540, 1890]
237021 [1600, 1270, 1270]

What I eventually want it to look like:

row name years
108457 mean of list
237021 mean of list

Currently, I'm trying to use unnest_wider(years). My plan is to then afterwards use rowMeans() to find the mean of the unnested row. I can then merge the row name and average value with my main data set, so I'm not too concerned with deleting the new columns.

However, this whole process is taking a while and I'm having some issues with unnest_wider. Currently, when I try:

unnest_wider(dataset, colname)

I get the following error:

Error in as_indices_impl(): ! Must subset columns with a valid subscript vector. ✖ Subscript has the wrong type data.frame<years:list>. ℹ It must be numeric or character.

When I try:

unnest_wider(colname)

My computer just runs endlessly and it looks like it's counting... it doesn't stop and I have to quit the application to terminate processing.

I had previously tried to directly apply rowMeans, use mean(df$ColName), and use apply(ColName, mean).

I wonder if there's a more efficient way?

It may be that I shouldn't have created the list in the first place. It looks like it does now because I converted it from this format:

Column A Column B
108457 1200
108457 1200
108457 1540
237021 1600
108457 1890
237021 1270

I converted it using pivot_wider and then as.data.frame.(t(dataset))

Should I have tried to get the averages directly from this format? If so, how would I do that?

CodePudding user response:

For your vectors in each row, you can use sapply to iterate over each row to calculate the mean, then just return the mean for each row name.

df$years <- sapply(df$years, mean, na.rm = TRUE)

Output

        years
108457 1457.5
237021 1380.0

Data

df <- structure(list(years = structure(list(c(1200, 1200, 1540, 1890
), c(1600, 1270, 1270)), class = "AsIs")), row.names = c("108457", 
"237021"), class = "data.frame")

Or we can use data.table to get the mean, if your data look like the latter, long format dataset.

library(data.table)

as.data.table(df2)[, list(ColumnB = mean(ColumnB)), by = ColumnA]

Output

   ColumnA ColumnB
1:  108457  1457.5
2:  237021  1435.0

Data

df2 <- structure(list(ColumnA = c(108457L, 108457L, 108457L, 237021L, 
108457L, 237021L), ColumnB = c(1200L, 1200L, 1540L, 1600L, 1890L, 
1270L)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

If your original data look like they do in the latter table, you could simply find the mean by group based on ColumnA:

Data

df <- read.table(text = "ColumnA    ColumnB
108457  1200
108457  1200
108457  1540
237021  1600
108457  1890
237021  1270", header = TRUE)

Base R

aggregate(df$ColumnB, list(df$ColumnA), FUN=mean) 

# Group.1      x
# 1  108457 1457.5
# 2  237021 1435.0

Dplyr

library(dplyr)
df %>% 
  group_by(ColumnA) %>%
  summarise(mean_years = mean(ColumnB))

#  ColumnA mean_years
#    <int>      <dbl>
#1  108457      1458.
#2  237021      1435 
  • Related