Should I be using unnest_wider and rowMeans to get the average of a list column?-CodePudding

I have a simple data set. The row names are a meaningful index and column 1 has a list of values. What I eventually want is the average of that list for each row name.

What it looks like now:

row name	years
108457	[1200, 1200, 1540, 1890]
237021	[1600, 1270, 1270]

What I eventually want it to look like:

row name	years
108457	mean of list
237021	mean of list

Currently, I'm trying to use unnest_wider(years). My plan is to then afterwards use rowMeans() to find the mean of the unnested row. I can then merge the row name and average value with my main data set, so I'm not too concerned with deleting the new columns.

However, this whole process is taking a while and I'm having some issues with unnest_wider. Currently, when I try:

unnest_wider(dataset, colname)

I get the following error:

Error in as_indices_impl(): ! Must subset columns with a valid subscript vector. ✖ Subscript has the wrong type data.frame<years:list>. ℹ It must be numeric or character.

When I try:

unnest_wider(colname)

My computer just runs endlessly and it looks like it's counting... it doesn't stop and I have to quit the application to terminate processing.

I had previously tried to directly apply rowMeans, use mean(df$ColName), and use apply(ColName, mean).

I wonder if there's a more efficient way?

It may be that I shouldn't have created the list in the first place. It looks like it does now because I converted it from this format:

Column A	Column B
108457	1200
108457	1200
108457	1540
237021	1600
108457	1890
237021	1270

I converted it using pivot_wider and then as.data.frame.(t(dataset))

Should I have tried to get the averages directly from this format? If so, how would I do that?

CodePudding user response：

For your vectors in each row, you can use sapply to iterate over each row to calculate the mean, then just return the mean for each row name.

df$years <- sapply(df$years, mean, na.rm = TRUE)

Output

        years
108457 1457.5
237021 1380.0

Data

df <- structure(list(years = structure(list(c(1200, 1200, 1540, 1890
), c(1600, 1270, 1270)), class = "AsIs")), row.names = c("108457", 
"237021"), class = "data.frame")

Or we can use data.table to get the mean, if your data look like the latter, long format dataset.

library(data.table)

as.data.table(df2)[, list(ColumnB = mean(ColumnB)), by = ColumnA]

Output

   ColumnA ColumnB
1:  108457  1457.5
2:  237021  1435.0

Data

df2 <- structure(list(ColumnA = c(108457L, 108457L, 108457L, 237021L, 
108457L, 237021L), ColumnB = c(1200L, 1200L, 1540L, 1600L, 1890L, 
1270L)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response：

If your original data look like they do in the latter table, you could simply find the mean by group based on ColumnA:

Data

df <- read.table(text = "ColumnA    ColumnB
108457  1200
108457  1200
108457  1540
237021  1600
108457  1890
237021  1270", header = TRUE)

Base R

aggregate(df$ColumnB, list(df$ColumnA), FUN=mean) 

# Group.1      x
# 1  108457 1457.5
# 2  237021 1435.0

Dplyr

library(dplyr)
df %>% 
  group_by(ColumnA) %>%
  summarise(mean_years = mean(ColumnB))

#  ColumnA mean_years
#    <int>      <dbl>
#1  108457      1458.
#2  237021      1435