I have a simple data set. The row names are a meaningful index and column 1 has a list of values. What I eventually want is the average of that list for each row name.
What it looks like now:
row name | years |
---|---|
108457 | [1200, 1200, 1540, 1890] |
237021 | [1600, 1270, 1270] |
What I eventually want it to look like:
row name | years |
---|---|
108457 | mean of list |
237021 | mean of list |
Currently, I'm trying to use unnest_wider(years)
. My plan is to then afterwards use rowMeans()
to find the mean of the unnested row. I can then merge the row name and average value with my main data set, so I'm not too concerned with deleting the new columns.
However, this whole process is taking a while and I'm having some issues with unnest_wider
. Currently, when I try:
unnest_wider(dataset, colname)
I get the following error:
Error in
as_indices_impl()
: ! Must subset columns with a valid subscript vector. ✖ Subscript has the wrong typedata.frame<years:list>
. ℹ It must be numeric or character.
When I try:
unnest_wider(colname)
My computer just runs endlessly and it looks like it's counting... it doesn't stop and I have to quit the application to terminate processing.
I had previously tried to directly apply rowMeans
, use mean(df$ColName)
, and use apply(ColName, mean)
.
I wonder if there's a more efficient way?
It may be that I shouldn't have created the list in the first place. It looks like it does now because I converted it from this format:
Column A | Column B |
---|---|
108457 | 1200 |
108457 | 1200 |
108457 | 1540 |
237021 | 1600 |
108457 | 1890 |
237021 | 1270 |
I converted it using pivot_wide
r and then as.data.frame.(t(dataset))
Should I have tried to get the averages directly from this format? If so, how would I do that?
CodePudding user response:
For your vectors in each row, you can use sapply
to iterate over each row to calculate the mean, then just return the mean for each row name.
df$years <- sapply(df$years, mean, na.rm = TRUE)
Output
years
108457 1457.5
237021 1380.0
Data
df <- structure(list(years = structure(list(c(1200, 1200, 1540, 1890
), c(1600, 1270, 1270)), class = "AsIs")), row.names = c("108457",
"237021"), class = "data.frame")
Or we can use data.table
to get the mean, if your data look like the latter, long format dataset.
library(data.table)
as.data.table(df2)[, list(ColumnB = mean(ColumnB)), by = ColumnA]
Output
ColumnA ColumnB
1: 108457 1457.5
2: 237021 1435.0
Data
df2 <- structure(list(ColumnA = c(108457L, 108457L, 108457L, 237021L,
108457L, 237021L), ColumnB = c(1200L, 1200L, 1540L, 1600L, 1890L,
1270L)), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
If your original data look like they do in the latter table, you could simply find the mean by group based on ColumnA
:
Data
df <- read.table(text = "ColumnA ColumnB
108457 1200
108457 1200
108457 1540
237021 1600
108457 1890
237021 1270", header = TRUE)
Base R
aggregate(df$ColumnB, list(df$ColumnA), FUN=mean)
# Group.1 x
# 1 108457 1457.5
# 2 237021 1435.0
Dplyr
library(dplyr)
df %>%
group_by(ColumnA) %>%
summarise(mean_years = mean(ColumnB))
# ColumnA mean_years
# <int> <dbl>
#1 108457 1458.
#2 237021 1435