ACCOUNT = c(M205109, M205109, M201212, M205668, M207954, M208966, M203465, M207622, M201869, M201869)
age = c(20, 20, 18, 29, 21, 19, 19, 23, 22, 22)
The code I am using
library(tidyverse)
library(data.table)
library(dtplyr)
library(lubridate)
age_summary_all <- data %>%
distinct(ACCOUNT) %>%
summarise(min = min(age, na.rm=TRUE),
q1 = quantile(age, 0.25, na.rm=TRUE),
median = median(age,na.rm=TRUE),
mean = mean(age,na.rm=TRUE),
q3 = quantile(age, 0.75, na.rm=TRUE),
max = max(age, na.rm=TRUE))
I get this error:
Error in
summarise()
: ! Problem while computingmin = min(age, na.rm = TRUE)
. Caused by error inmask$eval_all_summarise()
: ! object 'age' not found
The really odd thing is that exactly the same code runs fine if the distinct is replaced by a group_by clause referring to a different column, but I need to run the analysis on unique individuals - the nature of the data is such that individual accounts are likely to have more than one entry in the data table. So, for the example above I would expect n = 8 for the summarise clause.
All the packages are definitely up to date.
CodePudding user response:
distinct
needs .keep_all=T
in this case.
.keep_all: If ‘TRUE’, keep all variables in ‘.data’. If a combination of ‘...’ is not distinct, this keeps the first row of values.
Call without .keep_all=T
library(dplyr)
data %>%
distinct(ACCOUNT)
# A tibble: 8 × 1
ACCOUNT
<chr>
1 M205109
2 M201212
3 M205668
4 M207954
5 M208966
6 M203465
7 M207622
8 M201869
Call with .keep_all=T
data %>%
distinct(ACCOUNT, .keep_all=T)
# A tibble: 8 × 2
ACCOUNT age
<chr> <dbl>
1 M205109 20
2 M201212 18
3 M205668 29
4 M207954 21
5 M208966 19
6 M203465 19
7 M207622 23
8 M201869 22
data %>%
distinct(ACCOUNT, .keep_all=T) %>%
summarise(min = min(age, na.rm=TRUE),
q1 = quantile(age, 0.25, na.rm=TRUE),
median = median(age,na.rm=TRUE),
mean = mean(age,na.rm=TRUE),
q3 = quantile(age, 0.75, na.rm=TRUE),
max = max(age, na.rm=TRUE))
# A tibble: 1 × 6
min q1 median mean q3 max
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18 19 20.5 21.4 22.2 29
Data
data <- structure(list(ACCOUNT = c("M205109", "M205109", "M201212", "M205668",
"M207954", "M208966", "M203465", "M207622", "M201869", "M201869"
), age = c(20, 20, 18, 29, 21, 19, 19, 23, 22, 22)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
CodePudding user response:
I think this is what you are trying to achieve:
account <- c("M205109", "M205109", "M201212", "M205668", "M207954", "M208966", "M203465", "M207622", "M201869", "M201869")
age <- c(20, 20, 18, 29, 21, 19, 19, 23, 22, 22)
library(dplyr)
data <- data.frame(account , age)
age_summary_all <-
data %>%
group_by(account) %>%
summarise(min = min(age, na.rm=TRUE),
q1 = quantile(age, 0.25, na.rm=TRUE),
median = median(age,na.rm=TRUE),
mean = mean(age,na.rm=TRUE),
q3 = quantile(age, 0.75, na.rm=TRUE),
max = max(age, na.rm=TRUE))
age_summary_all
# A tibble: 8 x 7
account min q1 median mean q3 max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 M201212 18 18 18 18 18 18
2 M201869 22 22 22 22 22 22
3 M203465 19 19 19 19 19 19
4 M205109 20 20 20 20 20 20
5 M205668 29 29 29 29 29 29
6 M207622 23 23 23 23 23 23
7 M207954 21 21 21 21 21 21
8 M208966 19 19 19 19 19 19