I am writing a script to clean a very long data frame where I have a key-value pair in two columns named parameter and value, where parameter is a factor with multiple levels, and value can be either a date, a number, or a string.
In the example below I use as.numeric()
, lubridate::dmy_hms()
and use the NAs
they generate to understand what is not numeric and not a date. This somehow allows me to generate different lists based on the value type and then split them based on their parameter type. I guess the rexprex is more explanatory than my description.
The issue is this approach seems slow, convoluted, and somehow patchy. Is there a better way to do this?
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- data.table::data.table(
value = c(
"13.249",
"NN",
"13.763",
"13.665",
"09/Mar/2022 11:05:12",
"12.276",
"13.405",
"13.684",
"13.744",
"14.201"
),
parameter = as.factor(
c(
"volume",
"signature",
"volume",
"volume",
"inoculated",
"volume",
"volume",
"volume",
"volume",
"volume"
)
)
)
# Finds the numeric part of the data frame and separates it from the rest
vector_numeric_df <- df %>%
mutate(value = as.numeric(value)) %>%
filter(!is.na(value)) %>%
group_by(parameter) %>%
summarise(n = n()) %>%
select(parameter) %>%
pull() %>%
as.vector()
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
vector_non_numeric_df <- df %>%
mutate(value = as.numeric(value)) %>%
filter(is.na(value)) %>%
group_by(parameter) %>%
summarise(n = n()) %>%
select(parameter) %>%
pull() %>%
as.vector()
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
numeric_df <- df %>%
filter(parameter %in% vector_numeric_df) %>%
mutate(value = as.numeric(value))
numeric_df_list <- numeric_df %>%
group_split(parameter) %>%
setNames(sort(unique(numeric_df$parameter)))
# Finds the date part of the data frame and separates it from the rest
vector_date_df <- df %>%
filter(parameter %in% vector_non_numeric_df) %>%
mutate(value = dmy_hms(value)) %>%
filter(!is.na(value)) %>%
group_by(parameter) %>%
summarise(n = n()) %>%
select(parameter) %>%
pull() %>%
as.vector()
#> Warning: 1 failed to parse.
date_df <- df %>%
filter(parameter %in% vector_date_df) %>%
mutate(value = dmy_hms(value))
date_df_list <- date_df %>%
group_split(parameter) %>%
setNames(sort(unique(date_df$parameter)))
# Makes everything else a character
vector_char_df <- df %>%
filter(!parameter %in% vector_numeric_df,
!parameter %in% vector_date_df) %>%
group_by(parameter) %>%
summarise(n = n()) %>%
select(parameter) %>%
pull() %>%
as.vector()
char_df <- df %>%
filter(parameter %in% vector_char_df) %>%
mutate(value = as.character(value))
char_df_list <- char_df %>%
group_split(parameter) %>%
setNames(sort(unique(char_df$parameter)))
# Generates all the objects
list2env(numeric_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
list2env(date_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
list2env(char_df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
str(inoculated)
#> tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
#> $ value : POSIXct[1:1], format: "2022-03-09 11:05:12"
#> $ parameter: Factor w/ 3 levels "inoculated","signature",..: 1
str(signature)
#> tibble [1 x 2] (S3: tbl_df/tbl/data.frame)
#> $ value : chr "NN"
#> $ parameter: Factor w/ 3 levels "inoculated","signature",..: 2
str(volume)
#> tibble [8 x 2] (S3: tbl_df/tbl/data.frame)
#> $ value : num [1:8] 13.2 13.8 13.7 12.3 13.4 ...
#> $ parameter: Factor w/ 3 levels "inoculated","signature",..: 3 3 3 3 3 3 3 3
Created on 2022-10-17 with reprex v2.0.2
CodePudding user response:
Since each parameter
has a different data type, we can split()
by parameter
and then use parse_guess()
to do most of the work. Your date data are non-standard and need a little special handling. Lastly, I usually try to avoid dumping related data into separate environment objects, but since it's apparently what you want, here's how to do it:
library(tidyverse)
library(lubridate)
# split the data frame by 'parameter' and attempt to guess each value format
df_split <- split(df, df$parameter) %>%
lapply(\(x) x %>% mutate(value = parse_guess(value)))
# the date data have a non-standard format and need special handling
df_split$inoculated <- df_split$inoculated %>%
mutate(value = dmy_hms(value))
# place list objects in global environment: not good practice but it seems to be what you want
list2env(df_split, envir = .GlobalEnv)
CodePudding user response:
As jdobres points out, it seems you can simply do split by parameter. If you still want to split by data type try this:
df_type <- suppressWarnings(ifelse(
!is.na(as.numeric(df$value)),"volume",ifelse(
!is.na(dmy_hms(df$value)),"inoculated","signature")))
l <- split(df,df_type)
l$inoculated
l$signature
l$volume
CodePudding user response:
Others have pointed out that split
does what you want in terms of creating the separate data frames.
Assuming you don't want to use df$parameter
, or you can't trust it, you can just parse df$value
with your as.
functions to create the split factor:
split(df,
is.na(suppressWarnings(as.numeric(df$value))) # 1 if not numeric
2 * is.na(as.Date(df$value, "%d/%b/%Y %H:%m:%s")) # 2 if not date
)
$`1`
value parameter
5 09/Mar/2022 11:05:12 inoculated
$`2`
value parameter
1 13.249 volume
3 13.763 volume
4 13.665 volume
6 12.276 volume
7 13.405 volume
8 13.684 volume
9 13.744 volume
10 14.201 volume
$`3`
value parameter
2 NN signature
A small technical note, if there are any NAs in the df$value
column they will be dropped (because split
doesn't like them, and your parsing will return NAs)