I've built a package of tools in R for doing things on a project and I want to share the package with my colleagues in a user friendly way.
There are a number of data files built into the package, including many which work with the functions I've developed. When I use my own package I don't mind loading the data file and then using that with my functions. However, I don't want other users to have the hassle of doing that; I just want them to call the function and have the dataset used in the background.
I should say the data isn't secret, I don't need to keep it from the users, I just don't want it to distract them when they type my package name and have to select from a long list of functions in a sea of data files.
So I would do something like:
user_data <- c("Strawberry","Pistachio","Chocolate Chip")
my_df <- fat_pats_tools::ice_cream_flavours
results <- fat_pats_tools::ice_cream_detector_function(user_data, my_df)
But I would like my users to be able to do:
user_data <- c("Strawberry","Pistachio","Chocolate Chip")
results <- fat_pats_tools::ice_cream_detector_function(user_data) # using the internal ice_cream_flavours data
And I would also like them to only see the list of functions when they type 'fat_pats_tools::' in RStudio, not get lost in a load of data file names.
So two questions I'd be grateful for some advice on:
- How do I add data to a package which is accessible to my functions but not my users? (I currently use usethis:: and devtools:: to create the 'public' data)
- How do I reference the private data from within my functions so that R knows to search within the current package (e.g. 'fat_pats_tools')?
I've struggled to find an answer to this online, as most results assume the data is secret and needs encryption etc., or needs to live in a repository such as GitHub. My aim is just to make my package easier/slicker to use for people new to R/RStudio, particularly as there could be around ten data packages used by functions.
Thanks in advance for your help, and apologies if I've missed something obvious!
CodePudding user response:
Here's a walk-through for how to formally add data to a package and make it the default data for a function.
(All of this is documented in https://r-pkgs.org/data.html and other places.)
Public Data, Same Package
devtools::create("mypkg")
# ✔ Creating 'C:/Users/r2/StackOverflow/20770390/75193911/mypkg/'
# ✔ Setting active project to 'C:/Users/r2/StackOverflow/20770390/75193911/mypkg'
# ✔ Creating 'R/'
# ✔ Writing 'DESCRIPTION'
# Package: mypkg
# Title: What the Package Does (One Line, Title Case)
# Version: 0.0.0.9000
# Authors@R (parsed):
# * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
# Description: What the package does (one paragraph).
# License: `use_mit_license()`, `use_gpl3_license()` or friends to
# pick a license
# Encoding: UTF-8
# Roxygen: list(markdown = TRUE)
# RoxygenNote: 7.2.3
# ✔ Writing 'NAMESPACE'
# ✔ Setting active project to '<no active project>'
setwd("mypkg")
Optionally set up data-raw, which helps you to formalize a process for creating the data.
usethis::use_data_raw("mydata", FALSE)
# ✔ Setting active project to 'C:/Users/r2/StackOverflow/20770390/75193911/mypkg'
# ✔ Creating 'data-raw/'
# ✔ Adding '^data-raw$' to '.Rbuildignore'
# ✔ Writing 'data-raw/mydata.R'
# • Finish the data preparation script in 'data-raw/mydata.R'
# • Use `usethis::use_data()` to add prepared data to package
Now edit the data-raw/mydata.R file to read:
mydata <- mtcars[1:4, 1:3]
usethis::use_data(mydata, overwrite = TRUE)
and source the file. If you don't want to use data-raw/.., you can simply call the use_data(..) command manually (with one or more datasets that you've defined elsewhere).
From here, let's write a function in R/fun.R:
#' Pass-through to head
#'
#' @param n integer
#' @param data data, defaults to mypkg::mydata
#' @return data.frame
#' @export
myfun <- function(n = 3, data = mypkg::mydata) utils::head(data, n = n)
Now we can document (which loads) and use it.
devtools::document()
# ℹ Updating mypkg documentation
# ℹ Loading mypkg
# Writing NAMESPACE
# Writing myfun.Rd
myfun(1)
# mpg cyl disp
# Mazda RX4 21 6 160
myfun(1, data=mtcars[4:6,1:5])
# mpg cyl disp hp drat
# Hornet 4 Drive 21.4 6 258 110 3.08
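One detail behind the default-data pattern above: R evaluates a default argument lazily, so the packaged dataset is only touched when the caller does not supply their own data. A minimal sketch, assuming a hypothetical default_data() helper (not part of mypkg) that stands in for mypkg::mydata and counts how often it is evaluated:

```r
# Hypothetical stand-in for the packaged dataset; the counter records
# each time the default is actually evaluated.
default_calls <- 0
default_data <- function() {
  default_calls <<- default_calls + 1
  mtcars[1:4, 1:3]
}

# Same shape as myfun above, with the default swapped for the stand-in.
myfun <- function(n = 3, data = default_data()) utils::head(data, n = n)

from_default <- myfun(1)                           # default is evaluated here
from_user    <- myfun(1, data = mtcars[4:6, 1:5])  # default is never evaluated
default_calls
```

So a user who passes their own data frame never pays for (or depends on) the built-in one.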
Though not required, you can document your dataset by adding a file such as R/mydata.R:
#' My data, a subset of mtcars
#'
#' A subset of data from the infamous mtcars dataset
#'
#' @format ## `mydata`
#' A data frame with 4 rows and 3 columns:
#' \describe{
#' \item{mpg}{Miles per gallon}
#' \item{cyl}{Number of cylinders}
#' \item{disp}{Displacement}
#' ...
#' }
"mydata"
then run devtools::document() again, and now your users can (if they choose) read ?mypkg::mydata.
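Data added this way is public in the sense that it is listed by data(package = ...) as well as by tab completion. As a rough illustration, the sketch below lists the public datasets of the base datasets package (used here only as a stand-in for mypkg, which is an assumption for the sake of a runnable example):

```r
# data(package = ...) returns an object whose `results` matrix has one
# row per public dataset; the "Item" column holds the dataset names.
public_items <- data(package = "datasets")$results[, "Item"]

# mtcars lives in datasets' data/ directory, so it shows up in the listing.
"mtcars" %in% public_items
head(public_items)
```

Running the same call with package = "mypkg" after installing it should list mydata in the same way.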
"Private" Data, Same Package
If having the users see the data.frame names when they type in mypkg::<tab> really is something to be avoided, then instead of making the data public, you can make it private using
usethis::use_data_raw("privdata", FALSE)
and in the data-raw/privdata.R file,
privdata <- iris[1:3,]
usethis::use_data(privdata, overwrite = TRUE, internal = TRUE)
When this is sourced, we find R/sysdata.rda and nothing new in data/..
Once we document, we can see that it is not readily visible but can still be accessed,
mypkg::privdata
# Error: 'privdata' is not an exported object from 'namespace:mypkg'
mypkg:::privdata
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
which means we would modify our function to be:
myfun
# function(n = 3, data = privdata) utils::head(data, n = n)
# <environment: namespace:mypkg>
myfun()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
(No need for mypkg::: in the function declaration, and use of ::: in packages is generally discouraged anyway. Frankly, we could probably do away with the prefix in the first function definition well above.)
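To see why the bare name works, it may help to mimic the mechanism outside a package. The sketch below is only an approximation: a temp file stands in for R/sysdata.rda, a plain environment stands in for the package namespace, and the `flavour` column and the body of ice_cream_detector_function are hypothetical, since the question does not show them.

```r
# Hypothetical lookup table; the column name `flavour` is an assumption.
ice_cream_flavours <- data.frame(
  flavour = c("Strawberry", "Pistachio", "Vanilla"),
  stringsAsFactors = FALSE
)

# use_data(..., internal = TRUE) writes R/sysdata.rda; a temp file stands in here.
rda_path <- file.path(tempdir(), "sysdata.rda")
save(ice_cream_flavours, file = rda_path)
rm(ice_cream_flavours)  # the object is no longer in the workspace

# A plain environment plays the role of the package namespace.
fake_ns <- new.env()
load(rda_path, envir = fake_ns)

# Hypothetical detector: flags which inputs appear in the lookup table.
ice_cream_detector_function <- function(user_data, data = ice_cream_flavours) {
  user_data %in% data$flavour
}
# Functions in a package are enclosed by its namespace; emulate that here.
environment(ice_cream_detector_function) <- fake_ns

user_data <- c("Strawberry", "Pistachio", "Chocolate Chip")
results <- ice_cream_detector_function(user_data)
results
```

The default argument is looked up through the function's enclosing environment, which is how a real package function finds objects from sysdata.rda without any prefix and without the user ever loading the data.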
Public Data, Different Package
(I should note up front that this helps to mitigate your concerns about users seeing frame names when they type mypkg::<tab>, since the data is not under mypkg:: but under mypkgdata::<tab> instead.)
If your data is large, if the change frequency for functions versus data is significantly different, if the dev/release cycle is executed by different people or through different policy channels, etc., it might be advantageous to have separate packages for functions and data. This pattern is used in (for example) rnaturalearth with its rnaturalearthdata package (see https://blog.r-hub.io/2020/05/29/distribute-data/).
devtools::create("mypkgdata")
setwd("mypkgdata")
usethis::use_data_raw("mydata")
### edit `data-raw/mydata.R` as above and source it
### optionally document the data in `R/mydata.R` as above
devtools::document()
devtools::install() ## optionally `::build()` it for others
Now go back to the mypkg package to:
- update the function to use the new data (notice the @import roxygen2 tag):
  #' Pass-through to head
  #'
  #' @param n integer
  #' @param data data, defaults to mypkgdata::mydata
  #' @return data.frame
  #' @export
  #' @import mypkgdata
  myfun <- function(n = 3, data = mypkgdata::mydata) utils::head(data, n = n)
- remove the data files from mypkg: data-raw/mydata.R, data/mydata.rda, and R/mydata.R (if you documented it); you can either manually remove man/mydata.Rd or rerun devtools::document()
- update the DESCRIPTION file:
  usethis::use_package("mypkgdata")
  # ✔ Adding 'mypkgdata' to Imports field in DESCRIPTION
  # • Refer to functions with `mypkgdata::fun()`
(Notice that use_package() adds mypkgdata to the Imports: section of the DESCRIPTION file, while the @import tag adds import(mypkgdata) to NAMESPACE the next time devtools::document() runs; both are essential.)
With all that, it works as before.
myfun
# function(n = 3, data = mypkgdata::mydata) utils::head(data, n = n)
# <environment: namespace:mypkg>
myfun(3)
# mpg cyl disp
# Mazda RX4 21.0 6 160
# Mazda RX4 Wag 21.0 6 160
# Datsun 710 22.8 4 108
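A variation on the separate-package setup: if the data package were listed under Suggests instead of Imports (an assumption; the walk-through above uses Imports), the function would need to check that the package is installed before using its data. A minimal sketch, where get_pkg_data() is a hypothetical helper and the base datasets package stands in for mypkgdata so the example runs anywhere:

```r
# Hypothetical helper: fetch a dataset from another package, failing with
# a readable message if that package is not installed.
get_pkg_data <- function(name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is needed for the default dataset; ",
         "please install it.", call. = FALSE)
  }
  # getExportedValue() is what `pkg::name` calls; it also finds lazy-loaded data.
  getExportedValue(pkg, name)
}

# "datasets"/"mtcars" stand in for "mypkgdata"/"mydata" here.
myfun <- function(n = 3, data = get_pkg_data("mtcars", "datasets")) {
  utils::head(data, n = n)
}

myfun(2)
```

Because the default is evaluated lazily, the installation check only runs when the user relies on the built-in data.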