| Title: | Simple Data Versioning |
|---|---|
| Description: | Simple data versioning using GitHub to store data. |
| Authors: | Rich FitzJohn |
| Maintainer: | Rich FitzJohn <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.4 |
| Built: | 2026-06-01 08:49:44 UTC |
| Source: | https://github.com/traitecoevo/datastorr |
Autogenerate a datastorr interface for a package. The idea is to run this function and save the resulting code in a file in your package. Users will then be able to download data, and you will be able to release data easily.
autogenerate(repo, read, filename = NULL, name = basename(repo), roxygen = TRUE)
| repo | Name of the repo on GitHub (in username/repo format) |
|---|---|
| read | Name of a function to read the data. Do not give the function itself! |
| filename | Name of the file to read. If not given, then the single file in a release will be read (but you will need to provide a filename on upload). If given, you can never change the filename, as all releases will be assumed to have the same filename. |
| name | Name of the dataset, used in generating the functions. If omitted, the repo name is used. |
| roxygen | Include roxygen headers for the functions? |
In addition to running this, you will need to add datastorr
to the Imports: section of your DESCRIPTION. To upload
files you will need to set your GITHUB_TOKEN environment
variable. These steps will be described more fully in a vignette.
More complete instructions:
Let pkg be basename(repo); the name of the package
and of the GitHub repository.
First, create a new R package, e.g. devtools::create(pkg).
Then, copy the result of running autogenerate into a file
in that package, e.g.
writeLines(autogenerate(repo, read),
file.path(pkg, "datastorr.R"))
devtools::document(pkg)
Create a new git repository for this package, and add all the files in the package, and commit.
On GitHub, create a repository for the package and push your code there.
At this point you are now ready to start making releases by
loading your package and running pkg::<name>_release().
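The Imports step mentioned above can also be scripted. The following is a minimal base R sketch, assuming a standard DCF-format DESCRIPTION file; `add_import` is a hypothetical helper and not part of datastorr, and editing the file by hand works just as well. Note that `write.dcf` may re-wrap long fields.

```r
# Hypothetical helper (not part of datastorr): add a package to the
# Imports field of a DESCRIPTION file, creating the field if needed.
add_import <- function(description_path, package = "datastorr") {
  desc <- read.dcf(description_path)
  if ("Imports" %in% colnames(desc)) {
    current <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])
    # Compare on bare package names so "pkg (>= 1.0)" still counts as present
    bare <- sub("[[:space:]]*\\(.*$", "", current)
    if (!package %in% bare) {
      desc[1, "Imports"] <- paste(c(current, package), collapse = ", ")
    }
  } else {
    desc <- cbind(desc, Imports = package)
  }
  write.dcf(desc, description_path)
  invisible(desc)
}

# Try it on a throwaway DESCRIPTION file
desc_file <- tempfile()
writeLines(c("Package: mypkg", "Version: 0.0.1"), desc_file)
add_import(desc_file)
read.dcf(desc_file, fields = "Imports")
```

Calling the helper a second time leaves the file unchanged, so it is safe to keep in a setup script.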
writeLines(autogenerate("richfitz/datastorr.example", read = "readRDS", name = "mydata"))
writeLines(autogenerate("richfitz/datastorr.example", read = "readRDS", name = "mydata", roxygen = FALSE))
Create a lightweight datastorr interface (rather than using the full package approach). This approach is designed for the "files that don't fit in git" use-case.
datastorr(repo, path = NULL, metadata = "datastorr.json", branch = "master", private = FALSE, refetch = FALSE, version = NULL, extended = FALSE)
datastorr_versions(..., local = TRUE)
| repo | Either a github repo in the form |
|---|---|
| path | The path to store the data at. Using |
| metadata | The name of the metadata file within the repo (if |
| branch | The branch in the repo to use. Default is |
| private | A logical indicating if the repository is private and therefore if authentication will be needed to access it. |
| refetch | Refetch the metadata file even if it has already been downloaded previously. |
| version | Which version to download (if |
| extended | Don't fetch the data, but instead return an object that can query data, versions, etc. |
| ... | Arguments passed through to |
| local | Return information on local versions? |
Note that the package approach is likely to scale better; in particular it allows for the reading function to be arbitrarily complicated, allows for package installation and loading, etc. With this simple interface you will need to document your dependencies carefully. But it does remove the requirement for making a package and will likely work pretty well as part of an analysis pipeline where your dependencies are well documented anyway.
## Not run:
path <- tempfile()
dat <- datastorr("richfitz/data", path, extended = TRUE)
dat$list()
dat()
## End(Not run)
Authentication for accessing GitHub. This will first look for a
GitHub personal token (stored in the GITHUB_TOKEN or
GITHUB_PAT environment variables), and then try
authenticating with OAuth.
datastorr_auth(required = FALSE, key = NULL, secret = NULL, cache = TRUE, token_only = FALSE)
setup_github_token(path = "~/.Renviron")
| required | Is authentication required? Reading from public repositories does not require authentication so there's no point worrying if we can't get it. datastorr will set this when appropriate internally. |
|---|---|
| key, secret | The application key and secret. If |
| cache | Logical, indicating whether we should cache the token. If |
| token_only | Return the token only |
| path | Path to environment file; the default is the user environment variable file which is usually a good choice. |
Run datastorr_auth to force setting up
authentication with OAuth. Alternatively, run
setup_github_token to set up a personal access token.
Either can be revoked at any time: visit
https://github.com/settings/tokens to revoke a personal
access token and https://github.com/settings/applications to
revoke the OAuth token.
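To check which credential would be picked up before calling datastorr_auth, the environment-variable lookup described above can be approximated in base R. This is a sketch only: `find_github_token` is a hypothetical helper, it is not datastorr's own code, and the order in which the two variables are checked is an assumption.

```r
# Hypothetical helper mirroring the lookup described above; this is not
# datastorr's own code. The order in which the two variables are checked
# is an assumption.
find_github_token <- function(vars = c("GITHUB_TOKEN", "GITHUB_PAT")) {
  for (v in vars) {
    token <- Sys.getenv(v, unset = "")
    if (nzchar(token)) {
      return(token)
    }
  }
  NULL  # nothing set: this is where an OAuth fallback would take over
}

has_token <- !is.null(find_github_token())
message("Personal token available: ", has_token)
```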
Location of datastorr files. This is determined by
rappdirs using the user_data_dir function.
Alternatively, if the option datastorr.path is set, that is
used for the base path. The path to data from an actual repo is
stored in a subdirectory under this directory.
datastorr_path(repo = NULL)
| repo | An optional repo (of the form |
|---|---|
Files in this directory can be deleted at will (e.g., running
unlink(datastorr_path(), recursive = TRUE) will delete all
files that datastorr has ever downloaded). The only issue here is
that the OAuth token (used to authenticate with GitHub) is also
stored in this directory.
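The precedence between the datastorr.path option and the rappdirs default can be sketched as follows. `resolve_base_path` is a hypothetical function written for illustration, not datastorr's implementation, and the application name passed to `rappdirs::user_data_dir` is an assumption.

```r
# Hypothetical resolver, not datastorr's own code.
resolve_base_path <- function() {
  opt <- getOption("datastorr.path")
  if (!is.null(opt)) {
    return(opt)
  }
  # Fallback: needs the rappdirs package. The application name passed
  # here is an assumption; the source does not state it.
  rappdirs::user_data_dir("datastorr")
}

# Point the base path at a throwaway directory for this session
old <- options(datastorr.path = file.path(tempdir(), "datastorr-data"))
resolve_base_path()
options(old)  # restore the previous setting
```

Setting the option at the top of an analysis script keeps downloaded data next to the project rather than in the per-user data directory.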
Create a GitHub release for your package. This tries very hard to
do the right thing but it's not always straightforward. It first
looks for your package. Then it will work out what your last
commit was (if target is NULL) and the version of the package
(from the DESCRIPTION). It then creates a release on GitHub with
the appropriate version number and uploads the file
filename to the release. The version number in the
DESCRIPTION must be greater than the highest version number on
GitHub.
github_release_create(info, description = NULL, filenames = NULL, target = NULL, ignore_dirty = FALSE, yes = !interactive())
| info | Result of running |
|---|---|
| description | Optional text description for the release. If this is omitted then GitHub will display the commit message from the commit that the release points at. |
| target | Target of the release. This can be either the name of a branch (e.g., |
| ignore_dirty | Ignore non-checked in files? By default, your repository is expected to be in a clean state, though files not known to git are ignored (as are files that are ignored by git). But you must have no uncommitted changes or staged but uncommitted files. |
| yes | Skip the confirmation prompt? Only prompts if interactive. |
| filename | Filename to upload; optional if in |
This function requires a system git to be installed and on the path. The version does not have to be particularly recent.
This function also requires the GITHUB_TOKEN environment
variable to be set, and for the token to be authorised to have
write access to your repositories.
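The requirements above (a system git, a GITHUB_TOKEN, and a DESCRIPTION version greater than the highest released version) can be checked up front. The following base R sketch uses a hypothetical `release_preflight` helper that is not part of datastorr; it assumes released versions are supplied as strings that may carry a leading "v".

```r
# Hypothetical pre-flight checks; not part of datastorr.
release_preflight <- function(new_version, released = character(0)) {
  # Assumption: release tags may carry a leading "v", which is stripped here
  released <- numeric_version(sub("^v", "", released))
  c(
    git_on_path   = unname(nzchar(Sys.which("git"))),
    token_set     = nzchar(Sys.getenv("GITHUB_TOKEN")),
    # numeric_version compares component-wise, so 0.0.10 > 0.0.9
    version_bumps = length(released) == 0L ||
      numeric_version(new_version) > max(released)
  )
}

release_preflight("0.0.4", released = c("v0.0.2", "v0.0.3"))
```

The result is a named logical vector, so `all(release_preflight(...))` gives a single go/no-go answer.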
Delete a local copy of a version (or all local copies). Note that this does not affect the actual GitHub release in any way!
github_release_del(info, version)
| info | Result of running |
|---|---|
| version | Version to delete. If |
Get a version of a data set, downloading it if necessary.
github_release_get(info, version = NULL)
| info | Result of running |
|---|---|
| version | Version to fetch. If |
Information describing how to process GitHub releases.
github_release_info(repo, read, private = FALSE, filename = NULL, path = NULL)
| repo | Name of the repo in |
|---|---|
| read | Function to read the file. See Details. |
| private | Is the repository private? If so authentication will be required for all actions. Setting this is optional but will result in better error messages because of the way GitHub returns not found/404 (rather than forbidden/403) errors when accessing private repositories without authorisation. |
| filename | Optional filename. If omitted, all files in the release can be used. If the filename contains a star ("*") it will be treated as a filename glob. So you can do |
| path | Optional path in which to store the data. If omitted we use |
The simplest case is where the data are stored in a single file attached to the release (this is different to the zip/tar.gz files that the web interface displays). For example, a single csv file. In that case the filename argument can be safely omitted and we'll work it out based on the filename.
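For the single-csv case, a reader can be written and checked locally before it is passed as the read argument. This is a sketch under assumptions: `read_csv_plain` is a hypothetical reader, the read function is assumed to be called with the path of the downloaded file as its first argument, and "username/repo" is a placeholder rather than a real repository.

```r
# A reader for a release that carries a single CSV file. Assumption: the
# read function is called with the path to the downloaded file.
read_csv_plain <- function(filename) {
  utils::read.csv(filename, stringsAsFactors = FALSE)
}

# Check the reader locally before wiring it into a release description
tmp <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), tmp,
                 row.names = FALSE)
str(read_csv_plain(tmp))

# Only runs if datastorr is installed; "username/repo" is a placeholder
if (requireNamespace("datastorr", quietly = TRUE)) {
  info <- datastorr::github_release_info("username/repo",
                                         read = read_csv_plain,
                                         path = tempfile())
}
```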
Get release versions
github_release_versions(info, local = TRUE)
github_release_version_current(info, local = TRUE)
| info | Result of running |
|---|---|
| local | Should we return local (TRUE) or github (FALSE) version numbers? Github version numbers are pulled once per session only. The exception is for |
Rich FitzJohn
Create a release for a simple datastorr (i.e., non-package based).
release(repo, version, description = NULL, filename = NULL, path = NULL, metadata = "datastorr.json", branch = "master", private = FALSE, refetch = FALSE, target = NULL, ignore_dirty = FALSE, yes = !interactive())
| repo | Either a github repo in the form |
|---|---|
| version | A version number for the new version. Should be of the form x.y.z, and may or may not contain a leading "v" (one will be added in any case). |
| description | Optional text description for the release. If this is omitted then GitHub will display the commit message from the commit that the release points at. |
| filename | Filename to upload; optional if in |
| path | The path to store the data at. Using |
| metadata | The name of the metadata file within the repo (if |
| branch | The branch in the repo to use. Default is |
| private | A logical indicating if the repository is private and therefore if authentication will be needed to access it. |
| refetch | Refetch the metadata file even if it has already been downloaded previously. |
| target | The SHA or tag to attach the release to. By default, will use the current HEAD, which is typically what you want to do. |
| ignore_dirty | Ignore non-checked in files? By default, your repository is expected to be in a clean state, though files not known to git are ignored (as are files that are ignored by git). But you must have no uncommitted changes or staged but uncommitted files. |
| yes | Skip the confirmation prompt? Only prompts if interactive. |
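The rule for the version argument (x.y.z, with a leading "v" added if it is missing) can be illustrated in base R. `normalise_version` is a hypothetical helper written for this sketch; it is not datastorr's own code, and the strictness of the x.y.z check is an assumption.

```r
# Hypothetical helper, not datastorr's own code: accept "x.y.z" with or
# without a leading "v" and always return the "v"-prefixed form.
normalise_version <- function(version) {
  bare <- sub("^v", "", version)
  if (!grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", bare)) {
    stop("version must look like x.y.z, got: ", version)
  }
  paste0("v", bare)
}

normalise_version("1.0.0")
normalise_version("v1.0.0")
```

Both calls return "v1.0.0", so either spelling can be used when making a release.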