To share your data using Armadillo, you can see the different relevant steps you can take before the shared data can be transferred to the researchers.
Login
In order to access the files as a data manager, you need to log in. The login method needs the URLs of the Armadillo server and the MinIO fileserver. It will open a browser window where you can identify yourself with the ID provider.
armadillo.login("https://armadillo.test.molgenis.org",
"https://armadillo-minio.test.molgenis.org")
#> [1] "We're opening a browser so you can log in with code 5HBGK8"
It will create a session and store the credentials in the environment.
Structure
If you need to share data via Armadillo you can have a nested structure to save you data.
We distinguish: - projects - folders - tables
Projects
Projects are a sort of root-folders you can give persons permissions on. you can imagine that you will use a seperate project for each study you need to support. This way you make sure people can not see eachothers variables.
Folders
Folder objects can be used to version the different tables you want to share in Armadillo. This is not mandatory and are free to use the folder level as you see fit. In our examples we will go into the versioning part a bit deeper.
Tables
Tables are actual tables in R which contain the data you want to share. This can be all the data on a certain subject, mostly used in consortia or a specific study you want to expose.
Sharing data
Let’s asume you are in a consortia which has core-variables and outcome-variables. You want to share and version the whole dataset to all researchers which applied to access your data.
First we will create the project. In our case “cohort1”.
armadillo.create_project("cohort1")
#> Created project 'cohort1'
Secondly we will load the table(s) we want to upload to Armadillo in the R-environment. We have test data which is in arrow
format, the upload will take any object that has a table like structure to upload into the Armadillo. This can be SPSS, STATA, SAS or R-based data as well.
library(arrow)
# load the core data
nonrep <- arrow::read_parquet("data/core/nonrep.parquet")
yearlyrep <- arrow::read_parquet("data/core/yearlyrep.parquet")
monthlyrep <- arrow::read_parquet("data/core/monthlyrep.parquet")
trimesterrep <- arrow::read_parquet("data/core/trimesterrep.parquet")
The third step is determining the second level, which contains in this case the datamodel-version the type of variables and the data-version.
y_y-#variable-type#-x_x
y_y = datamodel version x_x = data version
# upload the core variables
armadillo.upload_table("cohort1", "2_1-core-1_0", nonrep)
#> Compressing...
#>
|
| | 0%
|
|=========================== | 30%
|
|===================================================== | 60%
|
|================================================================================ | 90%
|
|========================================================================================| 100%
#> Uploaded 2_1-core-1_0/nonrep
armadillo.upload_table("cohort1", "2_1-core-1_0", yearlyrep)
#> Compressing...
#>
|
| | 0%
|
|========== | 11%
|
|==================== | 23%
|
|============================== | 34%
|
|======================================== | 46%
|
|================================================== | 57%
|
|============================================================ | 69%
|
|====================================================================== | 80%
|
|================================================================================= | 92%
|
|========================================================================================| 100%
#> Uploaded 2_1-core-1_0/yearlyrep
armadillo.upload_table("cohort1", "2_1-core-1_0", monthlyrep)
#> Compressing...
#>
|
| | 0%
|
|=== | 3%
|
|===== | 6%
|
|======== | 9%
|
|========== | 11%
|
|============= | 14%
|
|=============== | 17%
|
|================== | 20%
|
|==================== | 23%
|
|======================= | 26%
|
|========================= | 29%
|
|============================ | 31%
|
|============================== | 34%
|
|================================= | 37%
|
|=================================== | 40%
|
|====================================== | 43%
|
|======================================== | 46%
|
|=========================================== | 49%
|
|============================================= | 51%
|
|================================================ | 54%
|
|================================================== | 57%
|
|===================================================== | 60%
|
|======================================================= | 63%
|
|========================================================== | 66%
|
|============================================================ | 69%
|
|=============================================================== | 71%
|
|================================================================= | 74%
|
|==================================================================== | 77%
|
|====================================================================== | 80%
|
|========================================================================= | 83%
|
|=========================================================================== | 86%
|
|============================================================================== | 89%
|
|================================================================================ | 91%
|
|=================================================================================== | 94%
|
|===================================================================================== | 97%
|
|========================================================================================| 100%
#> Uploaded 2_1-core-1_0/monthlyrep
armadillo.upload_table("cohort1", "2_1-core-1_0", trimesterrep)
#> Compressing...
#>
|
| | 0%
|
|========================================================================================| 100%
#> Uploaded 2_1-core-1_0/trimesterrep
If you have more tables containing the same naming scheme you need to drop the data frames in memory and then you can reassign them. See the example below.
# drop the old assignments
rm(nonrep, yearlyrep, monthlyrep, trimesterrep)
# load the outcome data
yearlyrep <- arrow::read_parquet("data/outcome/yearlyrep.parquet")
nonrep <- arrow::read_parquet("data/outcome/nonrep.parquet")
# upload the outcome variables
armadillo.upload_table("cohort1", "1_1-outcome-1_0", nonrep)
#> Compressing...
#>
|
| | 0%
|
|========================================================================================| 100%
#> Uploaded 1_1-outcome-1_0/nonrep
armadillo.upload_table("cohort1", "1_1-outcome-1_0", yearlyrep)
#> Compressing...
#>
|
| | 0%
|
|========================================================================================| 100%
#> Uploaded 1_1-outcome-1_0/yearlyrep
Looking at the data
There are helper functions to help you determine what is in the storage server. You can list projects and tables to what’s in the storage.
# listing tables per project
armadillo.list_projects()
#> [1] "cohort1" "cohort2" "cohort3" "dnbc"
#> [5] "example-dictionary" "exampledictionary" "methylationclocks" "subset1"
# listing tables per project
armadillo.list_tables("cohort1")
#> [1] "1_1-outcome-1_0/nonrep" "1_1-outcome-1_0/yearlyrep" "2_1-core-1_0/monthlyrep"
#> [4] "2_1-core-1_0/nonrep" "2_1-core-1_0/trimesterrep" "2_1-core-1_0/yearlyrep"
You can download the data in the R-environment as well.
# download table to local R environment
trimesterrep <- armadillo.load_table("cohort1", "2_1-core-1_0", "trimesterrep")
# check the column names from the local environment
colnames(trimesterrep)
#> [1] "row_id" "child_id" "age_trimester" "smk_t" "alc_t"
Now you can also take a look at the files in the user interface of the MinIO fileserver if you open the MinIO server URL in a browser window.
Deleting the data
To delete the data you need to throw away the contents first.
# throw away the core tables
armadillo.delete_table("cohort1", "2_1-core-1_0", "nonrep")
#> Deleted '2_1-core-1_0/nonrep'.
armadillo.delete_table("cohort1", "2_1-core-1_0", "yearlyrep")
#> Deleted '2_1-core-1_0/yearlyrep'.
armadillo.delete_table("cohort1", "2_1-core-1_0", "trimesterrep")
#> Deleted '2_1-core-1_0/trimesterrep'.
armadillo.delete_table("cohort1", "2_1-core-1_0", "monthlyrep")
#> Deleted '2_1-core-1_0/monthlyrep'.
# throw away the outcome tables
armadillo.delete_table("cohort1", "1_1-outcome-1_0", "nonrep")
#> Deleted '1_1-outcome-1_0/nonrep'.
armadillo.delete_table("cohort1", "1_1-outcome-1_0", "yearlyrep")
#> Deleted '1_1-outcome-1_0/yearlyrep'.
Now you can delete the project.
armadillo.delete_project("cohort1")
#> Deleted project 'cohort1'