Public Data Integration using Sfaira
Alexander Dietrich
Source:vignettes/sfaira_vignette.Rmd
Sfaira Integration
This vignette covers the integration of the public database sfaira.
Setup
SimBu uses sfaira (Fischer et al. 2020) as its public database. sfaira
is a dataset and model repository for single-cell RNA-sequencing data.
It gives access to multiple datasets from human and mouse with
more than 3 million cells in total. You can browse them interactively
here: https://theislab.github.io/sfaira-portal/Datasets.
Note that only annotated datasets will be downloaded! In addition, some
datasets have private URLs and cannot be downloaded automatically;
SimBu will skip these datasets.
To use this database, we first need to install it. This is done by
running the setup_sfaira() function for the first time. In the
background, we use the basilisk package to establish a conda
environment that has all sfaira dependencies installed. The
installation is performed only once, even if you close your R session
and call setup_sfaira() again. The given directory serves as the
storage for all future downloaded datasets from sfaira:
setup_list <- SimBu::setup_sfaira(basedir = tempdir())
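Because tempdir() is removed when the R session ends, datasets stored there would have to be downloaded again in a new session. A minimal sketch of a persistent alternative follows. The helper name sfaira_basedir and the choice of the per-user cache directory are assumptions for illustration and are not part of SimBu.

```r
# Hypothetical helper (not part of SimBu): create and return a directory
# that survives across R sessions, for use as `basedir`.
sfaira_basedir <- function(root = tools::R_user_dir("SimBu", which = "cache")) {
  path <- file.path(root, "sfaira")
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  normalizePath(path, winslash = "/", mustWork = TRUE)
}

# Usage, assuming SimBu is installed:
# setup_list <- SimBu::setup_sfaira(basedir = sfaira_basedir())
```

tools::R_user_dir() requires R 4.0 or later; on older versions any fixed path you control would serve the same purpose.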
Creating a dataset
We will now create a dataset of samples from human pancreas using the
organisms and tissues parameters. You can provide a single word (like
we do here) or, for example, a vector of tissues you want to download:
c("pancreas","lung"). An additional parameter is assays, which subsets
the database further to only download datasets from certain sequencing
assays (for example Smart-seq2).
The name parameter is used later on to give each sample (cell) a
unique name.
ds_pancreas <- SimBu::dataset_sfaira_multiple(
  sfaira_setup = setup_list,
  organisms = "Homo sapiens",
  tissues = "pancreas",
  name = "human_pancreas"
)
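The tissues and assays parameters described above can be combined, as in the following sketch. The argument values are illustrative: whether sfaira holds annotated Smart-seq2 datasets for both tissues is an assumption, and the download is only attempted if SimBu is installed and setup_list from the setup step exists in the session.

```r
# Query arguments; the parameter names are the ones described above.
query <- list(
  organisms = "Homo sapiens",
  tissues   = c("pancreas", "lung"),  # several tissues as a character vector
  assays    = "Smart-seq2",           # restrict to one sequencing assay
  name      = "human_pancreas_lung"
)

# Run the download only when the setup step has been done in this session.
if (exists("setup_list") && requireNamespace("SimBu", quietly = TRUE)) {
  ds_multi <- do.call(
    SimBu::dataset_sfaira_multiple,
    c(list(sfaira_setup = setup_list), query)
  )
}
```

Keeping the arguments in a list makes it easy to inspect or reuse the query before triggering a potentially long download.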
Currently, sfaira contains three human pancreas datasets with cell-type
annotation. The package will download them for you automatically and
merge them into a single expression matrix and a streamlined
annotation table, which we can use for our simulation.
Some datasets from sfaira may not (yet) be ready for automatic
download. In that case, an error message will appear in R, telling
you which file to download and where to put it.
To see all datasets that are included in sfaira, you can use the following command:
all_datasets <- SimBu::sfaira_overview(setup_list = setup_list)
head(all_datasets)
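The overview can be long, so it may help to search it for a keyword. The sketch below assumes only that sfaira_overview() returns a data frame. The helper find_rows and the toy table, including its column names, are hypothetical stand-ins and do not reflect the real sfaira schema.

```r
# Hypothetical helper: keep the rows of `df` in which any column
# matches `pattern` (case-insensitive).
find_rows <- function(df, pattern) {
  per_column <- lapply(df, function(col) {
    grepl(pattern, as.character(col), ignore.case = TRUE)
  })
  keep <- Reduce(`|`, per_column)
  df[keep, , drop = FALSE]
}

# Toy stand-in for the overview table; column names are made up.
toy <- data.frame(
  id    = c("homosapiens_pancreas_a", "musmusculus_lung_b", "homosapiens_lung_c"),
  organ = c("pancreas", "lung", "lung"),
  stringsAsFactors = FALSE
)

lung_hits <- find_rows(toy, "lung")
```

With the real table, the same call would be find_rows(all_datasets, "lung").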
The overview lets you find the specific IDs of datasets, which you can then use to download them directly:
SimBu::dataset_sfaira(
  sfaira_id = "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x",
  sfaira_setup = setup_list,
  name = "dataset_by_id"
)
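The ID above appears to be a sequence of underscore-separated fields. As a reading aid, the sketch below splits it into its parts. The field labels are guesses based on the shape of this one ID and are not taken from the sfaira documentation.

```r
sfaira_id <- "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x"

# Split on underscores; the last field (which looks like a DOI) takes
# everything after the sixth underscore.
parts <- strsplit(sfaira_id, "_", fixed = TRUE)[[1]]
fields <- c(parts[1:6], paste(parts[7:length(parts)], collapse = "_"))

# Assumed labels, inferred from the shape of this ID only.
names(fields) <- c("organism", "organ", "year", "assay", "author", "index", "doi")
fields[["doi"]]
```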
utils::sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SimBu_1.5.4
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.2 knitr_1.45 rlang_1.1.3
#> [5] xfun_0.43 purrr_1.0.2 textshaping_0.3.7 jsonlite_1.8.8
#> [9] htmltools_0.5.8 ragg_1.3.0 sass_0.4.9 rmarkdown_2.26
#> [13] evaluate_0.23 jquerylib_0.1.4 fastmap_1.1.1 yaml_2.3.8
#> [17] lifecycle_1.0.4 memoise_2.0.1 compiler_4.3.3 fs_1.6.3
#> [21] systemfonts_1.0.6 digest_0.6.35 R6_2.5.1 magrittr_2.0.3
#> [25] bslib_0.6.2 tools_4.3.3 pkgdown_2.0.7 cachem_1.0.8
#> [29] desc_1.4.3