Inputs and Outputs
Alexander Dietrich
Source:vignettes/simulator_input_output.Rmd
simulator_input_output.Rmd
library(Seurat)
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 'SeuratObject' was built under R 4.3.0 but the current version is
#> 4.3.3; it is recomended that you reinstall 'SeuratObject' as the ABI
#> for R may have changed
#>
#> Attaching package: 'SeuratObject'
#> The following object is masked from 'package:base':
#>
#> intersect
library(curl)
#> Using libcurl 7.81.0 with OpenSSL/3.0.2
library(SimBu)
This chapter covers the different input and output options of the package in detail.
Input
The input for your simulations is always a
SummarizedExperiment
object. You can create this object
with different constructing functions, which will explained below. It is
also possible to merge multiple datasets objects into one.
Sfaira is not covered in this vignette, but in “Public Data Integration”.
Custom data
Using existing count matrices and annotations is already covered in
the “Getting started”
vignette; this section will explain some minor details.
When generating a dataset with your own data, you need to provide the
count_matrix
parameter of dataset()
;
additionally you can provide a TPM matrix with the
tpm_matrix
. This will then lead to two simulations, one
based on counts and one based on TPMs. For either of them, genes are
located in the rows, cells in the columns.
Additionally, an annotation table is needed, with the cell-type
annotations. It needs to consist of at least out of 2 columns:
ID
and cell_type
, where ID
has to
be identical to the column names of the provides matrix/matrices. If not
all cells appear in the annotation or matrix, the intersection of both
is used to generate the dataset.
Here is some example data:
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell-", rep(1:300))
colnames(tpm) <- paste0("cell-", rep(1:300))
rownames(counts) <- paste0("gene-", rep(1:1000))
rownames(tpm) <- paste0("gene-", rep(1:1000))
annotation <- data.frame(
"ID" = paste0("cell-", rep(1:300)),
"cell_type" = c(
rep("T cells CD4", 50),
rep("T cells CD8", 50),
rep("Macrophages", 100),
rep("NK cells", 10),
rep("B cells", 70),
rep("Monocytes", 20)
),
row.names = paste0("cell-", rep(1:300))
)
seurat_obj <- Seurat::CreateSeuratObject(counts = counts, assay = "gene_expression", meta.data = annotation)
# store normalized matrix in the 'data' layer
SeuratObject::LayerData(seurat_obj, assay = "gene_expression", layer = "data") <- tpm
seurat_obj
#> An object of class Seurat
#> 1000 features across 300 samples within 1 assay
#> Active assay: gene_expression (1000 features, 0 variable features)
#> 2 layers present: counts, data
Seurat
It is possible to use a Seurat object to build a dataset; give the
name of the assay containing count data in the counts
slot,
the name of the column in the meta table with the unique cell IDs and
the name of the column in the meta table with the cell type identifier.
Additionally you may give the name of the assay containing TPM data in
the counts
slot.
ds_seurat <- SimBu::dataset_seurat(
seurat_obj = seurat_obj,
counts_layer = "counts",
cell_id_col = "ID",
cell_type_col = "cell_type",
tpm_layer = "data",
name = "seurat_dataset"
)
#> Filtering genes...
#> Created dataset.
h5ad files
It is possible to use an h5ad file directly, a file format which
stores AnnData
objects. As h5ad files can store cell specific information in the
obs
layer, no additional annotation input for SimBu is
needed.
Note: if you want both counts and tpm data as input, you will have to
provide two files; the cell annotation has to match between these two
files. As SimBu expects the cells to be in the columns and
genes/features in the rows of the input matrix, but this is not
necessarily the case for anndata objects https://falexwolf.de/img/scanpy/anndata.svg,
SimBu can handle h5ad files with cells in the obs
or
var
layer. If your cells are in obs
, use
cells_in_obs=TRUE
and FALSE
otherwise. This
will also automatically transpose the matrix. To know, which columns in
the cell annotation layer correspond to the cell identifiers and cell
type labels, use the cell_id_col
and
cell_type_col
parameters, respectively.
As this function uses the SimBu
python environment to
read the h5ad files and extract the data, it may take some more time to
initialize the conda environment at the first usage only.
# example h5ad file, where cell type info is stored in `obs` layer
# h5 <- system.file("extdata", "anndata.h5ad", package = "SimBu")
# ds_h5ad <- SimBu::dataset_h5ad(
# h5ad_file_counts = h5,
# name = "h5ad_dataset",
# cell_id_col = 0, # this will use the rownames of the metadata as cell identifiers
# cell_type_col = "group", # this will use the 'group' column of the metadata as cell type info
# cells_in_obs = TRUE # in case your cell information is stored in the var layer, switch to FALSE
# )
Merging datasets
You are able to merge multiple datasets by using the
dataset_merge
function:
ds <- SimBu::dataset(
annotation = annotation,
count_matrix = counts,
tpm_matrix = tpm,
name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.
ds_multiple <- SimBu::dataset_merge(
dataset_list = list(ds_seurat, ds),
name = "ds_multiple"
)
#> Filtering genes...
#> Created dataset.
Output
The simulation
object contains three named
entries:
-
bulk
: a SummarizedExperiment object with the pseudo-bulk dataset(s) stored in theassays
. They can be accessed like this:
simulation <- SimBu::simulate_bulk(
data = ds_multiple,
scenario = "random",
scaling_factor = "NONE",
nsamples = 10, ncells = 100
)
#> Finished simulation.
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_counts"]])
#> [1] 1000 10
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_tpm"]])
#> [1] 1000 10
If only the count matrix was given to the dataset initially, only the
bulk_counts
assay is filled.
cell_fractions
: a table where rows represent the simulated samples and columns represent the different simulated cell-types. The entries in the table store the specific cell-type fraction per sample.scaling_vector
: a named list, with the used scaling value for each cell from the single cell dataset.
utils::sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SimBu_1.5.4 curl_5.2.1 Seurat_5.0.3 SeuratObject_5.0.1
#> [5] sp_2.1-3
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.8
#> [3] magrittr_2.0.3 spatstat.utils_3.0-4
#> [5] rmarkdown_2.26 zlibbioc_1.48.2
#> [7] fs_1.6.3 ragg_1.3.0
#> [9] vctrs_0.6.5 ROCR_1.0-11
#> [11] memoise_2.0.1 spatstat.explore_3.2-7
#> [13] RCurl_1.98-1.14 S4Arrays_1.2.1
#> [15] htmltools_0.5.8 SparseArray_1.2.4
#> [17] sass_0.4.9 sctransform_0.4.1
#> [19] parallelly_1.37.1 KernSmooth_2.23-22
#> [21] bslib_0.6.2 htmlwidgets_1.6.4
#> [23] desc_1.4.3 ica_1.0-3
#> [25] plyr_1.8.9 plotly_4.10.4
#> [27] zoo_1.8-12 cachem_1.0.8
#> [29] igraph_2.0.3 mime_0.12
#> [31] lifecycle_1.0.4 pkgconfig_2.0.3
#> [33] Matrix_1.6-5 R6_2.5.1
#> [35] fastmap_1.1.1 GenomeInfoDbData_1.2.11
#> [37] MatrixGenerics_1.14.0 fitdistrplus_1.1-11
#> [39] future_1.33.2 shiny_1.8.1
#> [41] digest_0.6.35 colorspace_2.1-0
#> [43] S4Vectors_0.40.2 patchwork_1.2.0
#> [45] tensor_1.5 RSpectra_0.16-1
#> [47] irlba_2.3.5.1 GenomicRanges_1.54.1
#> [49] textshaping_0.3.7 progressr_0.14.0
#> [51] fansi_1.0.6 spatstat.sparse_3.0-3
#> [53] httr_1.4.7 polyclip_1.10-6
#> [55] abind_1.4-5 compiler_4.3.3
#> [57] BiocParallel_1.36.0 fastDummies_1.7.3
#> [59] MASS_7.3-60.0.1 DelayedArray_0.28.0
#> [61] proxyC_0.3.4 tools_4.3.3
#> [63] lmtest_0.9-40 httpuv_1.6.15
#> [65] future.apply_1.11.1 goftest_1.2-3
#> [67] glue_1.7.0 nlme_3.1-164
#> [69] promises_1.2.1 grid_4.3.3
#> [71] Rtsne_0.17 cluster_2.1.6
#> [73] reshape2_1.4.4 generics_0.1.3
#> [75] gtable_0.3.4 spatstat.data_3.0-4
#> [77] tidyr_1.3.1 data.table_1.15.2
#> [79] XVector_0.42.0 utf8_1.2.4
#> [81] BiocGenerics_0.48.1 spatstat.geom_3.2-9
#> [83] RcppAnnoy_0.0.22 ggrepel_0.9.5
#> [85] RANN_2.6.1 pillar_1.9.0
#> [87] stringr_1.5.1 spam_2.10-0
#> [89] RcppHNSW_0.6.0 later_1.3.2
#> [91] splines_4.3.3 dplyr_1.1.4
#> [93] lattice_0.22-5 survival_3.5-8
#> [95] deldir_2.0-4 tidyselect_1.2.1
#> [97] miniUI_0.1.1.1 pbapply_1.7-2
#> [99] knitr_1.45 gridExtra_2.3
#> [101] IRanges_2.36.0 SummarizedExperiment_1.32.0
#> [103] scattermore_1.2 stats4_4.3.3
#> [105] xfun_0.43 Biobase_2.62.0
#> [107] matrixStats_1.2.0 stringi_1.8.3
#> [109] lazyeval_0.2.2 yaml_2.3.8
#> [111] evaluate_0.23 codetools_0.2-19
#> [113] tibble_3.2.1 cli_3.6.2
#> [115] RcppParallel_5.1.7 uwot_0.1.16
#> [117] xtable_1.8-4 reticulate_1.35.0
#> [119] systemfonts_1.0.6 munsell_0.5.0
#> [121] jquerylib_0.1.4 GenomeInfoDb_1.38.8
#> [123] Rcpp_1.0.12 globals_0.16.3
#> [125] spatstat.random_3.2-3 png_0.1-8
#> [127] parallel_4.3.3 pkgdown_2.0.7
#> [129] ggplot2_3.5.0 dotCall64_1.1-1
#> [131] bitops_1.0-7 sparseMatrixStats_1.14.0
#> [133] listenv_0.9.1 viridisLite_0.4.2
#> [135] scales_1.3.0 ggridges_0.5.6
#> [137] crayon_1.5.2 leiden_0.4.3.1
#> [139] purrr_1.0.2 rlang_1.1.3
#> [141] cowplot_1.1.3