Bioinformatics

How to construct a spatial object in Seurat

Sign up for my newsletter to not miss a post like this: https://divingintogeneticsandgenomics.ck.page/newsletter

Single-cell spatial transcriptomics is a new technology that measures gene expression in individual cells together with each cell’s location in a tissue, revealing the complex cellular and molecular differences within it. This allows scientists to investigate how genes are expressed and how cells interact with each other in much greater detail than before.

Deep learning to predict cancer from healthy controls using TCRseq data

The T-cell receptor (TCR) is a special molecule found on the surface of a type of immune cell called a T-cell. Think of T-cells like soldiers in your body’s defense system that help identify and attack foreign invaders like viruses and bacteria. The TCR is like a sensor or antenna that allows T-cells to recognize specific targets, kind of like how a key fits into a lock.

How to deal with overplotting without being fooled

The problem

Let me be clear: when you have gazillions of data points in a scatter plot, you want to deal with the overplotting to avoid drawing misleading conclusions. Let’s start with a single-cell example.

Load the libraries:

    library(dplyr)
    library(Seurat)
    library(patchwork)
    library(ggplot2)
    library(ComplexHeatmap)
    library(SeuratData)
    set.seed(1234)

Prepare the data:

    data("pbmc3k")
    pbmc3k
    #> An object of class Seurat
    #> 13714 features across 2700 samples within 1 assay
    #> Active assay: RNA (13714 features, 0 variable features)

    ## routine processing
    pbmc3k<- pbmc3k %>% NormalizeData(normalization.

How to do gene correlation for single-cell RNAseq data (part 1)

Load libraries

    library(dplyr)
    library(Seurat)
    library(patchwork)
    library(ggplot2)
    library(ComplexHeatmap)
    library(SeuratData)
    set.seed(1234)

Prepare the data

    data("pbmc3k")
    pbmc3k
    #> An object of class Seurat
    #> 13714 features across 2700 samples within 1 assay
    #> Active assay: RNA (13714 features, 0 variable features)

    ## routine processing
    pbmc3k <- pbmc3k %>%
      NormalizeData(normalization.method = "LogNormalize", scale.factor = 10000) %>%
      FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
      ScaleData() %>%
      RunPCA(verbose = FALSE) %>%
      FindNeighbors(dims = 1:10, verbose = FALSE) %>%
      FindClusters(resolution = 0.

transpose single-cell cell x gene dataframe to gene x cell

A single-cell matrix is often represented as gene x cell in R/Seurat, but as cell x gene in Python/scanpy. Let’s use a real example to show how to transpose between the two formats. The GEO accession page is at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154763

Download the data

We can use the command line to download the count matrix from the FTP site: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE154nnn/GSE154763/suppl/

    wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE154nnn/GSE154763/suppl/GSE154763_ESCA_normalized_expression.csv.gz -O ~/blog_data/GSE154763_ESCA_normalized_expression.csv.gz

    # decompress the file
    gunzip GSE154763_ESCA_normalized_expression.csv.gz

    # this GEO matrix is cell x gene
    # take a look by https://www.
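As a rough illustration of the transposition itself, here is a minimal base R sketch on a toy table. The cell names, gene names, and values are made up for this example and are not taken from GSE154763; the real matrix is much larger, so this only shows the shape change, not how the post handles the full file.

```r
# Toy cell x gene table (rows = cells, columns = genes), the layout scanpy uses.
# Cell names, gene names, and values are hypothetical, for illustration only.
cell_by_gene <- data.frame(
  CD3D  = c(1.2, 0.0),
  MS4A1 = c(0.0, 2.3),
  LYZ   = c(0.5, 0.1),
  row.names = c("cell_1", "cell_2")
)

# Convert to a numeric matrix, then swap rows and columns with t()
# to get the gene x cell layout that Seurat expects.
gene_by_cell <- t(as.matrix(cell_by_gene))

dim(cell_by_gene)  # 2 cells x 3 genes
dim(gene_by_cell)  # 3 genes x 2 cells
```

Converting to a matrix first keeps all values numeric and carries the row and column names across the transpose, so gene names end up as row names and cell names as column names.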

How to run dockerized Rstudio server on google cloud

Create a Google VM

Follow the process using the console: https://cloud.google.com/compute/docs/instances/create-start-instance#console_1

Install Docker

Follow https://docs.docker.com/engine/install/debian/

Note that this example is for the Debian build. If you created your VM using Ubuntu as the boot disk, you should follow the Ubuntu section: https://docs.docker.com/engine/install/ubuntu/

In the GCP VM:

    sudo apt-get update
    sudo apt-get install \
        ca-certificates \
        curl \
        gnupg \
        lsb-release

    sudo mkdir -p /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.

Are PDL1 RNA and protein levels correlated in cancer cell lines?

Are protein and RNA levels correlated? This is a big question. See the replies to this tweet: https://twitter.com/slavov_n/status/1561844133496512512

In general, RNA and protein abundances should be correlated, but there are exceptions of course. Biology is complicated/weird! One of my favorite examples is Hypoxia-inducible factor 1-alpha, HIF-1α. The protein is efficiently degraded in most tissues most of the time unless stabilized by hypoxia.

use random forest and boost trees to find marker genes in scRNAseq data

This blog post is part of a series on marker gene identification using machine learning methods. Read the previous posts: logistic regression and partial least squares regression. This post explores tree-based methods: random forest and boosted trees (gradient boosted trees/XGBoost). I highly recommend going through the related sections by Josh Starmer at https://app.learney.me/maps/StatQuest

Note that all tree-based methods can be used for both classification and regression.

Partial least square regression for marker gene identification in scRNAseq data

This is an extension of my last blog post, marker gene selection using logistic regression and regularization for scRNAseq. Let’s use the same PBMC single-cell RNAseq data as an example.

Load libraries

    library(Seurat)
    library(tidyverse)
    library(tidymodels)
    library(scCustomize) # for plotting
    library(patchwork)

Preprocess the data

    # Load the PBMC dataset
    pbmc.data <- Read10X(data.dir = "~/blog_data/filtered_gene_bc_matrices/hg19/")
    # Initialize the Seurat object with the raw (non-normalized data).
    pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.

marker gene selection using logistic regression and regularization for scRNAseq

Why this blog post?

I saw a bioRxiv paper titled “A comparison of marker gene selection methods for single-cell RNA sequencing data”:

“Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student’s t-test and logistic regression”

I am interested in using logistic regression to find marker genes and want to try fitting the model in the tidymodels ecosystem and using different regularization methods.