R or Python for Bioinformatics?

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. R or Python for Bioinformatics? Watch the video here: If you need to pick Python or R for bioinformatics, which one should you choose? This is a decades-old question from many beginners. This is my story. I started learning Unix Commands 12 years ago (See an example of how powerful Unix commands can be).

Downstream of bulk RNAseq: read in salmon output using tximport and then DESeq2

Join my newsletter to not miss a post like this In the last blog post, I showed you how to use salmon to get counts from fastq files downloaded from GEO. In this post, I am going to show you how to read in the .sf salmon quantification file into R; how to get the tx2gene.txt file and do DESeq2 for differential gene expression analysis. Let’s dive in! library(tximport) library(dplyr) library(ggplot2) files<- list.

Part 2 CITE-seq downstream analysis: From Alevin output to Seurat visualization

In my last post, I showed you how to get the protein and RNA counts from a CITE-seq experiment using Simpleaf. Now that we have the raw count matrices, we are ready to explore them within R. To follow the tutorial, you can download the associated data from here. Load the data suppressPackageStartupMessages({ library(fishpond) library(ggplot2) library(dplyr) library(SingleCellExperiment) library(Seurat) library(DropletUtils) }) # set the seed set.seed(123) #gex_q <- loadFry('~/blog_data/CITEseq/alevin_rna/af_quant') #fb_q <- loadFry( '~/blog_data/CITEseq/alevin_adt/af_quant') # I saved the above objs first to rds files, now just read them back fb_q<- readRDS("~/blog_data/CITEseq/fb_q.

How to convert raw counts to TPM for TCGA data and make a heatmap across cancer types

Sign up for my newsletter to not miss a post like this The Cancer Genome Atlas (TCGA) project is probably one of the most well-known large-scale cancer sequencing project. It sequenced ~10,000 treatment-naive tumors across 33 cancer types. Different data including whole-exome, whole-genome, copy-number (SNP array), bulk RNAseq, protein expression (Reverse-Phase Protein Array), DNA methylation are available. TCGA is a very successful large sequencing project. I highly recommend learning from the organization of it.

How to create pseudobulk from single-cell RNAseq data

What is pseduobulk? Many of you have heard about bulk-RNAseq data. What is pseduobulk? Single-cell RNAseq can profile the gene expression at single-cell resolution. For differential expression, psedobulk seems to perform really well(see paper muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data). To create a pseudobulk, one can artificially add up the counts for cells from the same cell type of the same sample. In this blog post, I’ll guide you through the art of creating pseudobulk data from scRNA-seq experiments.

How to make a triangle correlation heatmap with p-values labeled

In this blog post, I am going to show you how to make a correlation heatmap with p-values and significant values labeled in the heatmap body. Let’s use the PBMC single cell data as an example. You may want to read my previous blog post How to do gene correlation for single-cell RNAseq data. Load libraries library(dplyr) library(Seurat) library(patchwork) library(ggplot2) library(ComplexHeatmap) library(SeuratData) library(hdWGCNA) library(WGCNA) set.seed(1234) prepare the data data("pbmc3k") pbmc3k #> An object of class Seurat #> 13714 features across 2700 samples within 1 assay #> Active assay: RNA (13714 features, 0 variable features) ## routine processing pbmc3k<- pbmc3k %>% NormalizeData(normalization.

How to do gene correlation for single-cell RNAseq data (part 2) using meta-cell

In my last blog post, I showed that pearson gene correlation for single-cell data has flaws because of the sparsity of the count matrix. One way to get around it is to use the so called meta-cell. One can use KNN to find the K nearest neighbors and collapse them into a meta-cell. You can implement it from scratch, but one should not re-invent the wheel. For example, you can use metacells.

Partial least square regression for marker gene identification in scRNAseq data

This is an extension of my last blog post marker gene selection using logistic regression and regularization for scRNAseq. Let’s use the same PBMC single-cell RNAseq data as an example. Load libraries library(Seurat) library(tidyverse) library(tidymodels) library(scCustomize) # for plotting library(patchwork) Preprocess the data

Load the PBMC dataset <- Read10X(data.dir = "~/blog_data/filtered_gene_bc_matrices/hg19/") # Initialize the Seurat object with the raw (non-normalized data). pbmc <- CreateSeuratObject(counts =, project = "pbmc3k", min.

Obtain metadata for public datasets in GEO

There are so many public datasets there waiting for us to mine! It is the blessing and cursing as a computational biologist! Metadata, or the data describing (e.g., responder or non-responder for the treatment) the data are critical in interpreting the analysis. Without metadata, your data are useless. People usually go to GEO or ENA to download public data. I asked this question on twitter, and I will show you how to get the metadata as suggested by all the awesome tweeps.

Develop Bioconductor packages with docker container

Readings links to read: use container I am following the last link. pull the container docker pull bioconductor/bioconductor_full:devel docker images REPOSITORY TAG IMAGE ID CREATED SIZE bioconductor/bioconductor_full devel ae3ec2be7376 3 hours ago 5.7GB seuratv3 latest 9b358ab1fd63 2 days ago 2.76GB It is 5.7G in size. start the Rstuido from the image. I have another Rstudio instance using port 8787, let me use a different one (e.