RNAseq

PCA analysis on TCGA bulk RNAseq data continued

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. In my last blog post, I showed you how to download TCGA RNAseq count data, run PCA, and make a heatmap. It is interesting to see some of the LUSC samples mix with the LUAD samples and vice versa. In this post, we will continue to use PCA for more exploratory data analysis (EDA).

PCA analysis on TCGA bulk RNAseq data

What is PCA? Principal Component Analysis (PCA) is a mathematical technique used to reduce the dimensionality of large datasets while preserving the most important patterns in the data. It transforms the original high-dimensional data into a smaller set of new variables called principal components (PCs), which capture most of the variation in the data.
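As a rough illustration of the idea in the excerpt above, here is a minimal sketch in R using base `prcomp()`. The toy matrix and the `sample1`...`gene4` names are made up for illustration and are not taken from the original post; real RNAseq counts would normally be normalized and transformed before PCA.

```r
# Toy example: 6 hypothetical samples x 4 hypothetical genes (random values).
set.seed(1)
expr <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("sample", 1:6), paste0("gene", 1:4))
)

# prcomp() expects samples in rows and features (genes) in columns.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each sample on the first two principal components.
pc_scores <- pca$x[, 1:2]
```

Plotting `pc_scores` (for example with ggplot2) is the usual next step to see whether samples group by condition or cancer type.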

Downstream of bulk RNAseq: read in salmon output using tximport and then DESeq2

In the last blog post, I showed you how to use salmon to get counts from fastq files downloaded from GEO. In this post, I am going to show you how to read the .sf salmon quantification files into R, how to get the tx2gene.txt file, and how to run DESeq2 for differential gene expression analysis. Let’s dive in! library(tximport) library(dplyr) library(ggplot2) files<- list.

How to preprocess GEO bulk RNAseq data with salmon

Install fastq-dl. To easily download fastq files from GEO or ENA, use fastq-dl. Assuming you already have conda installed, do the following:

    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda create -n fastq_download -c conda-forge -c bioconda fastq-dl
    conda activate fastq_download

Tip: use mamba if conda is too slow for you. They are all big snakes!! We will use bulk RNAseq data from this GEO accession ID: https://www.

How to convert raw counts to TPM for TCGA data and make a heatmap across cancer types

Sign up for my newsletter to not miss a post like this: https://divingintogeneticsandgenomics.ck.page/newsletter The Cancer Genome Atlas (TCGA) project is probably one of the most well-known large-scale cancer sequencing projects. It sequenced ~10,000 treatment-naive tumors across 33 cancer types. Multiple data types are available, including whole-exome, whole-genome, copy-number (SNP array), bulk RNAseq, protein expression (Reverse-Phase Protein Array), and DNA methylation. TCGA is a very successful large sequencing project, and I highly recommend learning from how it is organized.