How to level up Real-life bioinformatics skill: from dealing with one sample to a lot of samples

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics. The other day, I saw this tweet by Ramon Massoni Badosa (@rmassonix), May 15, 2024: "Machine learning and bioinformatics tutorials these days". Many bioinformatics tutorials are like that. I am not saying those tutorials are not good. Beginners need something basic first to understand the material.

How to separate a comma delimited string into multiple lines in R and python

The problem

    df <- data.frame(id = c(1,2,3), value = c('x,y', 'z,w', 'a'))
    df
    #>   id value
    #> 1  1   x,y
    #> 2  2   z,w
    #> 3  3     a

We want to split x,y in the first row into two rows (1, x and 1, y) and split z,w into two rows too.

Solution with R

There is a neat function, separate_rows, in the tidyr package that does exactly this.
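For the Python side of the title, the reshaping can be sketched with the standard library alone. This is a minimal illustration, not the post's own solution: the helper name `separate_rows` is borrowed from tidyr purely for readability and is a hypothetical function defined here, and the data frame is represented as a plain list of dicts.

```python
# Hypothetical helper (named after tidyr's function for readability only):
# split a delimited column so that each item gets its own row.
def separate_rows(rows, column, sep=","):
    """Return a new list of dicts with `column` split on `sep`, one item per row."""
    out = []
    for row in rows:
        for item in str(row[column]).split(sep):
            # Copy the other fields and keep a single split item per output row.
            out.append({**row, column: item.strip()})
    return out


# Same toy data as the R data frame in the problem statement.
df = [
    {"id": 1, "value": "x,y"},
    {"id": 2, "value": "z,w"},
    {"id": 3, "value": "a"},
]

long_rows = separate_rows(df, "value")
for row in long_rows:
    print(row["id"], row["value"])
```

With pandas, the usual route to the same result is to split the column with `.str.split(",")` and then call `DataFrame.explode` on it.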

S3 and S4 objects in R explained

In R, S3 and S4 objects are related to object-oriented programming (OOP), which allows you to create custom data structures with associated behaviors and methods. Let me explain them using simple language and metaphors, along with practical examples. S3 Objects Imagine you have a collection of toys, like cars, dolls, and action figures. Each toy has its own set of properties (color, size, material) and behaviors (move, make sounds, etc.

Part 4 CITE-seq normalization using empty droplets with the dsb package

In this post, we are going to try a CITE-seq normalization method called dsb, published in "Normalizing and denoising protein expression data from droplet-based single cell profiling". The paper describes two major components of protein expression noise in droplet-based single cell experiments: (1) protein-specific noise originating from ambient, unbound antibody encapsulated in droplets that can be accurately estimated via the level of “ambient” ADT counts in empty droplets, and (2) droplet/cell-specific noise revealed via the shared variance component associated with isotype antibody controls and background protein counts in each cell.

Generative AI: Text generation using Long short-term memory (LSTM) model

In the world of deep learning, generating sequence data is a fundamental task. Typically, this involves training a network, often an RNN (Recurrent Neural Network) or a convnet (Convolutional Neural Network), to predict the next token or a sequence of tokens in a given sequence, using the preceding tokens as input. For example, when provided with the input “the cat is on the ma,” the network’s objective is to predict the next character, such as ‘t.

use random forest and boost trees to find marker genes in scRNAseq data

This blog post is part of a series on marker gene identification using machine learning methods. Read the previous posts: logistic regression and partial least squares regression. This post explores the tree-based methods: random forest and boosted trees (gradient boosted trees/XGBoost). I highly recommend going through the related sections by Josh Starmer. Note that all the tree-based methods can be used for both classification and regression.

compare slopes in linear regression

I asked this question on Twitter.

Load the package

    library(tidyverse)

Make some dummy data

The dummy example: we have two groups of samples, disease and health. We treat those cells in vitro with different dosages (0, 1, 5) of a chemical X and count the cell number after 3 hours.

    x <- tibble( '0' = c(8.66, 11.50, 7.01, 13.40, 11.30, 8.13, 5.92, 7.54), '1' = c(22.10, 23.00, 22.00, 35.70, 32.

Monty Hall problem- a peek through simulation

I am taking this STATE-80 course from Harvard Extension School. This course teaches commonly used distributions and probability theory. The instructor, Hatch, is a really good teacher, and he uses simulation for all the demonstrations along with the formulas. In week 6, we revisited the Monty Hall problem, which we played on the first day of class. If you have not heard about it, here is the description quoted from the wiki: Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats.
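A simulation of the game can be sketched in a few lines of Python. This is a minimal sketch, not the course's or the post's code, and it assumes the standard rules that the excerpt does not reach: after the first pick, the host always opens a different door hiding a goat and then offers the chance to switch. The function names `play` and `win_rate`, the number of games, and the seed are all illustrative choices.

```python
import random


def play(switch, rng):
    """Play one game; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumed rule: the host opens a goat door that is not the player's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_games=100_000, seed=42):
    """Estimate the winning probability of a strategy by repeated play."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(n_games))
    return wins / n_games


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

Under these rules, the estimates should land near the theoretical values of 1/3 for staying and 2/3 for switching, and they tighten as `n_games` grows.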

Align multiple ggplot2 plots by axis

I used to use cowplot to align multiple ggplot2 plots, but when the x-axes have different ranges, some extra work is needed to align the axes as well. The other day I was reading a blog post by GuangChuang Yu, and he tackled exactly this problem. His packages, such as ChIPseeker, clusterProfiler, and ggtree, are quite popular among users. A dummy example from his post:

    library(dplyr)
    library(ggplot2)
    library(ggstance)
    library(cowplot)
    # devtools::install_github("YuLab-SMU/treeio")
    # devtools::install_github("YuLab-SMU/ggtree")
    library(tidytree)
    library(ggtree)
    no_legend=theme(legend.