Machine learning

How to use random forest as a clustering method

If you ask me what my favorite machine learning algorithm is, I would answer logistic regression (with regularization: Lasso, Ridge and Elastic Net), followed by random forest. In fact, that’s the order in which we try those methods. Deep learning can perform well on tabular data with a complicated architecture, while random forest and boosted-tree methods usually work well out of the box. Regression and random forest are more interpretable too.

scRNAseq clustering significance test: an unsolvable problem?

Introduction In scRNA-seq data analysis, one of the most crucial and demanding tasks is determining the optimal resolution and cluster number. Striking the right balance between over-clustering and under-clustering is difficult, and it directly affects the identification of distinct cell populations and the biological insights drawn from them. Clustering algorithms have many parameters to tune, and they can generate more clusters if, for example, you increase the resolution parameter. However, whether the newly generated clusters are meaningful is an open question.
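To make the point about parameters concrete, here is a minimal sketch in base R. It is not the post's method: k-means is swapped in for graph-based clustering, and the number of centers `k` stands in for the resolution parameter. The toy matrix `cells` is a made-up example with no real structure, yet the algorithm still returns as many clusters as requested.

```r
# Toy illustration (not scRNA-seq data): 200 points drawn from ONE Gaussian
# blob, so by construction there is no real cluster structure.
set.seed(42)
cells <- matrix(rnorm(200 * 2), ncol = 2)

# k is the "give me more clusters" knob here; k-means is only a stand-in
# for graph-based clustering with a resolution parameter.
ks <- c(2, 4, 8)
fits <- lapply(ks, function(k) {
  kmeans(cells, centers = k, nstart = 20, iter.max = 50)
})

# Number of clusters actually returned, and total within-cluster sum of squares.
n_clusters <- sapply(fits, function(f) length(unique(f$cluster)))
within_ss  <- sapply(fits, function(f) f$tot.withinss)

print(data.frame(k = ks, n_clusters = n_clusters, within_ss = within_ss))
```

The within-cluster sum of squares keeps shrinking as `k` grows, even though the data contain a single population. A better fit to the algorithm's objective is therefore not evidence that the extra clusters are meaningful.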

Has AI changed the course of Drug Development?

What’s the drug development process? Has AI changed the course of Drug Development? To answer this question, we first need to understand the drug development process. The whole process includes the following: target identification; target pharmacology and biomarker development; lead identification, lead optimization; clinical research & development; regulatory review of IND (investigational new drug) and later phase clinical trials; post-marketing knowledge. Biologics/antibodies drug development follows a similar path (you can find the map in the same link).

How to classify MNIST images with convolutional neural network

Introduction A type of artificial intelligence system called a convolutional neural network (CNN) has gained a lot of popularity recently. CNNs are especially well suited to tasks like image recognition, where we want to teach a computer to recognize things in a picture. CNNs operate by dissecting an image into increasingly minute components, or “features.” The network then examines each feature and searches for patterns shared by various objects. For instance, a CNN might come to understand that some pixel patterns are frequently linked to faces, while others are linked to vehicles or trees.
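As a rough illustration of what detecting a "feature" means, the sketch below slides one small filter over a toy image in base R. The function name `conv2d_valid`, the 5 x 5 image and the hand-written edge kernel are all assumptions for this example. A real CNN learns its kernels from data and stacks many such layers. Like most deep learning libraries, this computes a cross-correlation (the kernel is not flipped).

```r
# Slide a kernel over an image without padding ("valid" mode) and
# record the weighted sum of each patch.
conv2d_valid <- function(img, kernel) {
  kh <- nrow(kernel)
  kw <- ncol(kernel)
  out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + kh - 1), j:(j + kw - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# Toy 5 x 5 image: dark (0) in columns 1-2, bright (1) in columns 3-5.
img <- matrix(0, nrow = 5, ncol = 5)
img[, 3:5] <- 1

# Hand-written kernel whose columns are -1, 0, +1: it responds where
# brightness increases from left to right (a vertical edge).
kernel <- matrix(c(-1, -1, -1, 0, 0, 0, 1, 1, 1), nrow = 3)

feature_map <- conv2d_valid(img, kernel)
print(feature_map)
```

The feature map is large where a patch straddles the dark-to-bright edge and zero where the patch is uniformly bright. In that sense, a filter "finds" a pattern.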

Basic tensor/array manipulations in R

Sign up for my newsletter to not miss a post like this. In my last post, I showed you how to build a neural network in Keras with fewer than 20 lines of code. One of the key roadblocks for beginners is transforming the input into the right shape of tensor (the deep learning terminology) or array (the R built-in type). In this post, I am going to show you some basic manipulations of the array.
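A few of the shape manipulations this kind of post deals with can be sketched in base R alone. The toy array `x` (2 "images" of 3 x 4 pixels) is a made-up example, not the post's data. Note that R fills arrays column by column, so flattening each image row by row needs an explicit `aperm()`.

```r
# Toy data: 2 "images", each 3 x 4 pixels, holding the values 1..24.
x <- array(1:24, dim = c(2, 3, 4))
print(dim(x))

# Flatten every image into one row (2 samples x 12 features).
# aperm() reorders the axes so that each image is read row by row;
# a plain dim<- would read it in R's column-major order instead.
flat <- matrix(as.vector(aperm(x, c(3, 2, 1))),
               nrow = dim(x)[1], byrow = TRUE)
print(dim(flat))

# Add a trailing axis of length 1 (e.g. a single grayscale channel).
x4 <- x
dim(x4) <- c(dim(x), 1)
print(dim(x4))
```

Here `flat[1, ]` holds the first image read row by row, the same values as `as.vector(t(x[1, , ]))`.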

Deep learning with Keras using MNIST dataset

Introduction Are you a machine learning practitioner or data analyst looking to broaden your skill set? Look no further! This blog post offers an introduction to deep learning, which is currently the hottest topic in machine learning. Using the well-known MNIST dataset and the Keras package, we will investigate the potential of deep learning.

Partial least square regression for marker gene identification in scRNAseq data

This is an extension of my last blog post, “marker gene selection using logistic regression and regularization for scRNAseq”. Let’s use the same PBMC single-cell RNAseq data as an example.

Load libraries

library(Seurat)
library(tidyverse)
library(tidymodels)
library(scCustomize) # for plotting
library(patchwork)

Preprocess the data

Load the PBMC dataset
<- Read10X(data.dir = "~/blog_data/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts =, project = "pbmc3k", min.

marker gene selection using logistic regression and regularization for scRNAseq

Why this blog post? I saw a bioRxiv paper titled “A comparison of marker gene selection methods for single-cell RNA sequencing data”. The paper states: “Our results highlight the efficacy of simple methods, especially the Wilcoxon rank-sum test, Student’s t-test and logistic regression.” I am interested in using logistic regression to find marker genes, and I want to try fitting the model in the tidymodels ecosystem with different regularization methods.