You need to master it if you deal with genomics data

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

Motivation

What’s the most common problem you need to solve when dealing with genomics data?

For me, it is Genomic Intervals!

The genomics data usually represents linearly: chromosome name, start and end.

We use it to define a region in the genome ( A peak from ChIP-seq data); the location of a gene, a DNA methylation site ( a single point), a mutation call (a single point), and a duplication region in cancer etc.

When I first started to learn programming 12 years ago in a wet lab, my task was to find where a set of peaks (from ChIP-seq) bind to genes. To solve this, we have two files (dummy example below):

a peak file with chr, start, end for each row

chr1 200 300 peak1
chr2 400 500 peak2
chr3 456 888 peak3
.....

A gene file also has chr, start, end, and name for each row denoting the gene’s transcription start sites (TSS) + 50 bp upstream and downstream:

chr1 250 350 gene1
chr2 600 700 gene2
chr3 700 800 gene3
....

The task is “easy”, find the overlaps of those two files (in fact, you can eyeball this example, peak1 binds to gene1, peak3 binds to gene3)!

As a beginner, I did not know much, so I read in both files with Python, loop over the lines and compare.

For two regions to overlap, chr should be the same and there could be the following 4 conditions:


                 start1 ------------------end1
           start2----------------end2



                start1 ------------------end1
                           start2----------------end2



              start1 ---------------------end1
                     start2---------end2



                 start1 ---------------end1
              start2---------------------------end2

These are a lot of conditions to compare! Instead, we can find the conditions that the two regions do not overlap:

                                     start1 ------------------end1
          start2----------------end2


            start 1 ------------------end1
                                           start2----------------end2

The comparison will be: if NOT ((start2 > end1) or (start1 > end2)): then two regions overlap! My brute force method works! and I felt accomplished as a beginner.

As I become a little more experienced, I get to know the interval tree data structure which makes those types of comparisons much faster and efficient.

In 2010, bedtools was published! and in a single command (bedtools intersect) you can accomplish what I did with my Python script.

Remember, I wrote my script in 2012, two years after bedtools was published.

The problem was I did not know this tool even existed!

As a beginner, ignorance of what’s out there is the price to pay. (wink, follow me on X https://x.com/tangming2005, I tweet tools and papers)

My story was not alone. Last Thursday, I had the pleasure and honor to have dinner with Dr. Ting Wang. We invited him to give a talk at our company.

He told me that in the early days, he wrote a Perl script to do the intersection of genomic regions and found TP53 binds to Transposable elements(TE). see his paper https://pubmed.ncbi.nlm.nih.gov/18003932/ in 2007. His lab has published 3 papers on Nature Genetics on TE. e.g., Pan-cancer analysis identifies tumor-specific antigens derived from transposable elements.

Of course, Ting was a formally trained PhD in bioinformatics and I am sure his Perl script is much better than my crappy one.

But this tells you how common this type of analysis is, and bedtools comes to the rescue in 2010.

Bonus: watch this video on how I use conda to install bedtools.

Later, I started to learn more about the Bioconductor ecosystem and learned the GenomicRanges package which is the foundation of dealing with genomic intervals in R.

You can work with bedtools on command line. But if you want to work within R, let’s use GenomicRanges. Before that, we need to know IRanges upon which GenomicRanges is built.

Introduction to IRanges

In Biconductor, the IRanges implements the Interval data structure.

Let’s take a look at some examples

library(IRanges)
ir <- IRanges(c(1, 8, 14, 15, 19), 
              width=c(12, 6, 6, 15, 6))

ir2 <- IRanges(c(5, 9, 12, 25, 50), 
              width=c(3, 6, 6, 15, 6))

ir 
#> IRanges object with 5 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1        12        12
#>   [2]         8        13         6
#>   [3]        14        19         6
#>   [4]        15        29        15
#>   [5]        19        24         6
ir2
#> IRanges object with 5 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         5         7         3
#>   [2]         9        14         6
#>   [3]        12        17         6
#>   [4]        25        39        15
#>   [5]        50        55         6

An IRanges object has start, end and width.

Let’s visualize them.

plotRanges <- function(x, xlim=x, main=deparse(substitute(x)),
                        col="black", sep=0.5, ...) {
         height <- 1
         if (is(xlim, "IntegerRanges"))
                 xlim <- c(min(start(xlim)), max(end(xlim)))
         bins <- disjointBins(IRanges(start(x), end(x) + 1))
        plot.new()
        plot.window(xlim, c(0, max(bins)*(height + sep)))
        ybottom <- bins * (sep + height) - height
        rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...)
        title(main)
        axis(1)
 }

plotRanges(ir)

plotRanges(ir2)

some useful functions from the package. Check the visualization to understand the output of each function.

intersect:

intersect(ir, ir2)
#> IRanges object with 3 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         5         7         3
#>   [2]         9        17         9
#>   [3]        25        29         5
plotRanges(intersect(ir, ir2))

reduce:

plotRanges(ir2)

reduce(ir2)
#> IRanges object with 4 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         5         7         3
#>   [2]         9        17         9
#>   [3]        25        39        15
#>   [4]        50        55         6
plotRanges(reduce(ir2))

union:

union(ir, ir2)
#> IRanges object with 2 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1        39        39
#>   [2]        50        55         6
plotRanges(union(ir, ir2))

setdiff(ir, ir2)
#> IRanges object with 3 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1         4         4
#>   [2]         8         8         1
#>   [3]        18        24         7
plotRanges(setdiff(ir, ir2))

flank generates flanking ranges for each range in x. If start is TRUE for a given range, the flanking occurs at the start, otherwise the end. The widths of the flanks are given by the width parameter.

flank(ir, width = 2)
#> IRanges object with 5 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]        -1         0         2
#>   [2]         6         7         2
#>   [3]        12        13         2
#>   [4]        13        14         2
#>   [5]        17        18         2
plotRanges(flank(ir, width=2))

shift shifts all the ranges in x by the amount specified by the shift argument.

shift(ir2, shift=2)
#> IRanges object with 5 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         7         9         3
#>   [2]        11        16         6
#>   [3]        14        19         6
#>   [4]        27        41        15
#>   [5]        52        57         6
plotRanges(ir2)

plotRanges(shift(ir2, shift=2))

resize resizes the ranges to the specified width where either the start, end, or center is used as an anchor.

resize(ir, width = 2, fix="start")
#> IRanges object with 5 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1         2         2
#>   [2]         8         9         2
#>   [3]        14        15         2
#>   [4]        15        16         2
#>   [5]        19        20         2
plotRanges(resize(ir, width = 2, fix="start"))

precede(x, subject, select=c("first", "all")): For each range in x, precede returns the index of the interval in subject that is directly preceded by the query range. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.

precede(ir, ir2)
#> [1] 4 4 4 5 4

follow(x, subject, select=c("last", "all")): The opposite of precede, this function returns the index of the range in subject that a query range in x directly follows. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject

follow(ir, ir2)
#> [1] NA  1  1  2  3

gaps and disjoin:

plotRanges(ir2)

gaps(ir2)
#> IRanges object with 3 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         8         8         1
#>   [2]        18        24         7
#>   [3]        40        49        10
plotRanges(gaps(ir2))

disjoin(ir2)
#> IRanges object with 6 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         5         7         3
#>   [2]         9        11         3
#>   [3]        12        14         3
#>   [4]        15        17         3
#>   [5]        25        39        15
#>   [6]        50        55         6
plotRanges(disjoin(ir2))

The most common operation is to find the overlaps.

overlap<- findOverlaps(ir, ir2)
overlap
#> Hits object with 9 hits and 0 metadata columns:
#>       queryHits subjectHits
#>       <integer>   <integer>
#>   [1]         1           1
#>   [2]         1           2
#>   [3]         1           3
#>   [4]         2           2
#>   [5]         2           3
#>   [6]         3           2
#>   [7]         3           3
#>   [8]         4           3
#>   [9]         4           4
#>   -------
#>   queryLength: 5 / subjectLength: 5

overlap is a Hits object.

queryHits(overlap)
#> [1] 1 1 1 2 2 3 3 4 4
subjectHits(overlap)
#> [1] 1 2 3 2 3 2 3 3 4

It returns the indices of the query hits and the subject hits.

You can turn it to a matrix:

as.matrix(overlap)
#>       queryHits subjectHits
#>  [1,]         1           1
#>  [2,]         1           2
#>  [3,]         1           3
#>  [4,]         2           2
#>  [5,]         2           3
#>  [6,]         3           2
#>  [7,]         3           3
#>  [8,]         4           3
#>  [9,]         4           4

distance returns the pair-wise distance of two GenomicRanges. The distance method is symmetric; y cannot be missing. If x and y are not the same length, the shortest will be recycled to match the length of the longest. The select argument is not available for distance because comparisons are made in a pair-wise fashion. The return value is the length of the longest of x and y.

distance(ir, ir2)
#> [1]  0  0  0  0 25

nearest performs conventional nearest neighbor finding. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in query:

nearest(ir, ir2)
#> [1] 1 1 2 2 4

Introduction to GenomicRanges

GenomicRanges is a natural extension of the IRanges, adding chromosome name and the strand information:

#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("GenomicRanges")

library(GenomicRanges)

x <- GRanges("chr1", IRanges(c(2, 9) , c(7, 19)), strand=c("+", "-"))
y <- GRanges("chr1", IRanges(5, 10), strand="-") 

union(x, y)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       2-7      +
#>   [2]     chr1      5-19      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
union(x, y, ignore.strand=TRUE)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      2-19      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersect(x, y)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      9-10      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
intersect(x, y, ignore.strand=TRUE)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       5-7      *
#>   [2]     chr1      9-10      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
setdiff(x, y)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       2-7      +
#>   [2]     chr1     11-19      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
setdiff(x, y, ignore.strand=TRUE)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       2-4      *
#>   [2]     chr1     11-19      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Other functions behavior similarily with the ones in IRanges:

flank(x, width = 2)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       0-1      +
#>   [2]     chr1     20-21      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
promoters(x, upstream = 2, downstream = 2)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       0-3      +
#>   [2]     chr1     18-21      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Notice that it takes account of the strand information. When the strand is -. the promoter returns the range around the end position.

You can use resize to return the transcription start site:

resize(x, width=1, fix = "start")
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1         2      +
#>   [2]     chr1        19      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps(x, y)
#> Hits object with 1 hit and 0 metadata columns:
#>       queryHits subjectHits
#>       <integer>   <integer>
#>   [1]         2           1
#>   -------
#>   queryLength: 2 / subjectLength: 1

A real-world example

Let’s see how can we use it for a real-world bioinformatics problem.

Go to GSE120885:

Download HIF1a ChIP-seq peaks at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3417778

Download HIF2a ChIP-seq peaks at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3417780

You should have two files downloaded

# HIF1a
GSM3417778_6h_0.5__PM14_1_peaks.bed.gz 

# HIF2a
GSM3417780_6h_0.5__PM9_1_peaks.bed.gz

HIF1α (Hypoxia-Inducible Factor 1α) and HIF2α (Hypoxia-Inducible Factor 2α) are transcription factors that help cells respond to low oxygen (hypoxia). HIF1α regulates genes involved in oxygen transport and metabolism, while HIF2α plays a role in more specific processes like stem cell function and vascular development. Both proteins are crucial in cancer biology, as they enable tumors to adapt to low oxygen conditions, promoting growth and survival under stress.

Let’s ask simple questions:

  • How many HIF1a peaks overlap with HIF2a peaks?
  • How many HIF1a peaks located at the promoter regions? (Note the peaks coordinates are hg19 version)
  • How to get the nearest genes of the HIF1a peaks?

We can use rtracklayer to load the bed files.

# BiocManager::install("rtracklayer")
library(rtracklayer)
HIF1_peaks<- import("~/blog_data/GSM3417778_6h_0.5__PM14_1_peaks.bed.gz")
HIF2_peaks<- import("~/blog_data/GSM3417780_6h_0.5__PM9_1_peaks.bed.gz")

HIF1_peaks
#> GRanges object with 1496 ranges and 2 metadata columns:
#>          seqnames              ranges strand |           name     score
#>             <Rle>           <IRanges>  <Rle> |    <character> <numeric>
#>      [1]     chr1       763060-763168      * |    MACS_peak_1     60.05
#>      [2]     chr1       954905-955303      * |    MACS_peak_2    106.52
#>      [3]     chr1     5439920-5440090      * |    MACS_peak_3     51.16
#>      [4]     chr1     5765488-5765761      * |    MACS_peak_4     57.23
#>      [5]     chr1     5882033-5882303      * |    MACS_peak_5     50.23
#>      ...      ...                 ...    ... .            ...       ...
#>   [1492]     chrX   80823360-80823533      * | MACS_peak_1492     59.37
#>   [1493]     chrX 133683313-133683581      * | MACS_peak_1493    119.01
#>   [1494]     chrX 151143115-151143368      * | MACS_peak_1494     93.84
#>   [1495]     chrX 152582879-152583036      * | MACS_peak_1495     53.76
#>   [1496]     chrX 153606352-153606601      * | MACS_peak_1496     68.16
#>   -------
#>   seqinfo: 29 sequences from an unspecified genome; no seqlengths
HIF2_peaks
#> GRanges object with 1417 ranges and 2 metadata columns:
#>          seqnames              ranges strand |           name     score
#>             <Rle>           <IRanges>  <Rle> |    <character> <numeric>
#>      [1]     chr1     2383154-2383508      * |    MACS_peak_1     82.11
#>      [2]     chr1     2566751-2567041      * |    MACS_peak_2     51.35
#>      [3]     chr1     3459508-3460142      * |    MACS_peak_3     96.47
#>      [4]     chr1     3464750-3465045      * |    MACS_peak_4     51.45
#>      [5]     chr1     4318344-4318487      * |    MACS_peak_5     61.02
#>      ...      ...                 ...    ... .            ...       ...
#>   [1413]     chrX 102862582-102862698      * | MACS_peak_1413     53.25
#>   [1414]     chrX 116435651-116435799      * | MACS_peak_1414     50.49
#>   [1415]     chrX 127756374-127756490      * | MACS_peak_1415     54.87
#>   [1416]     chrX 130129816-130129963      * | MACS_peak_1416     50.70
#>   [1417]     chrX 142617589-142617695      * | MACS_peak_1417     50.34
#>   -------
#>   seqinfo: 26 sequences from an unspecified genome; no seqlengths

How many peaks HIF1a overlaps with HIF2a?

length(HIF1_peaks)
#> [1] 1496
length(HIF2_peaks)
#> [1] 1417
subsetByOverlaps(HIF1_peaks, HIF2_peaks)
#> GRanges object with 93 ranges and 2 metadata columns:
#>              seqnames              ranges strand |           name     score
#>                 <Rle>           <IRanges>  <Rle> |    <character> <numeric>
#>    [1]           chr1     8938337-8939542      * |    MACS_peak_9   1224.68
#>    [2]           chr1   21564937-21565326      * |   MACS_peak_23     90.64
#>    [3]           chr1   23503867-23504391      * |   MACS_peak_24     58.14
#>    [4]           chr1   65614184-65614750      * |   MACS_peak_45    369.22
#>    [5]           chr1 142535453-142535612      * |   MACS_peak_58     66.92
#>    ...            ...                 ...    ... .            ...       ...
#>   [89]           chr8 134387536-134388301      * | MACS_peak_1411    421.12
#>   [90]           chr9   93956056-93956397      * | MACS_peak_1448     57.95
#>   [91]           chr9 121093933-121094280      * | MACS_peak_1468    135.11
#>   [92]           chr9 136202761-136203184      * | MACS_peak_1483     54.57
#>   [93] chrUn_gl000220       118399-126340      * | MACS_peak_1487    101.57
#>   -------
#>   seqinfo: 29 sequences from an unspecified genome; no seqlengths

93 HIF1a peaks overlap with HIF2a peaks

subsetByOverlaps(HIF2_peaks, HIF1_peaks)
#> GRanges object with 94 ranges and 2 metadata columns:
#>              seqnames              ranges strand |           name     score
#>                 <Rle>           <IRanges>  <Rle> |    <character> <numeric>
#>    [1]           chr1     8938913-8939522      * |   MACS_peak_11    251.69
#>    [2]           chr1   21564813-21565350      * |   MACS_peak_33    389.00
#>    [3]           chr1   23503900-23504222      * |   MACS_peak_36     52.22
#>    [4]           chr1   65614200-65614665      * |   MACS_peak_69    143.21
#>    [5]           chr1 142535444-142535599      * |   MACS_peak_93     68.69
#>    ...            ...                 ...    ... .            ...       ...
#>   [90]           chr8 134387445-134388247      * | MACS_peak_1329    224.87
#>   [91]           chr9   93955941-93956494      * | MACS_peak_1369    363.15
#>   [92]           chr9 121094013-121094130      * | MACS_peak_1391     63.00
#>   [93]           chr9 136203030-136203291      * | MACS_peak_1402     51.10
#>   [94] chrUn_gl000220       118416-125420      * | MACS_peak_1410    212.47
#>   -------
#>   seqinfo: 26 sequences from an unspecified genome; no seqlengths

94 HIF2a peaks overlap with HIF1a peaks.

There must be 2 HIF2a peaks overlap with the same HIF1a peak.

# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(AnnotationDbi)


library(dplyr)
txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene

## get the promoters
hg19_genes<- genes(txdb)

gene_symbols <- AnnotationDbi::select(org.Hs.eg.db, 
                                      keys = hg19_genes$gene_id, 
                                      column = "SYMBOL", 
                                      keytype = "ENTREZID", 
                                      multiVals = "first")

head(gene_symbols)
#>    ENTREZID  SYMBOL
#> 1         1    A1BG
#> 2        10    NAT2
#> 3       100     ADA
#> 4      1000    CDH2
#> 5     10000    AKT3
#> 6 100008586 GAGE12F
# double check the id order is the same
all.equal(gene_symbols$ENTREZID, hg19_genes$gene_id)
#> [1] TRUE
# add the gene symbol to it
hg19_genes$symbol<- gene_symbols$SYMBOL

hg19_promoters<- promoters(hg19_genes, upstream = 2000, downstream = 2000)

hg19_promoters
#> GRanges object with 23056 ranges and 2 metadata columns:
#>         seqnames              ranges strand |     gene_id      symbol
#>            <Rle>           <IRanges>  <Rle> | <character> <character>
#>       1    chr19   58872215-58876214      - |           1        A1BG
#>      10     chr8   18246755-18250754      + |          10        NAT2
#>     100    chr20   43278377-43282376      - |         100         ADA
#>    1000    chr18   25755446-25759445      - |        1000        CDH2
#>   10000     chr1 244004887-244008886      - |       10000        AKT3
#>     ...      ...                 ...    ... .         ...         ...
#>    9991     chr9 115093945-115097944      - |        9991       PTBP3
#>    9992    chr21   35734323-35738322      + |        9992       KCNE2
#>    9993    chr22   19107968-19111967      - |        9993       DGCR2
#>    9994     chr6   90537619-90541618      + |        9994    CASP8AP2
#>    9997    chr22   50962906-50966905      - |        9997        SCO2
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

How many HIF1a peaks overlap with promoters?

subsetByOverlaps(HIF1_peaks, hg19_promoters)
#> GRanges object with 275 ranges and 2 metadata columns:
#>         seqnames              ranges strand |           name     score
#>            <Rle>           <IRanges>  <Rle> |    <character> <numeric>
#>     [1]     chr1       763060-763168      * |    MACS_peak_1     60.05
#>     [2]     chr1       954905-955303      * |    MACS_peak_2    106.52
#>     [3]     chr1     8938337-8939542      * |    MACS_peak_9   1224.68
#>     [4]     chr1   12289915-12290246      * |   MACS_peak_13     70.79
#>     [5]     chr1   14924055-14924254      * |   MACS_peak_17     55.58
#>     ...      ...                 ...    ... .            ...       ...
#>   [271]     chrX   49042798-49043142      * | MACS_peak_1490     66.54
#>   [272]     chrX   77359244-77359848      * | MACS_peak_1491    577.87
#>   [273]     chrX 133683313-133683581      * | MACS_peak_1493    119.01
#>   [274]     chrX 151143115-151143368      * | MACS_peak_1494     93.84
#>   [275]     chrX 153606352-153606601      * | MACS_peak_1496     68.16
#>   -------
#>   seqinfo: 29 sequences from an unspecified genome; no seqlengths

275 peaks are at the promoter region.

what genes are nearby?

nearby_gene_index<- nearest(HIF1_peaks, hg19_promoters)

# there is NAs in the index. 
table(is.na(nearby_gene_index))
#> 
#> FALSE  TRUE 
#>  1494     2
HIF1_peaks<- HIF1_peaks[!is.na(nearby_gene_index)]

# remove the NAs
nearby_gene_index<- nearby_gene_index[!is.na(nearby_gene_index)]

# the nearest genes 
hg19_promoters[nearby_gene_index]
#> GRanges object with 1494 ranges and 2 metadata columns:
#>             seqnames              ranges strand |     gene_id      symbol
#>                <Rle>           <IRanges>  <Rle> | <character> <character>
#>       79854     chr1       760903-764902      - |       79854   LINC00115
#>      375790     chr1       953503-957502      + |      375790        AGRN
#>   100616489     chr1     5622131-5626130      + |   100616489        <NA>
#>   100616489     chr1     5622131-5626130      + |   100616489        <NA>
#>   100616421     chr1     5920802-5924801      - |   100616421     MIR4689
#>         ...      ...                 ...    ... .         ...         ...
#>       79366     chrX   80455442-80459441      - |       79366       HMGN5
#>   100506757     chrX 133682054-133686053      + |   100506757   LINC00629
#>        2564     chrX 151141152-151145151      - |        2564       GABRE
#>       10838     chrX 152597613-152601612      + |       10838      ZNF275
#>        2010     chrX 153605597-153609596      + |        2010         EMD
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome
# distance between the peak and the TSS of the nearest gene
distance_to_peaks<- distance(HIF1_peaks, 
                    resize(hg19_genes, width =1, fix="start")[nearby_gene_index])

combine the peaks and the gene information

peak_df<- as.data.frame(HIF1_peaks) 

peak_df$gene_id<- hg19_promoters[nearby_gene_index]$gene_id

peak_df$symbol<- hg19_promoters[nearby_gene_index]$symbol

peak_df$distance<- distance_to_peaks

head(peak_df)
#>   seqnames   start     end width strand        name  score   gene_id    symbol
#> 1     chr1  763060  763168   109      * MACS_peak_1  60.05     79854 LINC00115
#> 2     chr1  954905  955303   399      * MACS_peak_2 106.52    375790      AGRN
#> 3     chr1 5439920 5440090   171      * MACS_peak_3  51.16 100616489      <NA>
#> 4     chr1 5765488 5765761   274      * MACS_peak_4  57.23 100616489      <NA>
#> 5     chr1 5882033 5882303   271      * MACS_peak_5  50.23 100616421   MIR4689
#> 6     chr1 6824756 6824931   176      * MACS_peak_6  57.10     23261    CAMTA1
#>   distance
#> 1      157
#> 2      199
#> 3   184040
#> 4   141356
#> 5    40497
#> 6    20452

Hoary! We annotated the peaks with the closest genes from scratch!

Final note

You should use well-tested packages such as ChIPseeker to do such analysis, but understanding the basics and the low-level functions is super useful!

Happy Learning!

Tommy

To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.

Related

Previous
comments powered by Disqus