To not miss a post like this, sign up for my newsletter to learn computational biology and bioinformatics.
The problem
df<- data.frame(id = c(1,2,3), value = c('x,y', 'z,w', 'a'))
df
#> id value
#> 1 1 x,y
#> 2 2 z,w
#> 3 3 a
we want to put x,y in the first row into two rows:
1, x
1, y
and put z,w into two rows too.
solution with R
There is a neat function separate_rows
that does exactly this in tidyr
package:
library(tidyr)
library(dplyr)
df %>%
tidyr::separate_rows(value, sep=",")
#> # A tibble: 5 × 2
#> id value
#> <dbl> <chr>
#> 1 1 x
#> 2 1 y
#> 3 2 z
#> 4 2 w
#> 5 3 a
We can also convert it back using some purrr tricks.
library(tidyr)
library(purrr)
df_nest<- df %>%
tidyr::separate_rows(value, sep=",") %>%
group_by(id) %>%
nest()
df_nest
#> # A tibble: 3 × 2
#> # Groups: id [3]
#> id data
#> <dbl> <list>
#> 1 1 <tibble [2 × 1]>
#> 2 2 <tibble [2 × 1]>
#> 3 3 <tibble [1 × 1]>
df_nest$data[[1]]
#> # A tibble: 2 × 1
#> value
#> <chr>
#> 1 x
#> 2 y
# convert it back
df_nest %>%
mutate(new_data = map_chr(data, ~paste(.x$value, collapse = ",")))
#> # A tibble: 3 × 3
#> # Groups: id [3]
#> id data new_data
#> <dbl> <list> <chr>
#> 1 1 <tibble [2 × 1]> x,y
#> 2 2 <tibble [2 × 1]> z,w
#> 3 3 <tibble [1 × 1]> a
solution with python
Note, how to use python in Rmarkdown, read https://rstudio.github.io/reticulate/articles/versions.html
library(reticulate)
use_condaenv("r-reticulate")
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'data': ['x, y', 'z, w', 'a']})
print(df)
# Split text into a list
#> id data
#> 0 1 x, y
#> 1 2 z, w
#> 2 3 a
df['data'] = df['data'].str.split(',')
print(df)
# Convert list into multiple rows
#> id data
#> 0 1 [x, y]
#> 1 2 [z, w]
#> 2 3 [a]
df = df.explode('data')
print(df)
#> id data
#> 0 1 x
#> 0 1 y
#> 1 2 z
#> 1 2 w
#> 2 3 a
we used explode
function in pandas. Of course, you can do it manually:
read in the dataframe line by line, split the last column to a list using str.split(‘,’) and open an empty file and write the str.splited entries into new lines.
use case in bioinformatics?
follow https://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html for gene set enrichment analysis.
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
# Entrez gene ID
head(gene)
#> [1] "4312" "8318" "10874" "55143" "55388" "991"
Let’s do a go-term over-representation analysis
library(org.Hs.eg.db)
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
#> ID Description GeneRatio
#> GO:0005819 GO:0005819 spindle 26/201
#> GO:0000779 GO:0000779 condensed chromosome, centromeric region 16/201
#> GO:0072686 GO:0072686 mitotic spindle 17/201
#> GO:0000775 GO:0000775 chromosome, centromeric region 18/201
#> GO:0098687 GO:0098687 chromosomal region 23/201
#> GO:0000776 GO:0000776 kinetochore 15/201
#> BgRatio pvalue p.adjust qvalue
#> GO:0005819 306/11853 1.072029e-11 3.151766e-09 2.888837e-09
#> GO:0000779 114/11853 7.709944e-11 8.659125e-09 7.936756e-09
#> GO:0072686 133/11853 8.835841e-11 8.659125e-09 7.936756e-09
#> GO:0000775 158/11853 1.684987e-10 1.179661e-08 1.081250e-08
#> GO:0098687 272/11853 2.006225e-10 1.179661e-08 1.081250e-08
#> GO:0000776 106/11853 2.733425e-10 1.339378e-08 1.227644e-08
#> geneID
#> GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TRAT1/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
#> GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CCNB1
#> GO:0072686 KIF23/CENPE/ASPM/SKA1/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/TRAT1/AURKB/PRC1/KIFC1/KIF18B/AURKA
#> GO:0000775 CDCA8/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/CCNB1
#> GO:0098687 CDCA8/CENPE/NDC80/TOP2A/HJURP/SKA1/NEK2/CENPM/RAD51AP1/CENPN/CDK1/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/EZH2/TTK/NCAPG/AURKB/CHEK1/CCNB1/MCM5
#> GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
#> Count
#> GO:0005819 26
#> GO:0000779 16
#> GO:0072686 17
#> GO:0000775 18
#> GO:0098687 23
#> GO:0000776 15
The last geneID
column is a list of genes separated by /
. If I want to
get a tidy dataframe, I can do
ego@result %>%
tidyr::separate_rows(geneID, sep="/")
#> # A tibble: 918 × 9
#> ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <int>
#> 1 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 CDCA8 26
#> 2 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 CDC20 26
#> 3 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 KIF23 26
#> 4 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 CENPE 26
#> 5 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 ASPM 26
#> 6 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 DLGAP5 26
#> 7 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 SKA1 26
#> 8 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 NUSAP1 26
#> 9 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 TPX2 26
#> 10 GO:0005… spindle 26/201 306/11… 1.07e-11 3.15e-9 2.89e-9 TACC3 26
#> # ℹ 908 more rows
Now each gene has its own row!