ENSEMBL BioMart批量数据导出:REST API与biomaRt

2815 字
14 分钟
ENSEMBL BioMart批量数据导出:REST API与biomaRt

ENSEMBL BioMart 提供基因功能注释的批量导出功能。除网页界面外,BioMart 还提供 REST API 和 R/Bioconductor 接口(biomaRt),可自动化批量获取 GO/KEGG/InterPro/PFAM 注释和跨物种同源基因信息。本文覆盖 Python(REST API)和 R(biomaRt)两种实现方式。

实测环境:Debian 12,Python 3.10 + requests,R 4.4 + biomaRt 2.60。

1. BioMart 是什么#

ENSEMBL BioMart 是 ENSEMBL 基因组数据库的查询接口。你可以把它想象成”基因组数据库的 SQL”,但不用写 SQL——选数据集(database)、选过滤器(filter)、选输出属性(attribute),它返回你想要的表。

BioMart 的数据结构是层次化的:

ENSEMBL Genes (mart)
├── 物种1 (dataset, 如 hsapiens_gene_ensembl)
│ ├── 基因属性 (attributes: gene_id, gene_name, description...)
│ └── 过滤器 (filters: chromosome, gene_name, biotype...)
├── 物种2
└── ...

一个具体的查询案例: “人类(hsapiens)中,染色体 1 上所有蛋白质编码基因(protein_coding),输出基因 ID、基因名、GO 注释。”

这在 BioMart 里对应的参数:

参数
martENSEMBL_MART_ENSEMBL
datasethsapiens_gene_ensembl
filterschromosome_name=1, biotype=protein_coding
attributesensembl_gene_id, external_gene_name, go_id

2. Python 方案——REST API#

BioMart 提供 RESTful API,返回 JSON/XML/TSV。Python 用 requests 库就能调。

2.1 查询基因基础信息#

#!/usr/bin/env python3
"""
fetch_gene_info.py - 通过 BioMart REST API 获取基因基础注释
Debian 12 实测 2025-12-05
"""
import requests
import json
import time
BASE_URL = "https://rest.ensembl.org"
def query_biomart(dataset, filters, attributes):
"""
通用 BioMart 查询函数
参数:
dataset: 数据集名,如 "hsapiens_gene_ensembl"
filters: 字典,过滤条件 {"filter_name": "value"}
attributes: 列表,输出属性 ["attr1", "attr2"]
返回:
list of dict,查询结果
"""
url = f"{BASE_URL}/biomart/martservice/results"
# 构造 XML payload(BioMart 的 REST API 用 XML 描述查询)
filter_xml = "".join(
f'<Filter name="{k}" value="{v}"/>'
for k, v in filters.items()
)
attr_xml = "".join(
f'<Attribute name="{a}"/>'
for a in attributes
)
payload = f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV"
header="1" uniqueRows="1"
datasetConfigVersion="0.6">
<Dataset name="{dataset}" interface="default">
{filter_xml}
{attr_xml}
</Dataset>
</Query>"""
headers = {"Content-Type": "application/x-www-form-urlencoded"}
resp = requests.post(
url,
data={"query": payload},
headers=headers,
timeout=120
)
if resp.status_code != 200:
raise Exception(f"API error {resp.status_code}: {resp.text}")
# TSV → 字典列表
lines = resp.text.strip().split("\n")
if len(lines) < 2:
return []
header = lines[0].split("\t")
results = []
for line in lines[1:]:
values = line.split("\t")
results.append(dict(zip(header, values)))
return results
# ===== 示例1:获取TP53及其靶基因的基础信息 =====
gene_list = ["TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A"]
filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
"ensembl_gene_id",
"external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"gene_biotype",
"description"
]
print("=== 查询基因基础信息 ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
for row in results:
gene = row.get("Gene name", "N/A")
chrom = row.get("Chromosome/scaffold name", "N/A")
start = row.get("Gene start (bp)", "N/A")
biotype = row.get("Gene type", "N/A")
desc = row.get("Gene description", "N/A")
# 计算基因长度
if start != "N/A" and row.get("Gene end (bp)", "N/A") != "N/A":
length = int(row["Gene end (bp)"]) - int(start) + 1
length_str = f"{length:,}bp"
else:
length_str = "N/A"
print(f" {gene}: {chrom}:{start}, {biotype}, {length_str}")
print(f" {desc[:80]}...")
time.sleep(0.5) # 礼貌性延迟,别打爆 API

2.2 批量获取 GO 注释#

# ===== 示例2:获取基因的GO注释 =====
filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
"ensembl_gene_id",
"external_gene_name",
"go_id",
"name_1006", # GO term 名称
"namespace_1003", # GO 类别(biological_process/等)
"go_linkage_type" # IEA/IDA/IMP...(证据类型)
]
print("\n=== 查询 GO 注释 ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
# 按基因分组整理
from collections import defaultdict
go_by_gene = defaultdict(list)
for row in results:
gene = row.get("Gene name", "N/A")
go_id = row.get("GO term accession", "")
go_name = row.get("GO term name", "")
go_ns = row.get("GO domain", "")
evidence = row.get("GO term evidence code", "")
if go_id:
go_by_gene[gene].append({
"id": go_id,
"name": go_name,
"namespace": go_ns,
"evidence": evidence
})
for gene, go_terms in go_by_gene.items():
print(f"\n {gene} ({len(go_terms)} GO terms):")
# 按 namespace 分组
by_ns = defaultdict(list)
for t in go_terms:
by_ns[t["namespace"]].append(t)
for ns, terms in sorted(by_ns.items()):
print(f" [{ns}]:")
for t in terms[:3]: # 每类只显示3个
print(f" {t['id']} {t['name']} ({t['evidence']})")

2.3 获取跨物种同源基因#

# ===== 示例3:人类→小鼠同源基因 =====
filters = {"hgnc_symbol": ",".join(gene_list)}
attributes = [
"ensembl_gene_id",
"external_gene_name",
"mmusculus_homolog_ensembl_gene",
"mmusculus_homolog_associated_gene_name",
"mmusculus_homolog_orthology_type",
"mmusculus_homolog_perc_id_r1"
]
print("\n=== 查询人类→小鼠同源基因 ===")
results = query_biomart("hsapiens_gene_ensembl", filters, attributes)
for row in results:
hgnc = row.get("Gene name", "N/A")
mouse_id = row.get("Mouse gene stable ID", "N/A")
mouse_name = row.get("Mouse gene name", "N/A")
ortho_type = row.get("Mouse orthology type", "N/A")
identity = row.get("Mouse %id. query gene identical to target gene", "N/A")
if mouse_id:
print(f" {hgnc}{mouse_name} ({ortho_type}, {identity}% identity)")

关于 API 响应的理解:

BioMart REST API 返回 TSV 时,列名是 BioMart 的内部名称,不是人类友好的名字。比如 "Gene name" 对应你请求的 external_gene_name"GO term accession" 对应 go_id建议在脚本里用 row.get("Gene name", "N/A") 而不是 row["external_gene_name"]——因为 ENSEMBL 的列名映射偶尔会变。

3. R 方案——biomaRt#

Python REST API 灵活但冗长。如果你已经在用 R 做分析,biomaRt 包 的接口更简洁——因为它帮你处理了 XML 构造和列名映射。

3.1 基础查询#

#!/usr/bin/env Rscript
# fetch_genes.R - 用 biomaRt 批量获取基因注释
# Debian 12 实测 2025-12-05
library(biomaRt)
# ===== 1. 连接 BioMart =====
# 列出可用的 mart
listEnsembl()
# 选择 ENSEMBL Genes mart
ensembl <- useEnsembl(
biomart = "genes",
dataset = "hsapiens_gene_ensembl"
)
# 确认连接成功
ensembl
# Object of class 'Mart':
# Using the ENSEMBL Genes dataset
# Using the hsapiens_gene_ensembl dataset

3.2 获取基因注释#

# ===== 2. 按基因名查询基础信息 =====
gene_list <- c("TP53", "MDM2", "CDKN1A", "BAX", "BCL2", "GADD45A")
# 查看可用属性
attrs <- listAttributes(ensembl)
head(attrs[, c("name", "description")], 10)
# 执行查询
gene_info <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"strand",
"gene_biotype",
"description"
),
filters = "hgnc_symbol",
values = gene_list,
mart = ensembl
)
# 计算基因长度
gene_info$gene_length_bp <- gene_info$end_position - gene_info$start_position + 1
print(gene_info[, c("external_gene_name", "chromosome_name",
"gene_biotype", "gene_length_bp")])
# external_gene_name chromosome_name gene_biotype gene_length_bp
# 1 CDKN1A 6 protein_coding 50481
# 2 BAX 19 protein_coding 7096
# 3 BCL2 18 protein_coding 198810
# 4 GADD45A 1 protein_coding 3017
# 5 MDM2 12 protein_coding 36675
# 6 TP53 17 protein_coding 26383

3.3 批量 GO 注释#

# ===== 3. 批量获取 GO 注释 =====
go_terms <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
"go_id",
"name_1006", # GO term 名称
"namespace_1003", # GO 类别
"go_linkage_type" # 证据代码
),
filters = "hgnc_symbol",
values = gene_list,
mart = ensembl
)
head(go_terms, 10)
# 统计每个基因的 GO 数量和证据类型分布
library(dplyr)
go_summary <- go_terms %>%
group_by(external_gene_name, namespace_1003) %>%
summarise(
n_terms = n(),
experimental = sum(go_linkage_type %in% c("IDA", "IMP", "IGI", "IPI")),
computational = sum(go_linkage_type %in% c("IEA", "ISS")),
.groups = "drop"
)
print(go_summary)

3.4 获取序列#

biomaRt 也可以直接拉基因序列:

# ===== 4. 获取基因序列 =====
# 查看可用的序列类型
listFilters(ensembl)[grep("seq", listFilters(ensembl)$name), ]
# 获取 cDNA 序列
cdna_seqs <- getSequence(
id = gene_info$ensembl_gene_id,
type = "ensembl_gene_id",
seqType = "cdna",
mart = ensembl
)
# 写入 FASTA
write.fasta <- function(ids, seqs, file) {
con <- file(file, "w")
for (i in seq_along(ids)) {
writeLines(paste0(">", ids[i]), con)
writeLines(seqs[i], con)
}
close(con)
}
# 只在有序列数据时才写入
if (nrow(cdna_seqs) > 0) {
write.fasta(cdna_seqs$ensembl_gene_id, cdna_seqs$cdna,
"gene_cdna.fasta")
cat(sprintf("Wrote %d cDNA sequences to gene_cdna.fasta\n",
nrow(cdna_seqs)))
}
# 获取蛋白序列
protein_seqs <- getSequence(
id = gene_info$ensembl_gene_id,
type = "ensembl_gene_id",
seqType = "peptide",
mart = ensembl
)

3.5 跨物种同源基因(人类→小鼠)#

# ===== 5. 人类→小鼠同源基因 =====
mouse_homologs <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
# 小鼠同源基因属性
"mmusculus_homolog_ensembl_gene",
"mmusculus_homolog_associated_gene_name",
"mmusculus_homolog_orthology_type",
"mmusculus_homolog_perc_id_r1",
"mmusculus_homolog_perc_id",
# 同源关系类型
"mmusculus_homolog_orthology_confidence"
),
filters = "hgnc_symbol",
values = gene_list,
mart = ensembl
)
print(mouse_homologs)

3.6 批量处理——上千个基因的 ID 转换#

这是生信里最常用的场景之一:有一列 Ensembl ID,要转成 Gene Symbol:

# ===== 6. ID 批量转换 =====
# 模拟一批 Ensembl ID
ensembl_ids <- c(
"ENSG00000141510", # TP53
"ENSG00000135679", # MDM2
"ENSG00000124762", # CDKN1A
"ENSG00000087088", # BAX
"ENSG00000171791", # BCL2
"ENSG00000116717" # GADD45A
)
id_map <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
"entrezgene_id",
"uniprotswissprot",
"hgnc_id"
),
filters = "ensembl_gene_id",
values = ensembl_ids,
mart = ensembl
)
print(id_map)
# ensembl_gene_id external_gene_name entrezgene_id uniprotswissprot hgnc_id
# 1 ENSG00000087088 BAX 581 Q07812 990
# 2 ENSG00000116717 GADD45A 1647 P24522 4095
# 3 ENSG00000124762 CDKN1A 1026 P38936 1784
# 4 ENSG00000135679 MDM2 4193 NA 6973
# 5 ENSG00000141510 TP53 7157 P04637 11998
# 6 ENSG00000171791 BCL2 596 P10415 990

4. BioMart 查询的高阶用法#

4.1 根据 GO Term 反查基因#

“哪些基因有 GO:0006915(凋亡过程)注释?”

genes_with_go <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
"go_id",
"name_1006"
),
filters = "go",
values = "GO:0006915",
mart = ensembl
)
cat(sprintf("Found %d genes with GO:0006915\n", nrow(genes_with_go)))
head(genes_with_go)

注意: GO 过滤器会匹配该 GO term 及其所有子 term。所以查询 GO:0006915 实际返回的范围可能比你预想的大很多。GO 的层级结构导致实际基因数 = 直接注释数 + 所有子 term 的注释数:

Ntotal=t{query_termdescendants}Nannotations(t)N_{total} = \sum_{t \in \{query\_term \cup descendants\}} N_{annotations}(t)

4.2 基因组区间查询#

“chr1:1000000-2000000 区间内有哪些蛋白编码基因?“

genes_in_region <- getBM(
attributes = c(
"ensembl_gene_id",
"external_gene_name",
"chromosome_name",
"start_position",
"end_position",
"gene_biotype"
),
filters = c("chromosome_name", "start", "end"),
values = list(
chromosome_name = "1",
start = "1000000",
end = "2000000"
),
mart = ensembl
)
cat(sprintf("Found %d genes in chr1:1000000-2000000\n", nrow(genes_in_region)))

4.3 按 biotype 过滤——只拉 lncRNA#

lncrna_genes <- getBM(
attributes = c("ensembl_gene_id", "external_gene_name", "gene_biotype"),
filters = "biotype",
values = "lncRNA",
mart = ensembl
)
cat(sprintf("Total lncRNA genes: %d\n", nrow(lncrna_genes)))

5. 踩坑记录#

坑1:ENSEMBL 版本——GRCh37 vs GRCh38#

症状:用 GRCh38 坐标查询基因,结果为空或偏移。

原因:ENSEMBL 默认数据集使用最新版基因组(GRCh38)。如果你的 BED 文件是 GRCh37(hg19),坐标完全不匹配。

解决: 使用 useEnsembl()version 参数指定旧版:

# GRCh37 对应 ENSEMBL 75
ensembl_grch37 <- useEnsembl(
biomart = "genes",
dataset = "hsapiens_gene_ensembl",
version = 75 # GRCh37
)
# GRCh38 对应 ENSEMBL 最新版(不指定version)
ensembl_grch38 <- useEnsembl(
biomart = "genes",
dataset = "hsapiens_gene_ensembl"
)

坑2:REST API 返回 429 Too Many Requests#

BioMart REST API 没有官方文档写速率限制,但实测连续快发 > 15 个请求后会触发 429。

解决:

import time
# 每次请求后至少等0.5秒
time.sleep(0.5)
# 或者用重试逻辑
import time
max_retries = 3
for attempt in range(max_retries):
resp = requests.post(url, ...)
if resp.status_code == 429:
wait = 2 ** attempt # 指数退避
print(f"Rate limited, waiting {wait}s...")
time.sleep(wait)
elif resp.status_code == 200:
break

坑3:某些基因查不到注释#

getBM 返回空行或基因数量少于输入——新手常以为是 bug,实际是正常行为。

原因:

  • 该基因在当前 ENSEMBL 版本中已废弃(retired)
  • 该 filter 条件下该基因没有对应属性(比如查 lncRNA 的蛋白序列)
  • Ensembl ID 版本号不对(ENSG00000141510 vs ENSG00000141510.16)

排查:

# 检查哪些输入基因没命中
input_genes <- c("TP53", "NONEXISTENT", "MDM2")
results <- getBM(
attributes = c("hgnc_symbol"),
filters = "hgnc_symbol",
values = input_genes,
mart = ensembl
)
found <- unique(results$hgnc_symbol)
missing <- setdiff(input_genes, found)
if (length(missing) > 0) {
cat("Missing genes:", paste(missing, collapse = ", "), "\n")
}

坑4:BioMart 连接超时——国内网络#

ENSEMBL 的 API 服务器在欧洲,国内直连偶发超时。症状:useEnsembl()requests.post() 报 connection timeout。

解决:

# biomaRt 设置镜像
ensembl <- useEnsembl(
biomart = "genes",
dataset = "hsapiens_gene_ensembl",
mirror = "uswest" # 或 "asia", "useast"
)

Python 的话直接修改 rest.ensembl.orguseast.ensembl.orgasia.ensembl.org

BASE_URL = "https://useast.ensembl.org" # 美国东部镜像

坑5:biomaRt 的 getSequence 对大型查询极慢#

症状:1000 个基因的序列查询跑了 10 分钟还没出结果。

原因: getSequence 是逐个基因发请求,不是批量查询。1000 个基因 = 1000 次 HTTP 请求。

解决:

  • 分批查询,每批 50 个:
batch_size <- 50
results_list <- list()
for (i in seq(1, length(gene_list), batch_size)) {
batch <- gene_list[i:min(i+batch_size-1, length(gene_list))]
results_list[[length(results_list)+1]] <- getSequence(
id = batch, type = "hgnc_symbol",
seqType = "cdna", mart = ensembl
)
}
all_sequences <- do.call(rbind, results_list)
  • 或者直接用 ENSEMBL FTP 下载整个物种的 cDNA 文件(更快)。

坑6:getBM 返回重复行#

症状:输入 6 个基因,返回 60 行,每个基因出现 10 次。

原因: 你请求的属性中包含”一对多”关系的字段。比如一个基因有 10 个 GO term,结果就有 10 行。

这不是 bug,是关系型数据的正常表现。但如果你期望一行一个基因,先检查哪些属性是”一对多”的:

# 查看属性描述中的"multi-valued"标记
listAttributes(ensembl)[grep("GO", listAttributes(ensembl)$name),
c("name", "description")]

坑7:Python REST API 的 XML 格式严格#

症状:请求返回 400 Bad Request,但 XML 看起来没问题。

常见 XML 错误:

  • & 没有转义成 &amp;(基因名里的 & 极少但不排除)
  • 过滤器值为空字符串
  • <Attribute> 拼错属性名

排查方法:

# 保存 XML 到文件人工检查
with open("debug_query.xml", "w") as f:
f.write(payload)
# 用 ENSEMBL 的 XML 验证工具
# 浏览器打开 https://www.ensembl.org/biomart/martview
# 点 "XML" 按钮粘贴你的 payload 验证

6. 总结#

任务R (biomaRt)Python (REST API)
基因注释查询getBM() 一行搞定需构造 XML,30行
GO/KEGG 注释getBM() + 批量化XML filter 更灵活
序列获取getSequence() 慢但简单REST API 或 FTP
ID 转换getBM() 最方便XML payload 较长
大规模查询分批循环可多线程并行
国内网络mirror="asia"换 useast 镜像

选 Python 还是 R: 如果你的下游分析是 R(DEseq2、ggplot2),用 biomaRt 最省事。如果你在写生产级 Pipeline(Python/Snakemake),REST API 更可控。


本文于 2025-12-05 在 Debian 12 上实测完成。biomaRt 2.60.1,Python 3.10 + requests 2.31,ENSEMBL release 113。所有代码可复制运行。

文章分享

如果这篇文章对你有帮助,欢迎分享给更多人!

ENSEMBL BioMart批量数据导出:REST API与biomaRt
https://fg.ink/posts/ensembl-biomart-api/
作者
风观
发布于
2025-02-01
许可协议
CC BY-NC-SA 4.0
Profile Image of the Author
风观
风有来路,观有所思
分类
标签
站点统计
文章
50
分类
1
标签
29
总字数
61,837
运行时长
0
最后活动
0 天前

文章目录