Biopython序列处理:文件读写与NCBI数据获取

533 字
3 分钟
Biopython序列处理:文件读写与NCBI数据获取

Python 在生信中处理序列有三个层次:(1) 手写 open().read().split()——痛苦;(2) Biopython——优雅;(3) pysam/pyfaidx——极致性能。

对日常分析来说,Biopython 是最佳平衡点。

1. 安装#

Terminal window
pip install biopython
# 或
conda install -c conda-forge biopython
import Bio
print(Bio.__version__) # 1.84+

2. SeqIO——读FASTA/FASTQ只此一招#

from Bio import SeqIO
# 读FASTA
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(f"ID: {record.id}")
print(f"序列长度: {len(record.seq)}")
print(f"前20bp: {record.seq[:20]}")
break
# 读FASTQ(压缩格式自动识别)
for record in SeqIO.parse("sample.fastq.gz", "fastq"):
print(f"质量分数平均: {sum(record.letter_annotations['phred_quality'])/len(record.seq):.1f}")
break
# 一次性全部读入(小文件)
records = list(SeqIO.parse("genome.fa", "fasta"))
print(f"总序列数: {len(records)}")
# 按条件筛选并写出
with open("long_seqs.fasta", "w") as out:
long_records = (r for r in SeqIO.parse("input.fa", "fasta") if len(r.seq) >= 1000)
SeqIO.write(long_records, out, "fasta")

3. Seq 对象——序列操作#

from Bio.Seq import Seq
seq = Seq("ATCGATCGATCG")
print(seq.complement()) # TAGCTAGCTAGC
print(seq.reverse_complement()) # CGATCGATCGAT
print(seq.transcribe()) # AUCGAUCGAUCG
print(seq.translate()) # IDRS (氨基酸)
# GC含量
gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
print(f"GC%: {gc:.1f}%")

4. Entrez——从NCBI获取数据#

from Bio import Entrez
# 必须设email(NCBI强制要求)
Entrez.email = "your@email.com"
# 搜索SRA数据库
handle = Entrez.esearch(db="sra", term="RNA-seq human brain", retmax=5)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")
# 获取单个序列
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
seq_record = SeqIO.read(handle, "fasta")
handle.close()
print(f"Gene: {seq_record.description}")

注意: NCBI 对 Entrez 有频率限制(每秒不超过 3 次请求,不带 API key 的话)。批量查询时加 sleep(0.5)

5. 多序列比对文件处理#

from Bio import AlignIO
# 读CLUSTAL格式
alignment = AlignIO.read("alignment.clustal", "clustal")
print(f"序列数: {len(alignment)}")
print(f"比对长度: {alignment.get_alignment_length()}")
# 提取保守位点(某列完全相同)
conserved = 0
for i in range(alignment.get_alignment_length()):
col = alignment[:, i]
if len(set(col)) == 1:
conserved += 1
print(f"保守位点: {conserved}/{alignment.get_alignment_length()}")

6. 实用脚本#

批量统计FASTA文件#

import glob
from Bio import SeqIO
for f in glob.glob("*.fasta"):
records = list(SeqIO.parse(f, "fasta"))
total_len = sum(len(r.seq) for r in records)
print(f"{f}: {len(records)} seqs, {total_len:,} bp")

从GenBank提取CDS序列#

from Bio import SeqIO
for record in SeqIO.parse("sequence.gb", "genbank"):
for feature in record.features:
if feature.type == "CDS":
gene = feature.qualifiers.get("gene", ["unknown"])[0]
seq = feature.extract(record.seq)
print(f">{gene}\n{seq}")

7. 踩坑#

坑1:SeqIO.parse 是生成器——只能迭代一次。要多次用先转成 list()

坑2:Entrez.email 必须设置——不设会报 400 错误。

坑3:序列转译时的终止密码子——translate() 默认遇到终止密码子就停止(用 * 表示)。如果要译到末尾,用 translate(to_stop=False)


本文于 2025-05-28 实测。Biopython 1.84。

文章分享

如果这篇文章对你有帮助,欢迎分享给更多人!

Biopython序列处理:文件读写与NCBI数据获取
https://fg.ink/posts/biopython-sequence-processing/
作者
风观
发布于
2025-12-01
许可协议
CC BY-NC-SA 4.0
Profile Image of the Author
风观
风有来路,观有所思
分类
标签
站点统计
文章
50
分类
1
标签
29
总字数
61,837
运行时长
0
最后活动
0 天前

文章目录