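Biopython Sequence Handling: File I/O and Fetching Data from NCBI

There are three levels of sequence handling in Python for bioinformatics: (1) hand-written open().read().split(), which is painful; (2) Biopython, which is elegant; (3) pysam/pyfaidx, which push performance to the limit.

For day-to-day analysis, Biopython strikes the best balance.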

1. Installation

pip install biopython
# or
conda install -c conda-forge biopython

import Bio
print(Bio.__version__)  # expect 1.84 or later

2. SeqIO: one interface for reading FASTA/FASTQ

import gzip

from Bio import SeqIO

# Read FASTA
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence length: {len(record.seq)}")
    print(f"First 20 bp: {record.seq[:20]}")
    break

# Read gzipped FASTQ (SeqIO does not decompress automatically; open the file in text mode with gzip)
with gzip.open("sample.fastq.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        quals = record.letter_annotations["phred_quality"]
        print(f"Mean quality score: {sum(quals) / len(quals):.1f}")
        break

# Load everything at once (small files only)
records = list(SeqIO.parse("genome.fa", "fasta"))
print(f"Total sequences: {len(records)}")

# Filter by a condition and write out
with open("long_seqs.fasta", "w") as out:
    long_records = (r for r in SeqIO.parse("input.fa", "fasta") if len(r.seq) >= 1000)
    SeqIO.write(long_records, out, "fasta")
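
One detail worth spelling out: a .gz file has to be decompressed before a text parser can read it, and the Phred scores in a FASTQ record are just the quality characters shifted by an offset. The sketch below uses only the standard library and a made-up single read (the file name demo.fastq.gz and the read itself are illustrative assumptions). It assumes the common Phred+33 encoding and is not how Biopython implements its parser.

```python
import gzip
import os
import tempfile

# Hypothetical single-read FASTQ, Phred+33 encoded ('I' -> 40, '5' -> 20).
fastq_text = "@read1\nACGT\n+\nII55\n"

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fastq.gz")

# Write a gzipped FASTQ in text mode.
with gzip.open(path, "wt") as fh:
    fh.write(fastq_text)

# Read it back: "rt" yields decoded text lines rather than raw bytes.
with gzip.open(path, "rt") as fh:
    header, bases, plus, quality = (line.rstrip("\n") for line in fh)

# Decode Phred+33: subtract 33 from each character's code point.
phred = [ord(ch) - 33 for ch in quality]
mean_quality = sum(phred) / len(phred)
print(header, bases, phred, f"{mean_quality:.1f}")
```

A handle opened with gzip.open(path, "rt") can be passed to SeqIO.parse in place of a file name.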

3. Seq objects: sequence operations

from Bio.Seq import Seq

seq = Seq("ATCGATCGATCG")
print(seq.complement())          # TAGCTAGCTAGC
print(seq.reverse_complement())  # CGATCGATCGAT
print(seq.transcribe())          # AUCGAUCGAUCG
print(seq.translate())           # IDRS (amino acids)

# GC content
gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
print(f"GC%: {gc:.1f}%")
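
To make the two most common operations concrete, here is a minimal plain-Python sketch of reverse complement and GC percentage. The function names are hypothetical, and it only handles the four unambiguous bases (no IUPAC ambiguity codes, which the Seq object does support). Unlike the one-line formula, it also guards against an empty sequence, where dividing by the length would raise ZeroDivisionError.

```python
# Translation table for the four unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.translate(_COMPLEMENT)[::-1]


def gc_percent(dna: str) -> float:
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    if not dna:
        return 0.0
    upper = dna.upper()
    return (upper.count("G") + upper.count("C")) / len(upper) * 100


seq = "ATCGATCGATCG"
print(reverse_complement(seq))   # CGATCGATCGAT
print(f"{gc_percent(seq):.1f}")  # 50.0
```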

4. Entrez: fetching data from NCBI

from Bio import Entrez, SeqIO

# Set your email (required by NCBI's usage policy)
Entrez.email = "your@email.com"

# Search the SRA database
handle = Entrez.esearch(db="sra", term="RNA-seq human brain", retmax=5)
record = Entrez.read(handle)
handle.close()
print(f"Found {record['Count']} records")
print(f"IDs: {record['IdList']}")

# Fetch a single sequence
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
seq_record = SeqIO.read(handle, "fasta")
handle.close()
print(f"Gene: {seq_record.description}")

Note: NCBI rate-limits Entrez requests (at most 3 requests per second without an API key). For batch queries, add sleep(0.5) between calls.
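
One way to apply that pause consistently is to wrap the list of IDs in a small generator that spaces out iterations. This is a sketch only: throttled is a hypothetical helper, not part of Biopython, and the IDs below are placeholders.

```python
import time


def throttled(items, min_interval=0.5):
    """Yield items so that consecutive yields are at least min_interval seconds apart."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        last = time.monotonic()
        yield item


# Placeholder IDs; a short interval keeps the demo fast.
ids = ["id_1", "id_2", "id_3"]
start = time.monotonic()
seen = [x for x in throttled(ids, min_interval=0.05)]
elapsed = time.monotonic() - start
print(seen, f"{elapsed:.2f}s")
```

In a real batch job the loop body would hold the Entrez.efetch call, with min_interval left at 0.5 to stay under three requests per second.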

5. Working with multiple sequence alignment files

from Bio import AlignIO

# Read CLUSTAL format
alignment = AlignIO.read("alignment.clustal", "clustal")
print(f"Number of sequences: {len(alignment)}")
print(f"Alignment length: {alignment.get_alignment_length()}")

# Count conserved sites (columns where every residue is identical)
conserved = 0
for i in range(alignment.get_alignment_length()):
    col = alignment[:, i]
    if len(set(col)) == 1:
        conserved += 1
print(f"Conserved sites: {conserved}/{alignment.get_alignment_length()}")
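
The column test above can be seen in isolation on plain strings. The sketch below uses made-up aligned rows and a hypothetical count_conserved helper. It also adds one assumption the loop above does not make: a column consisting only of gap characters is not counted as conserved.

```python
def count_conserved(rows, ignore_gap_columns=True):
    """Count alignment columns in which every row carries the same character."""
    conserved = 0
    for column in zip(*rows):  # zip(*rows) walks the alignment column by column
        residues = set(column)
        if len(residues) != 1:
            continue
        if ignore_gap_columns and residues == {"-"}:
            continue
        conserved += 1
    return conserved


# Hypothetical aligned rows; all rows must have the same length.
rows = ["ATG-C", "ATGAC", "TTG-C"]
print(f"Conserved sites: {count_conserved(rows)}/{len(rows[0])}")
```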

6. Handy scripts

Batch statistics for FASTA files

import glob
from Bio import SeqIO

for f in glob.glob("*.fasta"):
    records = list(SeqIO.parse(f, "fasta"))
    total_len = sum(len(r.seq) for r in records)
    print(f"{f}: {len(records)} seqs, {total_len:,} bp")

Extracting CDS sequences from GenBank

from Bio import SeqIO

for record in SeqIO.parse("sequence.gb", "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            gene = feature.qualifiers.get("gene", ["unknown"])[0]
            seq = feature.extract(record.seq)
            print(f">{gene}\n{seq}")

7. Pitfalls

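Pitfall 1: SeqIO.parse returns an iterator, so it can only be consumed once. To go over the records more than once, convert it to a list() first.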

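Pitfall 2: always set Entrez.email. Without it, Biopython emits a warning, and NCBI may throttle or block your requests without being able to contact you first.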

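Pitfall 3: stop codons during translation. By default translate() does not stop at a stop codon; it translates to the end of the sequence and marks each stop with *. To stop at the first in-frame stop codon, use translate(to_stop=True).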
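
Pitfall 1 is ordinary Python iterator behavior and can be reproduced without Biopython. In the sketch below, parse_records is a hypothetical stand-in for a lazy parser, not SeqIO itself; it shows the second pass coming back empty and the list() workaround.

```python
def parse_records(lines):
    """Hypothetical lazy parser: yields one record ID per FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


fasta_lines = [">seq1", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]

records = parse_records(fasta_lines)
first_pass = list(records)   # consumes the iterator
second_pass = list(records)  # nothing left to yield
print(len(first_pass), len(second_pass))

# Workaround: materialize once, then reuse the list as often as needed.
cached = list(parse_records(fasta_lines))
print(len(cached), len(cached))
```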

Tested on 2025-05-28 with Biopython 1.84.