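Bash Loops and Conditionals: A Complete Guide to for/if/case
for and if are the two most frequently used control structures in shell scripts. Batch-processing FASTQ files, checking alignment status, iterating over chromosomes, rerunning failed samples: almost every bioinformatics script depends on loops and conditionals. This article covers for/while/until loops, if/case branching, a test command cheat sheet, and 6 batch-processing loop templates.
Test environment: Debian 12, Bash 5.2.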
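1. for loops: the foundation of batch processing in bioinformatics
1.1 Basic syntax

    # Style 1: iterate over a list (most common)
    for sample in sample1 sample2 sample3; do
        echo "Processing ${sample}"
    done

    # Style 2: brace expansion
    for i in {1..10}; do
        echo "Round ${i}"
    done

    # Style 3: C-style
    for ((i=1; i<=10; i++)); do
        echo "Index: ${i}"
    done

    # Style 4: command substitution
    for file in $(ls *.fastq.gz); do
        echo "Found: ${file}"
    done

Styles 1 and 3 are strongly recommended. The $(ls) in style 4 has a classic problem: its output is word-split, so a filename containing spaces is broken into several pieces. Safe iteration is covered later (pitfall 4 in section 6).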
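To make the word-splitting problem concrete, here is a minimal sketch that counts loop iterations for both approaches. The temporary directory and the two filenames are made up for illustration; only the difference in counts matters.

```shell
#!/usr/bin/env bash
# Hypothetical demo files: one name contains a space, one does not.
demo_dir=$(mktemp -d)
touch "${demo_dir}/sample A_R1.fastq.gz" "${demo_dir}/sampleB_R1.fastq.gz"

# Unsafe: the unquoted $(ls ...) output is split on whitespace.
ls_count=0
for file in $(ls "${demo_dir}"); do
    ls_count=$((ls_count + 1))
done

# Safe: the glob expands to exactly one word per file, spaces included.
glob_count=0
for file in "${demo_dir}"/*.fastq.gz; do
    glob_count=$((glob_count + 1))
done

echo "ls loop iterations:   ${ls_count}"    # 3: "sample" and "A_R1.fastq.gz" are separate words
echo "glob loop iterations: ${glob_count}"  # 2: one per file
rm -rf "${demo_dir}"
```

Two files produce three iterations in the ls loop, which is why the templates below iterate over globs or arrays instead.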
1.2 Six for-loop templates for bioinformatics
Template 1: batch-process paired FASTQ files

    #!/bin/bash
    set -euo pipefail

    # Assumed filename format: Sample_S1_L001_R1_001.fastq.gz
    R1_FILES=(*_R1_*.fastq.gz)
    mkdir -p clean reports    # make sure the output directories exist

    for r1 in "${R1_FILES[@]}"; do
        r2="${r1/_R1_/_R2_}"            # replace R1 with R2
        sample_id="${r1%%_R1*}"         # extract the sample ID
        sample_id="${sample_id##*/}"    # strip the path, if any

        echo "=== Processing ${sample_id} ==="

        # fastp quality control
        fastp -i "${r1}" -I "${r2}" \
            -o "clean/${sample_id}_R1.fastq.gz" \
            -O "clean/${sample_id}_R2.fastq.gz" \
            -j "reports/${sample_id}_fastp.json" \
            -h "reports/${sample_id}_fastp.html" \
            -w 8

        echo "Done: ${sample_id}"
    done

Template 2: per-BAM statistics
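
    mkdir -p stats    # make sure the output directory exists

    for bam in alignments/*.bam; do
        sample=$(basename "${bam}" .bam)

        # flagstat summary
        samtools flagstat "${bam}" > "stats/${sample}_flagstat.txt"

        # Mean depth; the count check avoids a division by zero on empty output
        samtools depth -a "${bam}" | \
            awk '{sum+=$3; count++} END {if (count) print "Mean depth:", sum/count; else print "Mean depth: NA"}' \
            > "stats/${sample}_depth.txt"

        echo "${sample}: $(cat "stats/${sample}_depth.txt")"
    done

Template 3: per-chromosome analysis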

    CHROMS=(chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 \
            chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 \
            chr21 chr22 chrX chrY chrM)

    INPUT_VCF="merged_variants.vcf.gz"    # must be indexed for -r region queries
    mkdir -p by_chrom

    for chrom in "${CHROMS[@]}"; do
        echo "Splitting ${chrom}..."

        bcftools view -r "${chrom}" "${INPUT_VCF}" \
            -o "by_chrom/${chrom}.vcf.gz" -O z

        bcftools index "by_chrom/${chrom}.vcf.gz"
    done

Template 4: a loop with a counter, to show progress
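
    files=(*.fastq.gz)
    total=${#files[@]}
    current=0
    mkdir -p clean

    for f in "${files[@]}"; do
        ((current++))
        echo "[${current}/${total}] Processing ${f}..."

        fastp -i "${f}" -o "clean/${f}" -w 4
    done

    echo "All ${total} files processed!"

${#files[@]} gives the array length, and ((current++)) is an arithmetic increment. This pattern is especially useful with hundreds of samples, because you always know how far the run has progressed. One caveat: ((current++)) returns a non-zero exit status when current is 0, so in a script that uses set -e, write current=$((current + 1)) instead.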
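The exit status of an arithmetic increment is easy to overlook, so here is a small sketch that records it for three ways of counting. The variable names are illustrative only; the `|| var=$?` form is used so the snippet itself keeps running even where `set -e` is active.

```shell
#!/usr/bin/env bash
current=0

# Post-increment evaluates to the OLD value (0), so (( )) returns status 1.
post_status=0
(( current++ )) || post_status=$?

# Pre-increment evaluates to the NEW value (2), so (( )) returns status 0.
pre_status=0
(( ++current )) || pre_status=$?

# A plain assignment with $(( )) returns status 0 whatever the value is.
safe=0
safe=$((safe + 1))
assign_status=$?

echo "post=${post_status} pre=${pre_status} assign=${assign_status} current=${current}"
# prints: post=1 pre=0 assign=0 current=2
```

Under `set -e`, the first form would stop the script on the very first iteration of a loop that starts counting from 0, which is why the assignment form is the safer default in strict-mode scripts.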
Template 5: nested loops for combining conditions

    SAMPLES=(WT_1 WT_2 KO_1 KO_2)
    TOOLS=(bwa bowtie2 hisat2)

    for sample in "${SAMPLES[@]}"; do
        for tool in "${TOOLS[@]}"; do
            outdir="results/${tool}/${sample}"
            mkdir -p "${outdir}"

            echo "Aligning ${sample} with ${tool}..."
            # each tool needs its own command here...
        done
    done

Template 6: for loop with conditional skipping (essential for troubleshooting and reruns)
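
    FAILED_SAMPLES=()    # record the failures
    mkdir -p variants

    for bam in *.bam; do
        sample=$(basename "${bam}" .bam)

        # Skip if the result already exists and is newer than the BAM
        if [[ -f "variants/${sample}.vcf.gz" ]] && \
           [[ "variants/${sample}.vcf.gz" -nt "${bam}" ]]; then
            echo "Skipping ${sample} (already done)"
            continue
        fi

        echo "Calling variants for ${sample}..."

        # -Oz writes compressed output so the content matches the .vcf.gz name.
        # Without `set -o pipefail`, only the status of bcftools call is tested.
        if ! bcftools mpileup -f ref.fa "${bam}" | \
             bcftools call -mv -Oz -o "variants/${sample}.vcf.gz"; then
            echo "ERROR: ${sample} failed!"
            FAILED_SAMPLES+=("${sample}")
        fi
    done

    echo "Failed samples: ${FAILED_SAMPLES[@]:-None}"

continue skips the rest of the current iteration; break exits the whole loop. ${FAILED_SAMPLES[@]:-None} is the default-value form of parameter expansion: when the array is empty, it expands to None.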
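A compact way to see continue, break, and the `:-` default side by side is the sketch below. The sample names are hypothetical, and "bad" and "stop" simply stand in for a failed sample and a fatal condition.

```shell
#!/usr/bin/env bash
# Hypothetical sample list; "bad" and "stop" are placeholders for real conditions.
samples=(s1 bad s2 stop s3)
processed=()
failed=()

for s in "${samples[@]}"; do
    if [[ "${s}" == "bad" ]]; then
        failed+=("${s}")
        continue          # skip the rest of this iteration only
    fi
    if [[ "${s}" == "stop" ]]; then
        break             # leave the loop; s3 is never visited
    fi
    processed+=("${s}")
done

empty=()
echo "Processed: ${processed[*]}"          # s1 s2
echo "Failed: ${failed[*]:-None}"          # bad
echo "Empty list: ${empty[*]:-None}"       # None
```

The last line shows the default kicking in only when the array has no elements, which is what makes it handy for end-of-run summaries.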
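2. while loops: handling dynamic input
2.1 Reading a file line by line

    # Read a sample list
    while IFS= read -r sample; do
        [[ -z "${sample}" || "${sample}" == \#* ]] && continue    # skip blank lines and comments
        echo "Processing: ${sample}"
        # ...processing logic
    done < sample_list.txt

Never drop the IFS= read -r combination: -r stops backslashes from being treated as escape characters, and IFS= preserves leading and trailing whitespace. Countless people have been bitten by this.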
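The effect of the two options can be checked directly. The sketch below reads the same made-up line twice, once with a bare read and once with IFS= read -r, and prints both results in brackets so the whitespace is visible.

```shell
#!/usr/bin/env bash
# A made-up input line: two leading spaces, a literal backslash, two trailing spaces.
line_in='  WT_1\tbatch  '

# Bare read: surrounding whitespace is trimmed and the backslash is consumed.
read plain <<< "${line_in}"

# IFS= and -r: the line comes through unchanged.
IFS= read -r exact <<< "${line_in}"

printf 'plain: [%s]\n' "${plain}"    # [WT_1tbatch]
printf 'exact: [%s]\n' "${exact}"    # [  WT_1\tbatch  ]
```

For sample sheets this matters most with Windows-style paths or IDs that contain backslashes, where a bare read silently alters the value.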
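2.2 while in a pipeline: watch out for the subshell

    # ✗ Wrong: a while loop in a pipeline runs in a subshell, so its variables never come back
    total=0
    cat counts.txt | while read -r n; do
        ((total += n))
    done
    echo "${total}"    # prints 0, not what you wanted!

    # ✓ Right: use redirection instead of a pipe
    total=0
    while read -r n; do
        ((total += n))
    done < counts.txt
    echo "${total}"    # prints the correct sum

I fell into this one at least three times before it stuck. A pipe creates a subshell, and variable changes made inside the while loop do not survive it.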
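For anyone who wants to reproduce the difference without preparing counts.txt, here is a self-contained sketch. It writes three made-up numbers to a temporary file and sums them both ways; it assumes Bash's default settings (in particular, that the lastpipe option is off).

```shell
#!/usr/bin/env bash
# Temporary stand-in for counts.txt with three illustrative values.
counts_file=$(mktemp)
printf '%s\n' 3 5 7 > "${counts_file}"

# Pipeline version: the loop body runs in a subshell, so the sum is lost.
piped_total=0
cat "${counts_file}" | while read -r n; do
    piped_total=$((piped_total + n))
done

# Redirection version: the loop runs in the current shell.
redirected_total=0
while read -r n; do
    redirected_total=$((redirected_total + n))
done < "${counts_file}"

echo "piped=${piped_total} redirected=${redirected_total}"    # piped=0 redirected=15
rm -f "${counts_file}"
```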
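2.3 Infinite loop with a conditional exit

    # Poll until a file appears
    while true; do
        if [[ -f "pipeline_complete.flag" ]]; then
            echo "Pipeline finished!"
            break
        fi
        echo "Waiting..."
        sleep 60
    done

2.4 until: loop while the condition is false

    # Wait until disk usage drops below 20% (--output is a GNU df option)
    until [[ $(df /data --output=pcent | tail -1 | tr -d ' %') -lt 20 ]]; do
        echo "Disk usage still >=20%, waiting..."
        sleep 300
    done
    echo "Disk OK, starting download"

3. if conditionals: the decision hub of a bioinformatics pipeline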
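3.1 test command cheat sheet
In Bash, if is followed by a command (usually test or its shorthand [ ]) and branches on that command's exit status (0 = true, non-zero = false):
| Test type | Syntax | Bioinformatics use case |
|---|---|---|
| File exists | [[ -f file ]] | Check that the reference genome exists |
| Directory exists | [[ -d dir ]] | Make sure the output directory has been created |
| File is non-empty | [[ -s file ]] | Check that a BAM is not empty |
| File is readable | [[ -r file ]] | Check permissions |
| file1 is newer than file2 | [[ file1 -nt file2 ]] | Skip completed steps in incremental runs |
| Strings are equal | [[ "$a" == "$b" ]] | Match a sample type |
| String is non-empty | [[ -n "$str" ]] | Check whether an argument was passed |
| Numeric comparison (integers only) | [[ $a -gt $b ]] | Sequencing depth threshold |
| Regex match | [[ "$a" =~ ^SRR ]] | Validate an SRA ID format |
In Bash scripts, strongly prefer [[ ]] over [ ]: [[ ]] is part of Bash's own syntax (a shell keyword), supports regex matching, does not word-split unquoted variables, and does not break on empty variables. It is not POSIX, so scripts meant to run under plain sh should stay with [ ].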
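The difference between the two bracket forms shows up as soon as a variable is empty. The following sketch captures the exit status of each form and also pulls a capture group out of a regex match through BASH_REMATCH; the accession value is a placeholder.

```shell
#!/usr/bin/env bash
a=""

# [ ] with an unquoted empty variable collapses to `[ == hello ]`,
# which is an error (non-zero status) rather than a clean "false".
single_status=0
[ $a == "hello" ] 2>/dev/null || single_status=$?

# [[ ]] does not word-split, so the empty string is compared normally (status 1 = false).
double_status=0
[[ $a == "hello" ]] || double_status=$?

# Regex match with a capture group; BASH_REMATCH[1] holds the first group.
run_id="SRR12345678"    # hypothetical accession
prefix=""
if [[ "${run_id}" =~ ^(SRR|ERR|DRR)[0-9]+$ ]]; then
    prefix="${BASH_REMATCH[1]}"
fi

echo "single=${single_status} double=${double_status} prefix=${prefix}"
```

An error status and a false status both take the else branch of an if, so the [ ] failure can go unnoticed unless stderr is being watched.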
3.2 Six conditional templates for bioinformatics
Template 1: verify the inputs before starting the pipeline

    REF="/opt/refs/hg38.fa"
    R1="sample_R1.fastq.gz"
    R2="sample_R2.fastq.gz"

    if [[ ! -f "${REF}" ]]; then
        echo "ERROR: Reference genome not found: ${REF}"
        exit 1
    fi

    if [[ ! -f "${R1}" ]]; then
        echo "ERROR: R1 file missing: ${R1}"
        exit 1
    fi

    if [[ ! -f "${R2}" ]]; then
        echo "WARNING: R2 missing, running single-end mode"
        MODE="single"
    else
        MODE="paired"
    fi

    echo "All checks passed. Starting pipeline (${MODE})..."

Template 2: decide whether to realign based on the mapping rate
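
    # Assumes ${bam} is already set and bc is installed
    MAPPING_RATE=$(samtools flagstat "${bam}" | \
        grep "mapped (" | grep -oP '\d+\.\d+(?=%)' | head -1)

    # Bash arithmetic is integer-only, so the floating-point comparison goes through bc
    if [[ $(echo "${MAPPING_RATE} < 70" | bc -l) -eq 1 ]]; then
        echo "WARNING: Low mapping rate (${MAPPING_RATE}%). Consider different aligner."
        # or send an email notification
    fi

Template 3: if-elif-else with multiple branches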

    # Column 7 of `seqkit stats` is avg_len; cut drops the decimal part
    READ_LENGTH=$(seqkit stats "${fastq}" | tail -1 | awk '{print $7}' | cut -d. -f1)

    if [[ "${READ_LENGTH}" -lt 50 ]]; then
        ALIGNER="bowtie"                  # very short reads
    elif [[ "${READ_LENGTH}" -lt 150 ]]; then
        ALIGNER="bwa"                     # short reads
    elif [[ "${READ_LENGTH}" -lt 1000 ]]; then
        ALIGNER="minimap2 -x sr"          # medium length
    else
        ALIGNER="minimap2 -x map-ont"     # long reads
    fi

    echo "Auto-selected aligner: ${ALIGNER}"

Template 4: short-circuit evaluation, a one-line check reused for several tools

    # Check that every required tool is installed
    check_tool() {
        command -v "$1" >/dev/null 2>&1 || {
            echo "ERROR: $1 not installed"
            exit 1
        }
    }

    for tool in bwa samtools bcftools fastp seqkit; do
        check_tool "${tool}"
    done

    echo "All tools available!"

Template 5: validate input with a regex match

    SRA_ID="SRR12345678"

    if [[ "${SRA_ID}" =~ ^(SRR|ERR|DRR)[0-9]{6,}$ ]]; then
        echo "Valid SRA ID: ${SRA_ID}"
    else
        echo "ERROR: Invalid SRA ID format"
        exit 1
    fi

Template 6: choose the next step from the previous step's exit code

    # Run the alignment (this $? pattern only works when set -e is NOT active)
    bwa mem -t 16 ref.fa reads.fq > aln.sam
    ALN_EXIT=$?

    if [[ ${ALN_EXIT} -eq 0 ]]; then
        echo "Alignment OK, sorting..."
        samtools sort -@ 8 aln.sam -o aln.bam
    else
        echo "ERROR: Alignment failed with code ${ALN_EXIT}"
        exit "${ALN_EXIT}"
    fi

4. case: clearer than if-elif for multiple branches
With three or more branches, case is far more readable than if-elif-else:

    INPUT_FMT="${1:-fastq}"

    case "${INPUT_FMT}" in
        fastq|fq)
            echo "FASTQ mode"
            EXT="fastq.gz"
            ;;
        bam|sam)
            echo "BAM/SAM mode"
            EXT="bam"
            ;;
        vcf)
            echo "VCF mode"
            EXT="vcf.gz"
            ;;
        *)
            echo "Unknown format: ${INPUT_FMT}"
            echo "Supported: fastq, bam, vcf"
            exit 1
            ;;
    esac

A typical bioinformatics use for case:

    # Pick an action based on the file extension
    for file in *; do
        case "${file}" in
            *.fastq.gz|*.fq.gz)
                zcat "${file}" | wc -l
                ;;
            *.bam)
                samtools flagstat "${file}"
                ;;
            *.vcf.gz)
                bcftools stats "${file}" | head -5
                ;;
            *.log)
                tail -20 "${file}"
                ;;
        esac
    done

5. End-to-end example: batch RNA-seq preprocessing in Bash

    #!/bin/bash
    set -euo pipefail

    # ========== Configuration ==========
    DATA_DIR="./raw_data"
    OUT_DIR="./processed"
    REF="/opt/refs/hg38.fa"
    THREADS=16
    FAILED_LOG="failed_samples.txt"

    > "${FAILED_LOG}"    # truncate the failure log

    # ========== Pre-flight checks ==========
    for tool in fastp hisat2 samtools bc; do
        if ! command -v "${tool}" >/dev/null 2>&1; then
            echo "ERROR: ${tool} not found in PATH"
            exit 1
        fi
    done

    [[ -f "${REF}" ]] || { echo "ERROR: Ref genome missing"; exit 1; }
    mkdir -p "${OUT_DIR}/qc_reports" "${OUT_DIR}/bam" "${OUT_DIR}/logs"

    # ========== Main loop ==========
    R1_FILES=("${DATA_DIR}"/*_R1.fastq.gz)
    TOTAL_SAMPLES=${#R1_FILES[@]}
    CURRENT=0

    for r1 in "${R1_FILES[@]}"; do
        # Not ((CURRENT++)): it returns status 1 when CURRENT is 0, and set -e would exit
        CURRENT=$((CURRENT + 1))

        # --- Pair up R2 ---
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        sample=$(basename "${r1}" _R1.fastq.gz)

        if [[ ! -f "${r2}" ]]; then
            echo "[${CURRENT}/${TOTAL_SAMPLES}] SKIP ${sample}: R2 missing" | tee -a "${FAILED_LOG}"
            continue
        fi

        echo "[${CURRENT}/${TOTAL_SAMPLES}] Processing ${sample}..."

        # --- fastp quality control ---
        if [[ -f "${OUT_DIR}/qc_reports/${sample}_fastp.json" ]]; then
            echo "  QC report exists, skipping fastp"
        else
            fastp -i "${r1}" -I "${r2}" \
                -o "${OUT_DIR}/${sample}_R1.fq.gz" \
                -O "${OUT_DIR}/${sample}_R2.fq.gz" \
                -j "${OUT_DIR}/qc_reports/${sample}_fastp.json" \
                -h "${OUT_DIR}/qc_reports/${sample}_fastp.html" \
                -w "${THREADS}" \
                2>&1 | tee "${OUT_DIR}/logs/${sample}_fastp.log"
        fi

        # --- HISAT2 alignment ---
        bam="${OUT_DIR}/bam/${sample}.bam"

        if [[ -f "${bam}" ]] && [[ -s "${bam}" ]]; then
            echo "  BAM exists, skipping alignment"
        else
            # Test the pipeline directly: under set -e, a separate `$?` check
            # after a failed pipeline would never be reached.
            # ${REF%.*} assumes the HISAT2 index prefix is the FASTA path without its extension.
            if ! hisat2 -p "${THREADS}" -x "${REF%.*}" \
                    -1 "${OUT_DIR}/${sample}_R1.fq.gz" \
                    -2 "${OUT_DIR}/${sample}_R2.fq.gz" \
                    2> "${OUT_DIR}/logs/${sample}_align.log" | \
                 samtools sort -@ "${THREADS}" -o "${bam}" -; then
                echo "[${CURRENT}/${TOTAL_SAMPLES}] FAILED: ${sample}" | tee -a "${FAILED_LOG}"
                continue
            fi

            samtools index "${bam}"
        fi

        # --- Mapping rate check ---
        # `|| true` keeps set -e / pipefail from aborting when grep finds nothing
        mapping_rate=$(grep "overall alignment rate" "${OUT_DIR}/logs/${sample}_align.log" | \
            grep -oP '\d+\.\d+%' | head -1 || true)
        echo "  ${sample}: mapping rate = ${mapping_rate}"

        # --- Low mapping rate warning ---
        if [[ -n "${mapping_rate}" ]]; then
            rate_num="${mapping_rate%\%}"
            if [[ $(echo "${rate_num} < 70" | bc -l) -eq 1 ]]; then
                echo "  ⚠ WARNING: Low mapping rate for ${sample}!" | tee -a "low_mapping_samples.txt"
            fi
        fi
    done

    # ========== Summary ==========
    echo ""
    echo "=========================================="
    echo "Pipeline complete!"
    echo "Total samples: ${TOTAL_SAMPLES}"
    echo "Failed: $(wc -l < "${FAILED_LOG}")"
    echo "=========================================="

This script exercises nearly everything in this article: for loops, if tests, conditional skipping, exit-code checks, arrays, parameter expansion, and log output.
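One interaction in strict-mode scripts like this deserves a closer look: with set -e active, a failing command ends the script immediately, so a `$?` check placed on the following line is never reached. The sketch below demonstrates this in two child shells; failing_step is a hypothetical stand-in for a real pipeline step.

```shell
#!/usr/bin/env bash
# Pattern A: run the step, then inspect $? on the next line.
# Under set -e the child shell exits at the failure, so nothing is printed.
out_a=$(bash -c '
    set -e
    failing_step() { return 3; }    # hypothetical failing step
    failing_step
    echo "checked status $?"
' 2>&1) || true

# Pattern B: test the command directly in the if condition.
# set -e is suspended for commands evaluated as a condition.
out_b=$(bash -c '
    set -e
    failing_step() { return 3; }    # hypothetical failing step
    if ! failing_step; then
        echo "caught failure"
    fi
    echo "script continued"
')

echo "A: [${out_a}]"    # A: []
echo "B: [${out_b}]"    # B: [caught failure, then "script continued" on a second line]
```

If the numeric exit code itself is needed, `rc=0; some_command || rc=$?` keeps it available without tripping set -e.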
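6. Pitfalls
Pitfall 1: for file in *.fastq.gz becomes a literal string when nothing matches
Symptom: when the directory contains no .fastq.gz files, for f in *.fastq.gz; do still runs once, and $f holds the literal string *.fastq.gz.

    # ✓ Fix: check first, or use shopt
    shopt -s nullglob    # unmatched globs expand to nothing
    for f in *.fastq.gz; do
        echo "Processing $f"
    done
    shopt -u nullglob    # restore the default

    # Or test before looping
    files=(*.fastq.gz)
    if [[ ! -e "${files[0]}" ]]; then
        echo "No FASTQ files found"; exit 1
    fi

Pitfall 2: while read skips the last line (missing trailing newline)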
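Symptom: if the last line of the file has no trailing newline, read returns a non-zero status on it, so the loop body never sees that line.

    # ✓ Add || [[ -n "$line" ]] as a fallback
    while IFS= read -r line || [[ -n "$line" ]]; do
        echo "$line"
    done < file.txt

Pitfall 3: if [ $a == $b ] fails when a variable is empty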
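Symptom: with a="", [ $a == "hello" ] expands to [ == hello ], and the [ command fails with an error (unary operator expected).

    # ✓ Use [[ ]] (recommended) or add quotes
    [[ $a == "hello" ]]    # Bash keyword, safe
    [ "$a" == "hello" ]    # traditional form, quotes are mandatory (POSIX sh spells it =)

Pitfall 4: spaces and newlines when a for loop iterates over command output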
Symptom: for f in $(find . -name "*.bam") splits any filename that contains spaces.

    # ✓ Use find -print0 with while read -d ''
    find . -name "*.bam" -print0 | while IFS= read -r -d '' f; do
        echo "Processing: ${f}"
    done

    # Or enable globstar and drop find
    shopt -s globstar
    for f in **/*.bam; do
        echo "Processing: ${f}"
    done

Pitfall 5: changes to global variables inside a loop are lost in a pipeline
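Covered in detail in section 2.2. Here is an additional fix using process substitution:

    # If the input has to come from a command, use process substitution to keep the loop in the current shell
    total=0
    while read -r n; do
        ((total += n))
    done < <(cat counts.txt)    # process substitution, not a pipe
    echo "${total}"             # correct

Pitfall 6: break only exits the innermost loop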
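Symptom: inside nested loops, a plain break only leaves the inner loop.

    # Use break N to exit several levels
    for i in {1..5}; do
        for j in {1..5}; do
            if [[ $i -eq 3 && $j -eq 3 ]]; then
                break 2    # exit both loops
            fi
        done
    done

Pitfall 7: i is uninitialized in (( i++ ))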
Bash treats an unset variable as 0 in arithmetic, which is normally fine. Under set -u, however:

    set -u
    for f in *.txt; do
        (( count++ ))    # error: count: unbound variable
    done

    # ✓ Initialize first
    count=0
    for f in *.txt; do
        (( count++ ))
    done
    echo "${count}"

7. Summary
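| Need | Use this | In one line |
|---|---|---|
| Iterate over a fixed list | for i in list | Simple and direct |
| Loop by count or range | for ((i=1;i<=N;i++)) | C-style loop |
| Read a file line by line | while read -r | Handles large sample lists |
| Wait for a condition | while true / until | Monitoring tasks |
| Simple test | if [[ condition ]] | 90% of cases |
| Multiple branches | case ... esac | Use with 3 or more branches |
| Act on the previous step's result | if command; then | Use the command's exit status directly |
Loops and conditionals are the skeleton of a shell script. Keep the 6 + 6 + 2 = 14 templates above handy: next time you write a bioinformatics script, copying one and adjusting the parameters is much faster than starting from scratch.
Keep these two constructs straight: [[ ]] and (( )). The former tests strings and files; the latter does arithmetic. Don't overthink it: use [[ ]] for strings and (( )) for numbers.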
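A quick sketch of why the distinction matters: the same two values compare differently depending on the construct. The depth values are arbitrary examples.

```shell
#!/usr/bin/env bash
depth_a=10
depth_b=9

# Inside [[ ]], < compares strings: "10" sorts before "9", so this is true.
if [[ "${depth_a}" < "${depth_b}" ]]; then string_says="10 < 9"; else string_says="10 >= 9"; fi

# Inside (( )), < compares integers.
if (( depth_a < depth_b )); then arith_says="10 < 9"; else arith_says="10 >= 9"; fi

# -lt inside [[ ]] is also an integer comparison.
if [[ "${depth_a}" -lt "${depth_b}" ]]; then lt_says="10 < 9"; else lt_says="10 >= 9"; fi

echo "[[ < ]]:   ${string_says}"    # 10 < 9   (string order)
echo "(( < )):   ${arith_says}"     # 10 >= 9  (numeric order)
echo "[[ -lt ]]: ${lt_says}"        # 10 >= 9  (numeric order)
```

So numeric thresholds belong in (( )) or behind -lt/-gt; a bare < or > inside [[ ]] is a string comparison.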
This article was tested on 2025-03-15 on Debian 12 (Bash 5.2). All code can be run directly.