Bash Loops and Conditionals: A Complete Guide to for/if/case


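for and if are the two most frequently used control structures in shell scripts. Batch-processing FASTQ files, checking alignment status, iterating over chromosomes, rerunning failed samples: almost every bioinformatics script depends on loops and conditionals. This article covers for/while/until loops, if/case branches, a quick reference for the test command, and 6 batch-processing templates.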

Test environment: Debian 12, Bash 5.2.

1. for loops: the foundation of batch processing in bioinformatics

1.1 Basic syntax

Terminal window
# Form 1: iterate over a list (most common)
for sample in sample1 sample2 sample3; do
    echo "Processing ${sample}"
done
# Form 2: brace expansion
for i in {1..10}; do
    echo "Round ${i}"
done
# Form 3: C-style
for ((i=1; i<=10; i++)); do
    echo "Index: ${i}"
done
# Form 4: command substitution (unsafe for file names with whitespace)
for file in $(ls *.fastq.gz); do
    echo "Found: ${file}"
done

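Forms 1 and 3 are strongly recommended. The $(ls) in form 4 has a classic problem: its output is word-split, so a file name containing a space is broken into several items. Safe ways to iterate over files are covered later, in pitfall 4.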
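To make the word-splitting problem concrete, here is a minimal sketch that runs both styles over the same temporary directory. The directory and the two file names are made up for the demonstration; nothing here comes from a real dataset.

```bash
#!/usr/bin/env bash
# Sketch: compare $(ls) with a plain glob on a file name containing a space.
# The file names below are hypothetical.
demo_dir=$(mktemp -d)
touch "${demo_dir}/sample A.fastq.gz" "${demo_dir}/sampleB.fastq.gz"

ls_count=0
for f in $(ls "${demo_dir}"); do        # unquoted output is split on whitespace
    ls_count=$((ls_count + 1))
done

glob_count=0
for f in "${demo_dir}"/*.fastq.gz; do   # a glob yields exactly one word per file
    glob_count=$((glob_count + 1))
done

echo "ls loop iterations:   ${ls_count}"    # 3: "sample", "A.fastq.gz", "sampleB.fastq.gz"
echo "glob loop iterations: ${glob_count}"  # 2
rm -rf "${demo_dir}"
```

Two files produce three iterations in the $(ls) loop and two in the glob loop, which is why the glob form is the one to reach for.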

1.2 Six for-loop templates for bioinformatics

Template 1: batch-process paired FASTQ files

#!/bin/bash
set -euo pipefail
# Assumed file name format: Sample_S1_L001_R1_001.fastq.gz
mkdir -p clean reports               # make sure the output directories exist
R1_FILES=(*_R1_*.fastq.gz)
for r1 in "${R1_FILES[@]}"; do
    r2="${r1/_R1_/_R2_}"             # swap R1 for R2
    sample_id="${r1%%_R1*}"          # extract the sample ID
    sample_id="${sample_id##*/}"     # strip the directory part, if any
    echo "=== Processing ${sample_id} ==="
    # fastp quality control
    fastp -i "${r1}" -I "${r2}" \
        -o "clean/${sample_id}_R1.fastq.gz" \
        -O "clean/${sample_id}_R2.fastq.gz" \
        -j "reports/${sample_id}_fastp.json" \
        -h "reports/${sample_id}_fastp.html" \
        -w 8
    echo "Done: ${sample_id}"
done
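The three parameter expansions in Template 1 are easy to misread, so here is a small sketch that applies them to a single string. The path raw/Sample_S1_L001_R1_001.fastq.gz is a hypothetical example that follows the naming format assumed above.

```bash
#!/usr/bin/env bash
# Sketch: the parameter expansions from Template 1, applied to one hypothetical path.
r1="raw/Sample_S1_L001_R1_001.fastq.gz"

r2="${r1/_R1_/_R2_}"           # replace the first "_R1_" with "_R2_"
sample_id="${r1%%_R1*}"        # remove the longest suffix matching "_R1*"
sample_id="${sample_id##*/}"   # remove the longest prefix matching "*/"

echo "R2 file:   ${r2}"         # raw/Sample_S1_L001_R2_001.fastq.gz
echo "Sample ID: ${sample_id}"  # Sample_S1_L001
```

No external commands such as sed or basename are involved, so these expansions stay cheap even inside a loop over hundreds of files.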

Template 2: iterate over BAM files and collect statistics

Terminal window
for bam in alignments/*.bam; do
    sample=$(basename "${bam}" .bam)
    # flagstat summary
    samtools flagstat "${bam}" > "stats/${sample}_flagstat.txt"
    # mean depth (-a also reports positions with zero coverage)
    samtools depth -a "${bam}" | \
        awk '{sum+=$3; count++} END {if (count) print "Mean depth:", sum/count}' \
        > "stats/${sample}_depth.txt"
    echo "${sample}: $(cat "stats/${sample}_depth.txt")"
done

Template 3: per-chromosome analysis

Terminal window
CHROMS=(chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 \
        chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 \
        chr21 chr22 chrX chrY chrM)
INPUT_VCF="merged_variants.vcf.gz"
for chrom in "${CHROMS[@]}"; do
    echo "Splitting ${chrom}..."
    bcftools view -r "${chrom}" "${INPUT_VCF}" \
        -o "by_chrom/${chrom}.vcf.gz" -O z
    bcftools index "by_chrom/${chrom}.vcf.gz"
done

Template 4: a loop with a counter to show progress

Terminal window
files=(*.fastq.gz)
total=${#files[@]}
current=0
for f in "${files[@]}"; do
    ((current++))
    echo "[${current}/${total}] Processing ${f}..."
    fastp -i "${f}" -o "clean/${f}" -w 4
done
echo "All ${total} files processed!"

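${#files[@]} returns the array length and ((current++)) is an arithmetic increment. This pattern is especially useful with hundreds of samples, because you always know how far the run has progressed. One caveat: ((current++)) returns a non-zero exit status when the old value is 0, so in a script that uses set -e, write current=$((current + 1)) instead.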
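The exit status of an arithmetic increment is easy to overlook, so here is a minimal sketch that captures it. The variable names are made up for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch: (( expr )) returns status 1 when the expression evaluates to 0.
count=0
first_status=0
(( count++ )) || first_status=$?   # post-increment yields the old value 0, so status is 1

safe=0
safe=$((safe + 1))                 # a plain assignment; its status is 0
safe_status=$?

echo "count=${count} first_status=${first_status}"   # count=1 first_status=1
echo "safe=${safe} safe_status=${safe_status}"       # safe=1 safe_status=0
```

Both variables end up at 1, but only the first form reports a failure on the way there, which is enough to stop a script running under set -e.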

Template 5: nested loops for combining conditions

Terminal window
SAMPLES=(WT_1 WT_2 KO_1 KO_2)
TOOLS=(bwa bowtie2 hisat2)
for sample in "${SAMPLES[@]}"; do
    for tool in "${TOOLS[@]}"; do
        outdir="results/${tool}/${sample}"
        mkdir -p "${outdir}"
        echo "Aligning ${sample} with ${tool}..."
        # each tool needs its own command here...
    done
done

Template 6: for loop with conditional skipping (essential for troubleshooting)

Terminal window
FAILED_SAMPLES=()                    # collect the samples that failed
for bam in *.bam; do
    sample=$(basename "${bam}" .bam)
    # Skip if the result already exists and is newer than the BAM
    if [[ -f "variants/${sample}.vcf.gz" ]] && \
       [[ "variants/${sample}.vcf.gz" -nt "${bam}" ]]; then
        echo "Skipping ${sample} (already done)"
        continue
    fi
    echo "Calling variants for ${sample}..."
    # -Oz writes compressed output, matching the .vcf.gz file name
    if ! bcftools mpileup -f ref.fa "${bam}" | \
         bcftools call -mv -Oz -o "variants/${sample}.vcf.gz"; then
        echo "ERROR: ${sample} failed!"
        rm -f "variants/${sample}.vcf.gz"   # a partial file would look finished on the next run
        FAILED_SAMPLES+=("${sample}")
    fi
done
echo "Failed samples: ${FAILED_SAMPLES[@]:-None}"

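continue skips the rest of the current iteration; break leaves the loop entirely. ${FAILED_SAMPLES[@]:-None} uses the default-value form of parameter expansion: when the array is empty, None is printed instead.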
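The sketch below shows continue, break, and the :- default together, with no external tools involved. The sample names and the bad_ and STOP markers are invented for the demonstration.

```bash
#!/usr/bin/env bash
# Sketch: continue vs. break, plus the ":-" default on an empty array.
# Sample names and the "bad_" / "STOP" markers are hypothetical.
failed=()
processed=()
for sample in S1 bad_S2 S3 STOP S4; do
    if [[ "${sample}" == "STOP" ]]; then
        break                     # leave the loop entirely; S4 is never visited
    fi
    if [[ "${sample}" == bad_* ]]; then
        failed+=("${sample}")
        continue                  # skip the rest of this iteration only
    fi
    processed+=("${sample}")
done

none_failed=()
echo "Processed: ${processed[*]}"                      # S1 S3
echo "Failed: ${failed[*]:-None}"                      # bad_S2
echo "Failed (empty array): ${none_failed[*]:-None}"   # None
```

I used [*] rather than [@] inside the echo strings so that the array is joined into a single word; for a plain echo both print the same text.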

2. while loops: handling dynamic input

2.1 Reading a file line by line

Terminal window
# Read the sample list
while IFS= read -r sample; do
    [[ -z "${sample}" || "${sample}" == \#* ]] && continue   # skip blank lines and comments
    echo "Processing: ${sample}"
    # ...processing logic
done < sample_list.txt

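Never drop the IFS= read -r combination: -r disables backslash escape processing, and IFS= preserves leading and trailing whitespace. Countless people have been bitten by this.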
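To see what each half of the combination does, this sketch reads the same line twice. The line content (two leading spaces and a literal backslash followed by t) is contrived so that both effects show up at once.

```bash
#!/usr/bin/env bash
# Sketch: plain read vs. IFS= read -r on a contrived line.
list=$(mktemp)
printf '%s\n' '  WT_1\t' > "${list}"   # two leading spaces, then a literal backslash and "t"

read plain < "${list}"                 # default IFS trims; the backslash is consumed
IFS= read -r exact < "${list}"         # nothing trimmed, nothing unescaped

printf 'plain: [%s]\n' "${plain}"      # plain: [WT_1t]
printf 'exact: [%s]\n' "${exact}"      # exact: [  WT_1\t]
rm -f "${list}"
```

The plain read silently changes the data, and in a sample sheet that kind of change can go unnoticed until a file lookup fails much later.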

2.2 while in a pipeline: beware of the subshell

Terminal window
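# ✗ Wrong: a while loop in a pipeline runs in a subshell, so its variables do not propagate back
total=0
cat counts.txt | while read -r n; do
    ((total += n))
done
echo "${total}"   # prints 0, not what you want
# ✓ Right: use redirection instead of a pipe
total=0
while read -r n; do
    ((total += n))
done < counts.txt
echo "${total}"   # prints the accumulated sum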

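I fell into this trap at least three times before it stuck. By default Bash runs each command of a pipeline in a subshell, so changes made to variables inside the while loop are lost when the loop ends.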

2.3 Infinite loop with a conditional exit

Terminal window
# Poll until the flag file appears
while true; do
    if [[ -f "pipeline_complete.flag" ]]; then
        echo "Pipeline finished!"
        break
    fi
    echo "Waiting..."
    sleep 60
done
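A polling loop like this one never ends if the flag file never appears. As a sketch of one way to bound it, the hypothetical helper wait_for_file below gives up after a fixed number of checks; the function name, its arguments, and the defaults are my own assumptions, not part of any tool.

```bash
#!/usr/bin/env bash
# Sketch of a bounded polling loop.
# wait_for_file PATH [MAX_TRIES] [INTERVAL_SECONDS]  (hypothetical helper)
# Returns 0 as soon as PATH exists, 1 after MAX_TRIES unsuccessful checks.
wait_for_file() {
    local path="$1" max_tries="${2:-60}" interval="${3:-60}"
    local try
    for ((try = 1; try <= max_tries; try++)); do
        if [[ -f "${path}" ]]; then
            return 0
        fi
        sleep "${interval}"
    done
    return 1
}

flag_dir=$(mktemp -d)
flag="${flag_dir}/pipeline_complete.flag"
if wait_for_file "${flag}" 2 0; then   # 2 checks, 0 s apart, to keep the demo fast
    echo "Pipeline finished!"
else
    echo "Gave up waiting for ${flag}"
fi
rm -rf "${flag_dir}"
```

In a real run the interval would be something like 60 seconds, and the caller decides whether giving up is an error.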

2.4 until: loop while the condition is false

Terminal window
# Wait until enough disk space is free
until [[ $(df --output=pcent /data | tail -1 | tr -d ' %') -lt 20 ]]; do
    echo "Disk usage still >= 20%, waiting..."
    sleep 300
done
echo "Disk OK, starting download"

3. if conditionals: the decision hub of a bioinformatics pipeline

3.1 test command quick reference

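In Bash, if is followed by a command (usually test or its shorthand [ ]), and the branch is chosen by that command's exit status (0 = true, non-zero = false):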

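| Test | Syntax | Bioinformatics use case |
| --- | --- | --- |
| File exists | [[ -f file ]] | Check that the reference genome exists |
| Directory exists | [[ -d dir ]] | Make sure the output directory has been created |
| File is non-empty | [[ -s file ]] | Check that a BAM is not empty |
| File is readable | [[ -r file ]] | Check permissions |
| File 1 is newer than file 2 | [[ file1 -nt file2 ]] | Skip finished steps in incremental analysis |
| Strings are equal | [[ "$a" == "$b" ]] | Match a sample type |
| String is non-empty | [[ -n "$str" ]] | Check whether an argument was passed |
| Numeric comparison | [[ $a -gt $b ]] | Sequencing depth threshold |
| Regex match | [[ "$a" =~ ^SRR ]] | Validate an SRA ID format |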

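Prefer [[ ]] over [ ]. [[ ]] is a Bash keyword: it supports regular expressions, does not word-split unquoted variables, and does not fail on empty variables.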
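Here is a small sketch of the exit-status idea, using throwaway files instead of real data. The file names empty.bam and ref.fa are placeholders created inside a temporary directory; the "BAM" is just a zero-byte file.

```bash
#!/usr/bin/env bash
# Sketch: if branches on exit status, whether the command is grep or [[ ]].
# All files here are placeholders inside a temporary directory.
work=$(mktemp -d)
: > "${work}/empty.bam"                      # zero-byte placeholder, not a real BAM
printf '>chr1\nACGT\n' > "${work}/ref.fa"

# Any command works after if; here grep's status decides the branch.
if grep -q ACGT "${work}/ref.fa"; then has_seq=yes; else has_seq=no; fi

# [[ ]] is used the same way: status 0 means the test holds.
if [[ -f "${work}/empty.bam" ]]; then exists=yes;   else exists=no;   fi
if [[ -s "${work}/empty.bam" ]]; then nonempty=yes; else nonempty=no; fi
if [[ -d "${work}" ]];           then is_dir=yes;   else is_dir=no;   fi

# An empty, unquoted variable is harmless inside [[ ]].
label=""
if [[ $label == "tumor" ]]; then matched=yes; else matched=no; fi

echo "has_seq=${has_seq} exists=${exists} nonempty=${nonempty} is_dir=${is_dir} matched=${matched}"
rm -rf "${work}"
```

The zero-byte file passes -f but fails -s, which is exactly the distinction that catches a truncated BAM.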

3.2 Six conditional templates for bioinformatics

Template 1: verify inputs before starting the pipeline

Terminal window
REF="/opt/refs/hg38.fa"
R1="sample_R1.fastq.gz"
R2="sample_R2.fastq.gz"
if [[ ! -f "${REF}" ]]; then
    echo "ERROR: Reference genome not found: ${REF}"
    exit 1
fi
if [[ ! -f "${R1}" ]]; then
    echo "ERROR: R1 file missing: ${R1}"
    exit 1
fi
if [[ ! -f "${R2}" ]]; then
    echo "WARNING: R2 missing, running single-end mode"
    MODE="single"
else
    MODE="paired"
fi
echo "All checks passed. Starting pipeline (${MODE})..."

Template 2: decide whether to realign based on the mapping rate

Terminal window
MAPPING_RATE=$(samtools flagstat "${bam}" | \
    grep "mapped (" | grep -oP '\d+\.\d+(?=%)' | head -1)
if [[ $(echo "${MAPPING_RATE} < 70" | bc -l) -eq 1 ]]; then
    echo "WARNING: Low mapping rate (${MAPPING_RATE}%). Consider different aligner."
    # or send an email notification
fi
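Bash arithmetic is integer-only, which is why the template pipes the comparison through bc. On a machine without bc, awk can do the same job. The helper is_below below is a hypothetical name of my own, and the rates in the loop are made-up values.

```bash
#!/usr/bin/env bash
# Sketch: floating-point threshold check with awk instead of bc.
# is_below VALUE THRESHOLD  (hypothetical helper): status 0 if VALUE < THRESHOLD.
is_below() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !((value + 0) < (limit + 0)) }'
}

for rate in 98.5 65.2 9.5; do          # made-up mapping rates
    if is_below "${rate}" 70; then
        echo "WARNING: low mapping rate (${rate}%)"
    else
        echo "OK: ${rate}%"
    fi
done
```

Adding 0 to each operand forces a numeric comparison, so 9.5 is correctly treated as below 70 rather than being compared as text.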

Template 3: if-elif-else with multiple branches

Terminal window
# -T prints plain tab-separated values without thousands separators; column 7 is avg_len
READ_LENGTH=$(seqkit stats -T "${fastq}" | tail -1 | awk -F'\t' '{print $7}' | cut -d. -f1)
if [[ "${READ_LENGTH}" -lt 50 ]]; then
    ALIGNER="bowtie"                # very short reads
elif [[ "${READ_LENGTH}" -lt 150 ]]; then
    ALIGNER="bwa"                   # short reads
elif [[ "${READ_LENGTH}" -lt 1000 ]]; then
    ALIGNER="minimap2 -x sr"        # medium-length reads
else
    ALIGNER="minimap2 -x map-ont"   # long reads
fi
echo "Auto-selected aligner: ${ALIGNER}"

Template 4: short-circuit evaluation, checking several conditions in one line

Terminal window
# Check that all required tools are installed
check_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "ERROR: $1 not installed"
        exit 1
    }
}
for tool in bwa samtools bcftools fastp seqkit; do
    check_tool "${tool}"
done
echo "All tools available!"

Template 5: validate input with a regular expression

Terminal window
SRA_ID="SRR12345678"
if [[ "${SRA_ID}" =~ ^(SRR|ERR|DRR)[0-9]{6,}$ ]]; then
    echo "Valid SRA ID: ${SRA_ID}"
else
    echo "ERROR: Invalid SRA ID format"
    exit 1
fi
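The =~ operator can also capture parts of the match through the BASH_REMATCH array. As a sketch, the pattern below pulls the fields out of a file name in the Sample_S1_L001_R1_001.fastq.gz format used earlier in this article; the pattern itself is my own guess at that layout, not an official specification.

```bash
#!/usr/bin/env bash
# Sketch: capture groups with =~ and BASH_REMATCH.
# The pattern is an assumption about the file naming layout used in this article.
fastq="Sample_S1_L001_R1_001.fastq.gz"
# Keeping the pattern in a variable avoids quoting surprises inside [[ ]].
pattern='^(.+)_S([0-9]+)_L([0-9]{3})_R([12])_001\.fastq\.gz$'

if [[ "${fastq}" =~ ${pattern} ]]; then
    sample_name="${BASH_REMATCH[1]}"
    sample_number="${BASH_REMATCH[2]}"
    lane="${BASH_REMATCH[3]}"
    read_number="${BASH_REMATCH[4]}"
    echo "sample=${sample_name} S=${sample_number} lane=${lane} read=${read_number}"
else
    echo "Unexpected file name: ${fastq}"
fi
```

For this file name the sketch prints sample=Sample S=1 lane=001 read=1. Note that the pattern variable is expanded unquoted on the right-hand side; quoting it would turn the regex into a literal string.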

Template 6: choose the next step from the previous exit code

Terminal window
# Run the alignment
bwa mem -t 16 ref.fa reads.fq > aln.sam
ALN_EXIT=$?
if [[ ${ALN_EXIT} -eq 0 ]]; then
    echo "Alignment OK, sorting..."
    samtools sort -@ 8 aln.sam -o aln.bam
else
    echo "ERROR: Alignment failed with code ${ALN_EXIT}"
    exit "${ALN_EXIT}"
fi
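Template 6 checks a single command. Once the aligner is piped straight into a sorter, $? only reflects the last stage of the pipeline. The sketch below uses false and true as stand-ins for a failing aligner and a succeeding sorter to show the two standard ways around that, PIPESTATUS and pipefail.

```bash
#!/usr/bin/env bash
# Sketch: exit status of a pipeline. "false | true" stands in for
# "failing aligner | succeeding sorter".
set +o pipefail                   # start from Bash's default behaviour
false | true
default_status=$?                 # 0: only the last stage counts

false | true
stages=("${PIPESTATUS[@]}")       # one status per stage: 1 and 0

set -o pipefail
if false | true; then
    pipefail_result="ok"
else
    pipefail_result="failed"      # with pipefail a failing stage fails the pipeline
fi
set +o pipefail

echo "default=${default_status} stages=${stages[*]} pipefail=${pipefail_result}"
# default=0 stages=1 0 pipefail=failed
```

PIPESTATUS has to be copied immediately, because the next command overwrites it.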

4. case: clearer than if-elif for multiple branches

With more than three branches, case is far more readable than if-elif-else:

Terminal window
INPUT_FMT="${1:-fastq}"
case "${INPUT_FMT}" in
    fastq|fq)
        echo "FASTQ mode"
        EXT="fastq.gz"
        ;;
    bam|sam)
        echo "BAM/SAM mode"
        EXT="bam"
        ;;
    vcf)
        echo "VCF mode"
        EXT="vcf.gz"
        ;;
    *)
        echo "Unknown format: ${INPUT_FMT}"
        echo "Supported: fastq, bam, vcf"
        exit 1
        ;;
esac

A scenario in bioinformatics where case fits well:

Terminal window
# Choose an action based on the file extension
for file in *; do
    case "${file}" in
        *.fastq.gz|*.fq.gz)
            zcat "${file}" | wc -l
            ;;
        *.bam)
            samtools flagstat "${file}"
            ;;
        *.vcf.gz)
            bcftools stats "${file}" | head -5
            ;;
        *.log)
            tail -20 "${file}"
            ;;
    esac
done
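The same pattern matching works nicely inside a function, which makes it reusable and easy to check without any bioinformatics tools installed. The helper detect_format and its labels are hypothetical names for this sketch, as are the file names in the loop.

```bash
#!/usr/bin/env bash
# Sketch: case as a reusable classifier.
# detect_format FILE  (hypothetical helper): prints a format label.
detect_format() {
    case "$1" in
        *.fastq.gz|*.fq.gz|*.fastq|*.fq) echo "fastq" ;;
        *.bam|*.sam)                     echo "alignment" ;;
        *.vcf|*.vcf.gz)                  echo "vcf" ;;
        *)                               echo "unknown" ;;   # explicit fallback
    esac
}

for name in reads_R1.fq.gz aln.bam calls.vcf.gz notes.txt; do   # made-up names
    printf '%-16s %s\n' "${name}" "$(detect_format "${name}")"
done
```

Patterns are tried top to bottom and the first match wins, so the catch-all * branch has to come last.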

5. End-to-end example: batch RNA-seq preprocessing in Bash

#!/bin/bash
set -euo pipefail
# ========== Configuration ==========
DATA_DIR="./raw_data"
OUT_DIR="./processed"
REF="/opt/refs/hg38.fa"
THREADS=16
FAILED_LOG="failed_samples.txt"
> "${FAILED_LOG}"   # truncate the failure log
# ========== Pre-flight checks ==========
for tool in fastp hisat2 samtools; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
        echo "ERROR: ${tool} not found in PATH"
        exit 1
    fi
done
[[ -f "${REF}" ]] || { echo "ERROR: Ref genome missing"; exit 1; }
mkdir -p "${OUT_DIR}/qc_reports" "${OUT_DIR}/bam" "${OUT_DIR}/logs"
# ========== Main loop ==========
R1_FILES=("${DATA_DIR}"/*_R1.fastq.gz)
TOTAL_SAMPLES=${#R1_FILES[@]}
CURRENT=0
for r1 in "${R1_FILES[@]}"; do
    # ((CURRENT++)) returns status 1 when CURRENT is 0, which would abort under set -e
    CURRENT=$((CURRENT + 1))
    # --- Pair with R2 ---
    r2="${r1/_R1/_R2}"
    sample=$(basename "${r1}" _R1.fastq.gz)
    if [[ ! -f "${r2}" ]]; then
        echo "[${CURRENT}/${TOTAL_SAMPLES}] SKIP ${sample}: R2 missing" | tee -a "${FAILED_LOG}"
        continue
    fi
    echo "[${CURRENT}/${TOTAL_SAMPLES}] Processing ${sample}..."
    # --- fastp quality control ---
    if [[ -f "${OUT_DIR}/qc_reports/${sample}_fastp.json" ]]; then
        echo " QC report exists, skipping fastp"
    else
        fastp -i "${r1}" -I "${r2}" \
            -o "${OUT_DIR}/${sample}_R1.fq.gz" \
            -O "${OUT_DIR}/${sample}_R2.fq.gz" \
            -j "${OUT_DIR}/qc_reports/${sample}_fastp.json" \
            -h "${OUT_DIR}/qc_reports/${sample}_fastp.html" \
            -w "${THREADS}" \
            2>&1 | tee "${OUT_DIR}/logs/${sample}_fastp.log"
    fi
    # --- HISAT2 alignment ---
    bam="${OUT_DIR}/bam/${sample}.bam"
    if [[ -f "${bam}" ]] && [[ -s "${bam}" ]]; then
        echo " BAM exists, skipping alignment"
    else
        # Test the pipeline inside if: with set -e and pipefail, a bare failing
        # pipeline would abort the script before any $? check could run
        if ! hisat2 -p "${THREADS}" -x "${REF%.*}" \
                -1 "${OUT_DIR}/${sample}_R1.fq.gz" \
                -2 "${OUT_DIR}/${sample}_R2.fq.gz" \
                2> "${OUT_DIR}/logs/${sample}_align.log" | \
                samtools sort -@ "${THREADS}" -o "${bam}" -; then
            echo "[${CURRENT}/${TOTAL_SAMPLES}] FAILED: ${sample}" | tee -a "${FAILED_LOG}"
            rm -f "${bam}"   # a partial BAM would be skipped as finished on the next run
            continue
        fi
        samtools index "${bam}"
    fi
    # --- Mapping rate check ---
    # || true: a missing match must not abort the script under set -e and pipefail
    mapping_rate=$(grep "overall alignment rate" "${OUT_DIR}/logs/${sample}_align.log" | \
        grep -oP '\d+\.\d+%' | head -1 || true)
    echo " ${sample}: mapping rate = ${mapping_rate}"
    # --- Low mapping rate warning ---
    if [[ -n "${mapping_rate}" ]]; then
        rate_num="${mapping_rate%\%}"
        if [[ $(echo "${rate_num} < 70" | bc -l) -eq 1 ]]; then
            echo " ⚠ WARNING: Low mapping rate for ${sample}!" | tee -a "low_mapping_samples.txt"
        fi
    fi
done
# ========== Summary ==========
echo ""
echo "=========================================="
echo "Pipeline complete!"
echo "Total samples: ${TOTAL_SAMPLES}"
echo "Failed: $(wc -l < "${FAILED_LOG}")"
echo "=========================================="

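This script exercises nearly everything covered in this article: for loops, if tests, conditional skipping, exit-code checks, arrays, parameter expansion, and log output.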

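6. Pitfalls

Pitfall 1: for file in *.fastq.gz becomes a literal string when nothing matches

Symptom: when the directory contains no .fastq.gz files, the loop for f in *.fastq.gz; do still runs once, and $f holds the literal string *.fastq.gz.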

Terminal window
# ✓ Fix: check first, or use shopt
shopt -s nullglob   # expand to nothing when there is no match
for f in *.fastq.gz; do
    echo "Processing $f"
done
shopt -u nullglob   # restore the default
# Or test before looping
files=(*.fastq.gz)
if [[ ! -e "${files[0]}" ]]; then
    echo "No FASTQ files found"; exit 1
fi

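Pitfall 2: while read skips the last line when the trailing newline is missing

Symptom: when the last line of the file has no trailing newline, read still fills the variable but returns a non-zero status, so the loop body never runs for that line.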

Terminal window
# ✓ Add || [[ -n "$line" ]] as a safety net
while IFS= read -r line || [[ -n "$line" ]]; do
    echo "$line"
done < file.txt
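To see the difference in numbers, this sketch counts the iterations of both loop forms over a three-line file whose last line has no newline. The chromosome names are just filler content.

```bash
#!/usr/bin/env bash
# Sketch: count loop iterations with and without the safety net.
input=$(mktemp)
printf 'chr1\nchr2\nchr3' > "${input}"   # no newline after chr3

naive=0
while IFS= read -r line; do
    naive=$((naive + 1))
done < "${input}"

guarded=0
while IFS= read -r line || [[ -n "${line}" ]]; do
    guarded=$((guarded + 1))
done < "${input}"

echo "naive=${naive} guarded=${guarded}"   # naive=2 guarded=3
rm -f "${input}"
```

Files exported from spreadsheets or written by some editors often lack the final newline, so the guarded form is the safer default for sample lists.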

Pitfall 3: an empty variable in if [ $a == $b ] causes a syntax error

Symptom: with a="", [ $a == "hello" ] expands to [ == hello ], and the [ command reports a syntax error.

Terminal window
# ✓ Use [[ ]] (recommended) or add quotes
[[ $a == "hello" ]]   # Bash keyword, safe without quotes
[ "$a" == "hello" ]   # traditional form, quotes are mandatory

Pitfall 4: whitespace and newlines when for iterates over command output

Symptom: for f in $(find . -name "*.bam") splits any file name that contains a space.

Terminal window
# ✓ Use find -print0 together with while read -d ''
find . -name "*.bam" -print0 | while IFS= read -r -d '' f; do
    echo "Processing: ${f}"
done
# Or enable globstar instead of calling find
shopt -s globstar
for f in **/*.bam; do
    echo "Processing: ${f}"
done

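Pitfall 5: changes to global variables inside a piped loop are lost

This is covered in detail in section 2.2. Process substitution is one more fix: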

Terminal window
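# When the input has to come from a command, process substitution keeps the loop in the current shell
total=0
while read -r n; do
    ((total += n))
done < <(cat counts.txt)   # process substitution, not a pipe
echo "${total}"            # correct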

Pitfall 6: break only leaves the innermost loop

Symptom: in nested loops, break exits only the inner loop.

Terminal window
# Use break N to leave several levels
for i in {1..5}; do
    for j in {1..5}; do
        if [[ $i -eq 3 && $j -eq 3 ]]; then
            break 2   # leave both loops
        fi
    done
done

Pitfall 7: (( i++ )) when i is uninitialized

Bash treats an unset variable as 0 in arithmetic, which is fine by default. Under set -u, however:

Terminal window
set -u
for f in *.txt; do
    (( count++ ))   # error: count: unbound variable
done
# ✓ Initialize first
count=0
for f in *.txt; do
    (( count++ ))
done
echo "${count}"

7. Summary

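| Need | Use this | In one line |
| --- | --- | --- |
| Iterate over a fixed list | for i in list | Simple and direct |
| Loop by count or range | for ((i=1;i<=N;i++)) | C-style loop |
| Read a file line by line | while read -r | Handles large sample lists |
| Wait for a condition | while true / until | Monitoring tasks |
| Simple test | if [[ condition ]] | 90% of cases |
| Multiple branches | case ... esac | Use with more than 3 branches |
| Act on the previous step's result | if command; then | Uses the command's exit status directly |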

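Loops and conditionals are the skeleton of a shell script. Keep the 6 + 6 + 2 = 14 templates above at hand: copying one and changing the parameters is far faster than writing a bioinformatics script from scratch.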

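Remember these two constructs: [[ ]] and (( )). The former tests strings and files; the latter evaluates arithmetic. As a rule of thumb, strings go in [[ ]] and numbers go in (( )).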
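A quick sketch of why the distinction matters: the < operator means something different in each construct. The values 10 and 9 are chosen so that string order and numeric order disagree.

```bash
#!/usr/bin/env bash
# Sketch: the same operator, two meanings.
a=10
b=9

# Inside [[ ]], < compares strings: "10" sorts before "9".
if [[ "${a}" < "${b}" ]]; then string_says="10 < 9"; else string_says="10 >= 9"; fi

# Inside (( )), < compares integers.
if (( a < b )); then arith_says="10 < 9"; else arith_says="10 >= 9"; fi

# [[ ]] can compare integers too, but only through -lt, -gt, -eq and friends.
if [[ "${a}" -gt "${b}" ]]; then test_says="10 > 9"; else test_says="10 <= 9"; fi

echo "string: ${string_says}"   # string: 10 < 9
echo "arith:  ${arith_says}"    # arith:  10 >= 9
echo "test:   ${test_says}"     # test:   10 > 9
```

So the numeric tests in this article that use [[ ... -lt ... ]] are fine; the thing to avoid is < or > on numbers inside [[ ]].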


This article was tested on 2025-03-15 on Debian 12 (Bash 5.2). All code can be run directly.

Source: https://fg.ink/posts/bash-loops-conditionals/
Author: 风观
Published: 2024-05-01
License: CC BY-NC-SA 4.0