NOD2 Expression Analysis in Zebrafish RNA-seq Replicates

Author: Noah Nicol
Date: March 2025

Overview

This document outlines the RNA-seq analysis process for three zebrafish normoxia samples (SRR19627923, SRR19627924, SRR19627925). The analysis includes data acquisition, quality control, alignment, expression quantification, and visualization, with a specific focus on NOD2 and housekeeping genes for comparison.

The analysis extracts raw gene counts from featureCounts output files, calculates FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) values, and compares expression levels across the three RNA-seq replicates.

Required Software and Packages

Terminal Tools

  1. SRA Toolkit: For downloading FASTQ files
  2. FastQC: For quality control
  3. Trimmomatic: For trimming low-quality reads
  4. HISAT2: For aligning reads to the genome
  5. Samtools: For SAM/BAM file processing
  6. Subread (featureCounts): For expression quantification

Python Packages

Install using pip or conda:

pip install matplotlib seaborn pandas numpy jupyter

RNA-seq Analysis Workflow

1. Data Acquisition

prefetch SRR19627925
fasterq-dump SRR19627925
fastqc SRR19627925_1.fastq SRR19627925_2.fastq

Note: Because these samples are standard bulk RNA-seq reads, I did not deduplicate them (deduplication is generally used to reduce PCR amplification bias). FastQC raised two relevant warnings: ‘Per base sequence content’, which looked as I would expect, and ‘Adapter Content’, which I tried to address but could not, because I was unable to find the specific sequence corresponding to the Illumina Universal Adapter.

Fig 1. Per base sequence content.
Fig 2. Adapter Content.

trimmomatic PE SRR19627925_1.fastq SRR19627925_2.fastq \
  SRR19627925_1_paired_trimmed.fastq SRR19627925_1_unpaired_trimmed.fastq \
  SRR19627925_2_paired_trimmed.fastq SRR19627925_2_unpaired_trimmed.fastq \
  AVGQUAL:20 TRAILING:20 MINLEN:50
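The commands above are shown for SRR19627925 only. As a minimal sketch of how the same steps could be generated for all three accessions, the helper below builds the argument lists without running anything. The function name build_commands is hypothetical; the tool names, file names, and trimming parameters are copied from the commands above.

```python
import shlex

# The three normoxia samples named in the Overview.
ACCESSIONS = ["SRR19627923", "SRR19627924", "SRR19627925"]


def build_commands(acc):
    """Return the acquisition, QC, and trimming commands for one accession.

    Hypothetical helper: it only builds argument lists. To execute them,
    each list could be passed to subprocess.run(cmd, check=True).
    """
    r1, r2 = f"{acc}_1.fastq", f"{acc}_2.fastq"
    return [
        ["prefetch", acc],
        ["fasterq-dump", acc],
        ["fastqc", r1, r2],
        ["trimmomatic", "PE", r1, r2,
         f"{acc}_1_paired_trimmed.fastq", f"{acc}_1_unpaired_trimmed.fastq",
         f"{acc}_2_paired_trimmed.fastq", f"{acc}_2_unpaired_trimmed.fastq",
         "AVGQUAL:20", "TRAILING:20", "MINLEN:50"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running.
    for acc in ACCESSIONS:
        for cmd in build_commands(acc):
            print(shlex.join(cmd))
```

Building the commands as lists keeps the per-sample file naming in one place, so the alignment and counting steps can rely on the same names.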

2. Genome Alignment

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.6_GRCz11/GCF_000002035.6_GRCz11_genomic.fna.gz
gunzip GCF_000002035.6_GRCz11_genomic.fna.gz
hisat2-build GCF_000002035.6_GRCz11_genomic.fna zebrafish_index
hisat2 -x zebrafish_index \
  -1 SRR19627925_1_paired_trimmed.fastq \
  -2 SRR19627925_2_paired_trimmed.fastq \
  -S SRR19627925.sam \
  --summary-file SRR19627925_alignment_summary.txt
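To compare alignment quality across replicates, the summary file written by --summary-file can be read programmatically. The sketch below assumes HISAT2's default text summary, whose final line has the form "<N>% overall alignment rate"; the function name and the example text are hypothetical and are not output from this analysis.

```python
import re


def overall_alignment_rate(summary_text):
    """Return the overall alignment rate as a float percentage, or None.

    Assumes the default HISAT2 summary format, in which one line reads
    '<N>% overall alignment rate'.
    """
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Made-up summary text, for illustration only.
    example = "1000 reads; of these:\n95.50% overall alignment rate\n"
    print(overall_alignment_rate(example))
```

In practice the text would come from a file such as SRR19627925_alignment_summary.txt, read with open(...).read().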

3. Post-alignment Processing

samtools view -bS SRR19627925.sam -o SRR19627925.bam
samtools sort SRR19627925.bam -o SRR19627925_sorted.bam
samtools index SRR19627925_sorted.bam

4. Expression Quantification

featureCounts -p -t exon -g gene_id \
  -a genomic.gtf -o gene_counts.txt SRR19627925_sorted.bam

Analysis Scripts

After generating the gene count files for each replicate, I created custom scripts to extract and analyze the expression data:

  1. get_gene_counts.py: Extracts specific raw gene counts from the featureCounts output files
python get_gene_counts.py
  2. calculate_fpkm.py: Calculates FPKM values from gene counts and creates comparison tables with visualizations
python calculate_fpkm.py
  3. NOD2_expression.ipynb: Jupyter notebook for exploring and visualizing the results
jupyter notebook NOD2_expression.ipynb
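The scripts themselves are not reproduced here. As a rough sketch of what the extraction step could look like, the function below reads a featureCounts table (tab-separated, with a leading "#" comment line and a header row beginning Geneid, Chr, Start, End, Strand, Length) and returns the length and count for selected genes. The function name and the example rows are hypothetical, and the sketch assumes a single BAM file per run, so the count sits in the last column.

```python
import csv
import io


def read_feature_counts(handle, genes):
    """Return {gene_id: (length, count)} for the requested gene IDs.

    Assumes one BAM file per featureCounts run, so the count is the last
    column of each row.
    """
    wanted = set(genes)
    # Skip the '#' comment line that featureCounts writes at the top.
    rows = csv.reader(
        (line for line in handle if not line.startswith("#")),
        delimiter="\t",
    )
    header = next(rows)
    length_col = header.index("Length")
    count_col = len(header) - 1
    result = {}
    for row in rows:
        if row and row[0] in wanted:
            result[row[0]] = (int(row[length_col]), int(row[count_col]))
    return result


# Made-up table for illustration; only the column layout matters here.
EXAMPLE = (
    "# Program:featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tSRR19627925_sorted.bam\n"
    "nod2\tchrN\t1\t4561\t+\t4561\t220\n"
    "il6\tchrN\t1\t900\t+\t900\t0\n"
)

if __name__ == "__main__":
    print(read_feature_counts(io.StringIO(EXAMPLE), ["nod2"]))
```

For a real run, the handle would be an open file such as gene_counts.txt rather than an in-memory string.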

Genes Analyzed

The analysis focused on the following genes:

  1. Housekeeping genes:
  2. Target gene: nod2

FPKM Calculation Method

FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) values were calculated using the standard formula:

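# Example FPKM calculation for NOD2
NOD2reads = 220                  # fragments assigned to NOD2
length_bases = 4561              # gene length in bases
length_kb = length_bases / 1000  # Convert to kilobases
total_mapped = 29743387          # library size used for normalization

# FPKM formula: fragments per kilobase of gene, per million fragments
# (1e6, not 1e9, because the length is already in kilobases)
fpkm = (NOD2reads * 1e6) / (length_kb * total_mapped)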
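Written out, the standard definition is FPKM = fragments × 10^9 / (gene length in bases × total mapped fragments), which is equivalent to dividing the fragments by the length in kilobases and by the library size in millions. A minimal sketch of this as a reusable function follows, using the NOD2 numbers quoted above; the function name and argument names are illustrative and are not taken from calculate_fpkm.py.

```python
def fpkm(fragments, length_bases, total_mapped):
    """Fragments per kilobase of gene per million mapped fragments.

    fragments     -- fragments assigned to the gene
    length_bases  -- gene length in bases
    total_mapped  -- library size (total mapped fragments)
    """
    if length_bases <= 0 or total_mapped <= 0:
        raise ValueError("length_bases and total_mapped must be positive")
    return fragments * 1e9 / (length_bases * total_mapped)


if __name__ == "__main__":
    # NOD2 in one replicate: 220 fragments, 4561 bases, 29,743,387 fragments.
    print(round(fpkm(220, 4561, 29743387), 2))  # prints 1.62
```

The value for this single replicate (about 1.62) is in the same range as the mean across replicates reported in the summary table.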

Expression Level Categories

The genes were categorized based on their FPKM values:

  - Very High: FPKM > 100
  - High: 10 < FPKM ≤ 100
  - Moderate: 1 < FPKM ≤ 10
  - Low: FPKM ≤ 1
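These thresholds translate directly into a small lookup. The sketch below is one way to encode them; the function name expression_level is hypothetical, and the boundary handling follows the inequalities stated above (a value of exactly 100 is High, exactly 10 is Moderate, exactly 1 is Low).

```python
def expression_level(fpkm_value):
    """Map an FPKM value to the expression category used in this report."""
    if fpkm_value > 100:
        return "Very High"
    if fpkm_value > 10:
        return "High"
    if fpkm_value > 1:
        return "Moderate"
    return "Low"


if __name__ == "__main__":
    # Mean FPKM values taken from the Expression Summary Table.
    for gene, value in [("actb2", 3221.67), ("sqstm1", 81.74),
                        ("nod2", 1.88), ("gapdh", 0.02)]:
        print(gene, expression_level(value))
```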

Results and Analysis

Key Findings

  1. NOD2 Expression:
  2. Gene Expression Patterns:
  3. Replicate Consistency:

Expression Summary Table

Gene        Mean FPKM    Expression Level
actb2         3221.67    Very High
b2m            767.95    Very High
rpl13a         402.77    Very High
actb1          324.06    Very High
sqstm1          81.74    High
hif1ab          28.79    High
map1lc3b        25.64    High
tbp             21.50    High
map1lc3a        15.93    High
nod2             1.88    Moderate
tnfa             0.23    Low
hif1aa           0.20    Low
il6              0.03    Low
il1b             0.03    Low
gapdh            0.02    Low

Data Visualization

The analysis included several visualizations:

  - Bar plots of FPKM values across genes
  - Heatmap showing expression patterns across replicates
  - Coefficient of variation analysis for replicate consistency
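The coefficient of variation (CV) used for replicate consistency is the standard deviation divided by the mean, usually expressed as a percentage. The sketch below assumes the sample standard deviation (n - 1 denominator); the report does not state which variant was used, and the replicate values in the example are made up for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent for a gene's FPKM values across replicates.

    Uses the sample standard deviation (an assumption). Returns NaN when
    the mean is zero, where the CV is undefined.
    """
    m = mean(values)
    if m == 0:
        return float("nan")
    return stdev(values) / m * 100


if __name__ == "__main__":
    # Hypothetical FPKM values for one gene in three replicates.
    print(coefficient_of_variation([1.0, 2.0, 3.0]))
```

A low CV means the three replicates agree closely for that gene. Genes with very low FPKM tend to show inflated CVs because small count differences are large relative to the mean.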

Figure: Gene Expression Comparison
Figure: Expression Heatmap

Conclusion

This analysis characterizes NOD2 expression in zebrafish under normoxic conditions. The moderate expression level of NOD2 (mean FPKM ~1.9) shows that the gene is transcribed at a basal level in the ZFL line under normoxia, which is consistent with a functional role but does not by itself demonstrate one.

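The expression patterns observed largely align with expectations: most housekeeping genes are highly expressed (gapdh, at a mean FPKM of 0.02, is the exception), inflammatory genes (tnfa, il6, il1b) show low expression in normoxia, and autophagy markers (sqstm1, map1lc3a, map1lc3b) show basal expression.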
