Prokash21/DeepFoldChange

🧬 DeepFoldChange

A Deep Learning Framework that Emulates Statistical Models for Differential Gene Expression Analysis


🌍 Background and Motivation

Traditional RNA-seq differential expression (DE) analysis relies on statistical tools such as DESeq2, edgeR, and limma-voom.
While these methods are robust, they can be computationally intensive, require repeated normalization, and are not easily generalizable across datasets.

DeepFoldChange offers a different approach: a deep neural network (DNN) framework that learns to imitate the statistical inference process used in DESeq2.
By training on DESeq2-derived log2FoldChange (LFC) values and expression count matrices, DeepFoldChange predicts fold-changes for unseen genes, serving as an AI-driven surrogate model for DE analysis.


⚡ Key Features

  • 🧠 Deep Neural Emulation: Learns DESeq2-style fold-change behavior directly from count matrices.
  • 🧩 Lightweight & Scalable: Uses PCA-compressed expression features for faster training.
  • 📊 Model Evaluation Suite: Automatically generates regression, error, and agreement plots.
  • 💻 Reproducible Pipeline: Compatible with any RNA-seq dataset that includes count and LFC tables.

📁 Repository Structure

DeepFold/
│
├── train_predict_lfc_dnn.py           # Main training + prediction script
├── plot_deepfold_performance.py       # Performance plotting utility
├── requirements.txt                   # Package dependencies
├── lfc_dnn_outputs/                   # Folder containing model outputs
│   ├── holdout_predictions.csv
│   ├── all_predictions_fullfit.csv
│   ├── pca_dnn_lfc_predictor.pkl
│   ├── metrics.txt
│   ├── plot_true_vs_pred_scatter.png
│   ├── plot_error_distribution.png
│   ├── plot_bland_altman.png
│   ├── plot_cumulative_accuracy.png
│   ├── plot_accuracy_vs_threshold.png
│   └── plot_accuracy_bar.png
└── README.md

🧩 Model Architecture

DeepFoldChange uses a PCA → DNN regression pipeline:

Counts → CPM normalization → log1p → PCA(≤100) →
DNN(hidden: 256→128→64, ReLU, Adam optimizer) → Predicted LFC
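
The pipeline above could be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the repository's actual code: it assumes the DNN is a scikit-learn `MLPRegressor` (the dependency list names scikit-learn but no dedicated deep learning library), that CPM is computed per sample column, and that a `StandardScaler` precedes PCA (the saved model is described as scaler + PCA + DNN). The helper names `cpm_log1p` and `build_lfc_pipeline`, and the `max_iter` and `seed` values, are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cpm_log1p(counts):
    """Counts (genes x samples) -> log1p(CPM), normalizing each sample column."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0, keepdims=True)
    lib_sizes[lib_sizes == 0] = 1.0  # avoid division by zero for empty samples
    return np.log1p(counts / lib_sizes * 1e6)


def build_lfc_pipeline(n_genes, n_samples, max_components=100, seed=0):
    """Scaler -> PCA(<=100) -> MLP with hidden layers 256, 128, 64 (assumed layout)."""
    # PCA cannot keep more components than min(rows, columns) of the fitted data.
    n_components = min(max_components, n_genes, n_samples)
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
        MLPRegressor(
            hidden_layer_sizes=(256, 128, 64),
            activation="relu",
            solver="adam",
            max_iter=300,  # assumed value; the source does not state it
            random_state=seed,
        ),
    )
```

Each gene is one training example whose features are its per-sample expression values, so `build_lfc_pipeline(*X.shape).fit(X, y)` would take `X = cpm_log1p(counts)` and `y` as the DESeq2 LFC per gene.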

📊 Example Results

1️⃣ True vs Predicted LFC

Pearson r = 0.99, Spearman ρ = 1.00, MAE = 0.10.
Predicted fold-changes align almost perfectly with DESeq2 results, indicating that DeepFoldChange reproduces both the magnitude and direction of differential expression.
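
For readers who want to recompute these three metrics from a table of true and predicted LFC values, a minimal sketch using SciPy (listed among the dependencies) might look like this. The function name `lfc_agreement` is hypothetical, and the toy numbers are for illustration only, not results from this repository.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def lfc_agreement(true_lfc, pred_lfc):
    """Return Pearson r, Spearman rho, and mean absolute error for paired LFC values."""
    true_lfc = np.asarray(true_lfc, dtype=float)
    pred_lfc = np.asarray(pred_lfc, dtype=float)
    r, _ = pearsonr(true_lfc, pred_lfc)      # linear correlation
    rho, _ = spearmanr(true_lfc, pred_lfc)   # rank-order agreement
    mae = float(np.mean(np.abs(pred_lfc - true_lfc)))
    return {"pearson_r": float(r), "spearman_rho": float(rho), "mae": mae}


# Toy values for illustration only.
true_lfc = [-2.0, -1.0, 0.5, 1.5, 3.0]
pred_lfc = [-1.9, -1.2, 0.6, 1.4, 3.1]
print(lfc_agreement(true_lfc, pred_lfc))
```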


2️⃣ Prediction Error Distribution

Errors are sharply centered at 0, indicating no systematic bias between predicted and true LFC values.

More than 95 % of genes fall within ±0.5 log₂ fold-change error.


3️⃣ Bland–Altman Agreement Plot

Differences between predicted and true LFCs remain consistent across the entire range of expression changes, indicating uniform agreement and no proportional bias.
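
The quantities behind a Bland–Altman plot can be computed directly. The sketch below uses the conventional definitions (mean difference as bias, bias ± 1.96 SD as limits of agreement) and adds a fitted slope of difference against pair mean as one simple check for proportional bias. It is an illustration with a hypothetical function name, not the code used by `plot_deepfold_performance.py`.

```python
import numpy as np


def bland_altman_stats(true_lfc, pred_lfc):
    """Bias, 95 % limits of agreement, and slope of difference vs. pair mean."""
    true_lfc = np.asarray(true_lfc, dtype=float)
    pred_lfc = np.asarray(pred_lfc, dtype=float)
    diff = pred_lfc - true_lfc             # y-axis of the Bland-Altman plot
    pair_mean = (pred_lfc + true_lfc) / 2  # x-axis of the Bland-Altman plot
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    # A slope near zero suggests the error does not grow with LFC magnitude.
    slope = float(np.polyfit(pair_mean, diff, 1)[0])
    return {
        "bias": bias,
        "loa_low": bias - 1.96 * sd,
        "loa_high": bias + 1.96 * sd,
        "slope": slope,
    }
```

A constant offset shows up in `bias` with a slope near zero, whereas predictions that are systematically stretched or shrunk produce a non-zero slope.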


4️⃣ Cumulative Accuracy Curve

Over 90 % of genes are predicted within |ΔLFC| ≤ 0.5,
showing high fidelity between DeepFoldChange predictions and the DESeq2 estimates.


5️⃣ Accuracy vs. Error Threshold

This curve quantifies model accuracy as a function of tolerance.
For example, at a threshold of ±0.25 LFC, ~80 % of genes are predicted within tolerance.
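
Accuracy at a given tolerance is simply the fraction of genes whose absolute LFC error does not exceed that tolerance. A minimal sketch, with a hypothetical function name and toy values chosen only for illustration:

```python
import numpy as np


def accuracy_at_thresholds(true_lfc, pred_lfc, thresholds):
    """Fraction of genes with |predicted - true| <= t, for each tolerance t."""
    abs_err = np.abs(np.asarray(pred_lfc, dtype=float) - np.asarray(true_lfc, dtype=float))
    return {float(t): float(np.mean(abs_err <= t)) for t in thresholds}


# Toy example: absolute errors are 0.1, 0.2, 0.4 and 0.9.
acc = accuracy_at_thresholds([0, 0, 0, 0], [0.1, 0.2, 0.4, 0.9], [0.25, 0.5, 1.0])
print(acc)  # {0.25: 0.5, 0.5: 0.75, 1.0: 1.0}
```

Evaluating the function over a fine grid of thresholds gives the points of the accuracy curve.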


🧪 Evaluation Summary

| Metric | Value | Interpretation |
| --- | --- | --- |
| Pearson r | 0.99 | Linear correlation between predicted and true LFC |
| Spearman ρ | 1.00 | Rank-order agreement |
| MAE | 0.10 | Average absolute LFC difference |
| 90 % genes | ≤ 0.5 | High-confidence accuracy |

On this example dataset, DeepFoldChange closely reproduces DESeq2's LFC estimates, supporting its use as a deep learning surrogate for DE analysis.


βš™οΈ Installation & Environment Setup

Option 1: Conda (recommended)

conda create -n deepfold_env python=3.10
conda activate deepfold_env
pip install -r requirements.txt

Option 2: Manual install

pip install pandas numpy scikit-learn scipy joblib matplotlib seaborn

🚀 Running the Pipeline

1️⃣ Prepare input files

  • filtered_counts_DEGs.csv – normalized or raw counts (rows = genes, columns = samples)
  • resLFC_p_cut.csv – DESeq2 output containing at least a log2FoldChange column.
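
Before training, it can help to confirm that both tables refer to the same genes. The sketch below is a hypothetical helper (`load_aligned` is not part of the repository) that assumes gene IDs sit in the first column of both CSV files, as they do when DESeq2 results are written with row names. It keeps only genes present in both tables with a non-missing LFC.

```python
import io

import pandas as pd


def load_aligned(counts_file, lfc_file, lfc_col="log2FoldChange"):
    """Load counts and LFC tables and keep genes shared by both (assumed layout)."""
    counts = pd.read_csv(counts_file, index_col=0)  # rows = genes, columns = samples
    lfc = pd.read_csv(lfc_file, index_col=0)
    if lfc_col not in lfc.columns:
        raise ValueError(f"LFC table has no '{lfc_col}' column")
    y = lfc[lfc_col].dropna()                       # drop genes without an LFC estimate
    shared = counts.index.intersection(y.index)
    if shared.empty:
        raise ValueError("No gene IDs are shared between the two tables")
    return counts.loc[shared], y.loc[shared]


# Toy in-memory files for illustration only.
counts_csv = io.StringIO("gene,s1,s2\ng1,10,12\ng2,0,5\ng3,7,9\n")
lfc_csv = io.StringIO("gene,log2FoldChange,padj\ng2,1.5,0.01\ng3,NA,0.5\ng4,-2.0,0.02\n")
X, y = load_aligned(counts_csv, lfc_csv)
print(X.shape, list(y.index))
```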

2️⃣ Train and Predict

python train_predict_lfc_dnn.py \
  --counts filtered_counts_DEGs.csv \
  --lfc resLFC_p_cut.csv \
  --outdir lfc_dnn_outputs

This will:

  • train the PCA + DNN model,
  • evaluate on a hold-out set,
  • fit on all data,
  • and generate predictions + metrics.
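
The four steps above follow a common pattern: evaluate on held-out genes first, then refit on everything and save the final model. The sketch below illustrates that pattern only. It is not the script's code, the 20 % hold-out fraction and the seed are assumptions, `holdout_then_fullfit` is a hypothetical name, and a `Ridge` regressor stands in for the PCA + DNN model to keep the example short.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def holdout_then_fullfit(model, X, y, outdir, test_size=0.2, seed=0):
    """Score a model on held-out genes, then refit on all genes and save it."""
    os.makedirs(outdir, exist_ok=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Hold-out evaluation: this model never sees the test genes.
    holdout_model = clone(model).fit(X_tr, y_tr)
    holdout_mae = mean_absolute_error(y_te, holdout_model.predict(X_te))
    # Final model: a fresh copy fitted on all genes.
    final_model = clone(model).fit(X, y)
    path = os.path.join(outdir, "pca_dnn_lfc_predictor.pkl")
    joblib.dump(final_model, path)
    return holdout_mae, path


# Synthetic data and a stand-in regressor, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0])
with tempfile.TemporaryDirectory() as tmp:
    mae, saved = holdout_then_fullfit(Ridge(alpha=1e-6), X_demo, y_demo, tmp)
    print(f"hold-out MAE: {mae:.2e}")
```

Reporting the hold-out score separately from the full-fit predictions matters because predictions on genes used for fitting tend to look better than predictions on unseen genes.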

3️⃣ Plot performance

python plot_deepfold_performance.py

All figures are saved inside lfc_dnn_outputs/.


📘 Output Files Explained

| File | Description |
| --- | --- |
| holdout_predictions.csv | True vs predicted LFC for test genes |
| all_predictions_fullfit.csv | Predicted LFC for all DEGs |
| pca_dnn_lfc_predictor.pkl | Serialized model (scaler + PCA + DNN) |
| metrics.txt | Summary of R², MAE, and Spearman metrics |
| plot_*.png | Performance and accuracy figures |

🧩 How to Use Trained Model for New Data

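Once trained, you can load the model and predict on new counts. The new table should use the same layout (rows = genes, same sample columns) and the same preprocessing as the training input:

import joblib
import pandas as pd

# Load the serialized model (scaler + PCA + DNN) written by the training script
model = joblib.load("lfc_dnn_outputs/pca_dnn_lfc_predictor.pkl")

# Rows = genes, columns = samples; the first CSV column holds the gene IDs
new_counts = pd.read_csv("your_new_counts.csv", index_col=0)
predicted_lfc = model.predict(new_counts)  # one predicted LFC per gene (row)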

📈 Example Interpretation

DeepFoldChange closely approximates DESeq2-derived fold-changes, with more than 90 % of genes predicted within |ΔLFC| ≤ 0.5.
The framework offers a reproducible, fast, and model-agnostic solution for high-throughput DE prediction using deep learning.


✨ Citation (suggested)

If you use or adapt this repository, please cite:

Debnath, J. P., et al. (2025). DeepFoldChange: A Deep Learning Framework that Emulates Statistical Models for Differential Gene Expression Analysis. GitHub repository: https://github.com/Prokash21/DeepFoldChange

🧠 Author

Joy Prokash Debnath
Department of Biochemistry and Molecular Biology
Shahjalal University of Science and Technology, Sylhet, Bangladesh


🧾 License

This project is distributed under the MIT License – free for academic and research use.

