nf-core · nourmahfel · Nov 13, 2025 · Sep 15, 2025
diff --git a/README.md b/README.md
@@ -1,38 +1,118 @@
-# ![nfcore/test-datasets](docs/images/test-datasets_logo.png)
-Test data to be used for automated testing with the nf-core pipelines
+# 🧬 Rare Disease Test Datasets
 
-> ⚠️ **Do not merge your test data to `master`! Each pipeline has a dedicated branch (and a special one for modules)**
+This repository provides **subsampled long-read sequencing datasets** derived from HG002, designed for testing and validating the long-read rare disease analysis pipeline [nf-core/longraredisease](https://github.com/nf-core/longraredisease).
 
-## Introduction
+All datasets are restricted to **chromosome 22 (first 50 Mb)** to minimise file sizes and speed up automated test runs.
 
-nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure.
+---
 
-The principle for nf-core test data is as small as possible, as large as necessary. Please see the [guidelines](https://nf-co.re/docs/contributing/test_data_guidelines) for more detailed information. Always ask for guidance on the [nf-core slack](https://nf-co.re/join) before adding new test data.
+## 📂 Repository Contents
 
-## Documentation
+| Folder / File       | Description                                                                                |
+| ------------------- | ------------------------------------------------------------------------------------------ |
+| `ubam_file/`        | Subsampled **unmapped BAM** files (uBAMs) for testing variant calling from unaligned data. |
+| `fastq_file/`       | Subsampled **FASTQ** file generated from HG002 basecalled reads.                           |
+| `spectre/`          | Example **VCF** and **BED** files for CNV detection testing with _Spectre_.                |
+| `straglr/`          | **STR test regions** (chromosome 22) for _STRaglr_ validation.                             |
+| `hificnv/`          | **Exclude BED** regions used for chromosome 22 CNV benchmarking.                           |
+| `reference/`        | Reduced **human genome reference**, containing only chromosome 22 (GRCh38).                |
+| `samplesheet_*.csv` | Example **sample metadata** for automated pipeline test runs.                              |
 
-nf-core/test-datasets comes with documentation in the `docs/` directory:
+---
 
-01. [Add a new  test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md)
-02. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md)
+## 🧪 Sample Overview
 
-## Downloading test data
+| Column                                    | Description                           |
+| ----------------------------------------- | ------------------------------------- |
+| `sample_id`                               | Unique identifier for the test sample |
+| `input_type`                              | Input data type (FASTQ, BAM, etc.)    |
+| `file_path`                               | Direct download link to test data     |
+| `hpo_terms`                               | Associated HPO phenotype terms        |
+| `sex`                                     | Biological sex                        |
+| `family_id`, `maternal_id`, `paternal_id` | Family metadata                       |
 
-Due the large number of large files in this repository for each pipeline, we highly recommend cloning only the branches you would use.
+Example entry:
+
+```csv
+sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id
+test,fastq,https://raw.githubusercontent.com/nourmahfel/test-datasets/longraredisease/fastq_file/hg002_subset.fastq.gz,HP:0002721;HP:0002110;HP:0500093;HP:0000717;HP:0001263;HP:0001763;HP:0003298;HP:0002857;HP:0001382,F,family_21,null,null
+```
+
+---
+
+## ⚙️ Usage
+
+These datasets are intended for **automated pipeline testing**. They allow quick validation of the full _long-read rare disease analysis_ workflow, from unaligned reads through to variant calling and annotation.
+
+Example Nextflow test run:
 
 ```bash
-git clone <url> --single-branch --branch <pipeline/modules/branch_name>
+nextflow run nf-core/longraredisease -profile test,docker
 ```
 
-To subsequently clone other branches[^1]
+The repository includes a `test.config` file containing preset paths and parameters used for CI and development validation.
+
+---
+
+## 🧩 Data Generation Workflow
+
+The following steps describe how each dataset was created from **HG002 data**.
+
+### 1️⃣ Extract 50 Mb region from chromosome 22
+
+A compact subset was created to minimise storage and runtime while preserving data realism:
 
 ```bash
-git remote set-branches --add origin [remote-branch]
-git fetch
+samtools view -b calls.sorted.bam chr22:1-50000000 > chr22_50mb.bam
+samtools index chr22_50mb.bam
 ```
 
-## Support
+This produced `chr22_50mb.bam`, a BAM file restricted to the first **50 Mb** of chromosome 22.
+
+---
+
+### 2️⃣ Generate FASTQ file
+
+The BAM file was converted to FASTQ format to simulate basecalled reads:
+
+```bash
+samtools fastq chr22_50mb.bam > hg002_subset.fastq
+gzip hg002_subset.fastq
+```
+
+Output:
+
+- `hg002_subset.fastq.gz` → FASTQ dataset for testing pipeline entry from raw reads.
+
+---
+
+### 3️⃣ Create unmapped BAM (uBAM)
+
+To test the alignment and variant calling stages from unaligned data, an **unmapped BAM** version was generated:
+
+```bash
+samtools reset -o hg002_subset.ubam chr22_50mb.bam  # assumes a samtools build that includes the `reset` subcommand
+```
+
+This file retains read names, qualities, and tags but removes alignment fields (RNAME, POS, CIGAR, etc.).
+
+Output:
+
+- `hg002_subset.ubam` → unaligned BAM file suitable for pipeline tests starting from mapping.
+
+---
+
+## 📦 Summary of Derived Outputs
+
+| File                    | Description                        | Source           |
+| ----------------------- | ---------------------------------- | ---------------- |
+| `hg002_subset.fastq.gz` | Subsampled FASTQ (50 Mb region)    | `chr22_50mb.bam` |
+| `hg002_subset.ubam`     | Unmapped BAM for alignment testing | `chr22_50mb.bam` |
+| `reference/chr22.fasta` | Reduced genome reference           | GRCh38           |
+
+---
 
-For further information or help, don't hesitate to get in touch on our [Slack organisation](https://nf-co.re/join/slack) (a tool for instant messaging).
+## 📄 License and Attribution
 
-[^1]: From [stackoverflow](https://stackoverflow.com/a/60846265/11502856)
+Data derived from **HG002** (Genome in a Bottle Consortium).  
+Please cite **GIAB** and relevant tools when reusing or redistributing these datasets.
diff --git a/fastq_file/hg002_subset.fastq.gz b/fastq_file/hg002_subset.fastq.gz