Merged
118 changes: 99 additions & 19 deletions README.md
@@ -1,38 +1,118 @@
# ![nfcore/test-datasets](docs/images/test-datasets_logo.png)
Test data to be used for automated testing with the nf-core pipelines
# 🧬 Rare Disease Test Datasets

> ⚠️ **Do not merge your test data to `master`! Each pipeline has a dedicated branch (and a special one for modules)**
This repository provides **subsampled long-read sequencing datasets** derived from HG002, designed for testing and validating long-read rare disease analysis pipelines such as [nf-core/longraredisease](https://github.com/nf-core/longraredisease).

## Introduction
All datasets are restricted to **chromosome 22 (first 50 Mb)** to minimise file sizes and speed up automated test runs.

nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure.
---

The principle for nf-core test data is as small as possible, as large as necessary. Please see the [guidelines](https://nf-co.re/docs/contributing/test_data_guidelines) for more detailed information. Always ask for guidance on the [nf-core slack](https://nf-co.re/join) before adding new test data.
## 📂 Repository Contents

## Documentation
| Folder / File | Description |
| ------------------- | ------------------------------------------------------------------------------------------ |
| `ubam_file/` | Subsampled **unmapped BAM** files (uBAMs) for testing variant calling from unaligned data. |
| `fastq_file/` | Subsampled **FASTQ** file generated from HG002 basecalled reads. |
| `spectre/` | Example **VCF** and **BED** files for CNV detection testing with _Spectre_. |
| `straglr/` | **STR test regions** (chromosome 22) for _STRaglr_ validation. |
| `hificnv/` | **Exclude BED** regions used for chromosome 22 CNV benchmarking. |
| `reference/` | Reduced **human genome reference**, containing only chromosome 22 (GRCh38). |
| `samplesheet_*.csv` | Example **sample metadata** for automated pipeline test runs. |

nf-core/test-datasets comes with documentation in the `docs/` directory:
---

01. [Add a new test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md)
02. [Use an existing test dataset](https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md)
## 🧪 Sample Overview

## Downloading test data
| Column | Description |
| ----------------------------------------- | ------------------------------------- |
| `sample_id` | Unique identifier for the test sample |
| `input_type` | Input data type (FASTQ, BAM, etc.) |
| `file_path` | Direct download link to test data |
| `hpo_terms` | Associated HPO phenotype terms |
| `sex` | Biological sex |
| `family_id`, `maternal_id`, `paternal_id` | Family metadata |

Due to the large number of large files in this repository for each pipeline, we highly recommend cloning only the branches you would use.
Example entry:

```
sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id
test,fastq,https://raw.githubusercontent.com/nourmahfel/test-datasets/longraredisease/fastq_file/hg002_subset.fastq.gz,HP:0002721;HP:0002110;HP:0500093;HP:0000717;HP:0001263;HP:0001763;HP:0003298;HP:0002857;HP:0001382,F,family_21,null,null
```
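As a rough illustration of how the `hpo_terms` column can be consumed, the sketch below splits the semicolon-separated terms into one line per sample and term. The function name `list_hpo_terms` and the shortened demo row (including its placeholder path) are hypothetical, and the parsing assumes that no field contains an embedded comma, which holds for the example entry above.

```bash
#!/usr/bin/env bash
# Sketch only: list_hpo_terms is a hypothetical helper, not part of the pipeline.
# Reads a samplesheet CSV on stdin and prints "sample_id<TAB>HPO term" lines.
# Assumes no field contains an embedded comma.
list_hpo_terms() {
  awk -F',' '
    NR == 1 {
      # Locate columns by header name rather than by fixed position.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      n = split($(col["hpo_terms"]), terms, ";")
      for (j = 1; j <= n; j++) print $(col["sample_id"]) "\t" terms[j]
    }
  '
}

# Demo on a shortened, illustrative row (the file path is a placeholder).
printf '%s\n' \
  'sample_id,input_type,file_path,hpo_terms,sex,family_id,maternal_id,paternal_id' \
  'test,fastq,/path/to/reads.fastq.gz,HP:0002721;HP:0002110,F,family_21,null,null' \
  | list_hpo_terms
```

Looking columns up by header name keeps the helper working if the column order in a samplesheet changes.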

---

## ⚙️ Usage

These datasets are intended for **automated pipeline testing**, enabling quick validation of the full _long-read rare disease analysis_ workflow — from unaligned reads through to variant calling and annotation.

Example Nextflow test run:

```bash
git clone <url> --single-branch --branch <pipeline/modules/branch_name>
nextflow run nf-core/nanoraredx -profile test,docker
```

To subsequently clone other branches[^1]
The repository includes a `test.config` file containing preset paths and parameters used for CI and development validation.

---

## 🧩 Data Generation Workflow

The following steps describe how each dataset was created from **HG002 data**.

### 1️⃣ Extract 50 Mb region from chromosome 22

A compact subset was created to minimise storage and runtime while preserving data realism:

```bash
git remote set-branches --add origin [remote-branch]
git fetch
samtools view -b calls.sorted.bam chr22:1-50000000 > chr22_50mb.bam
samtools index chr22_50mb.bam
```
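One way to sanity-check a region extraction like this is to confirm that every record starts on chromosome 22 within the first 50 Mb. The sketch below does that on SAM text read from stdin; the function name `check_region` and the demo records are made up for illustration, and the check only inspects the RNAME and POS columns.

```bash
#!/usr/bin/env bash
# Sketch only: check_region is a hypothetical helper.
# Reads SAM records (no header lines) on stdin and succeeds only if every
# record has RNAME chr22 and a 1-based POS between 1 and 50000000.
check_region() {
  awk -F'\t' -v chrom="chr22" -v max_pos=50000000 '
    $3 != chrom || $4 < 1 || $4 > max_pos { bad++ }
    END {
      printf "%d records checked, %d outside %s:1-%d\n", NR, bad, chrom, max_pos
      exit (bad > 0)
    }
  '
}

# Demo on two invented SAM records that both start inside the region.
printf 'read1\t0\tchr22\t16050000\t60\t4M\t*\t0\t0\tACGT\tIIII\n' > /tmp/check_region_demo.sam
printf 'read2\t16\tchr22\t49999990\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> /tmp/check_region_demo.sam
check_region < /tmp/check_region_demo.sam
rm -f /tmp/check_region_demo.sam
```

With samtools installed, the same check could be applied to the real file with `samtools view chr22_50mb.bam | check_region`. Reads that start before position 50,000,000 but extend past it still pass, because only the start position is tested.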

## Support
This produced a BAM file (`chr22_50mb.bam`) containing the reads that overlap the first **50 Mb** of chromosome 22.

---

### 2️⃣ Generate FASTQ file

The BAM file was converted to FASTQ format to simulate basecalled reads:

```bash
samtools fastq chr22_50mb.bam > hg002_subset.fastq
gzip hg002_subset.fastq
```

Output:

- `hg002_subset.fastq.gz` → FASTQ dataset for testing pipeline entry from raw reads.
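
A quick integrity check on a converted FASTQ is to confirm that it follows the four-line record layout and to count its reads. The sketch below is one possible way to do this; `count_fastq_reads` is a hypothetical helper name, and it assumes unwrapped sequence lines (four lines per record).

```bash
#!/usr/bin/env bash
# Sketch only: count_fastq_reads is a hypothetical helper.
# Reads uncompressed FASTQ on stdin and prints the number of reads.
# Assumes the common 4-line-per-record layout (no wrapped sequence lines).
count_fastq_reads() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header line
    NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator line
    END {
      if (bad || NR % 4 != 0) { print "malformed FASTQ" > "/dev/stderr"; exit 1 }
      printf "%d\n", NR / 4
    }
  '
}

# Demo on two invented reads.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' | count_fastq_reads
```

For the compressed file, the input could be streamed with `gzip -dc hg002_subset.fastq.gz | count_fastq_reads`.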

---

### 3️⃣ Create unmapped BAM (uBAM)

To test the alignment and variant calling stages from unaligned data, an **unmapped BAM** version was generated:

```bash
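# Keep the header lines and only those records whose RNAME is "*" (no reference assigned), then write BAM.
samtools view -h chr22_50mb.bam | awk '$3=="*" || /^@/' | samtools view -b -o hg002_subset.ubam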
```

The `awk` filter keeps the header lines and any record whose RNAME field is `*`. Retained records keep their read names, qualities, and tags and carry no alignment information (RNAME, POS, CIGAR, etc.).

Output:

- `hg002_subset.ubam` → unaligned BAM file suitable for pipeline tests starting from mapping.
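
To confirm that a file really is unaligned, one option is to count records by the unmapped bit (0x4) of the SAM FLAG field. The sketch below works on SAM text from stdin; `summarise_mapping` is a hypothetical helper name and the demo records are invented.

```bash
#!/usr/bin/env bash
# Sketch only: summarise_mapping is a hypothetical helper.
# Reads SAM records (no header lines) on stdin and prints
# "<unmapped> <mapped>", where unmapped means FLAG bit 0x4 is set.
summarise_mapping() {
  awk -F'\t' '
    # int(FLAG / 4) % 2 extracts bit 0x4 without relying on gawk-only and().
    { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
    END { printf "%d %d\n", unmapped, mapped }
  '
}

# Demo: one unmapped record (FLAG 4) and one mapped record (FLAG 0).
{
  printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t0\tchr22\t16050000\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} | summarise_mapping
```

With samtools installed, `samtools view hg002_subset.ubam | summarise_mapping` should report zero mapped records for a true uBAM.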

---

## 📦 Summary of Derived Outputs

| File | Description | Source |
| ----------------------- | ---------------------------------- | ---------------- |
| `hg002_subset.fastq.gz` | Subsampled FASTQ (50 Mb region) | `chr22_50mb.bam` |
| `hg002_subset.ubam` | Unmapped BAM for alignment testing | `chr22_50mb.bam` |
| `reference/chr22.fasta` | Reduced genome reference | GRCh38 |

---

For further information or help, don't hesitate to get in touch on our [Slack organisation](https://nf-co.re/join/slack) (a tool for instant messaging).
## 📄 License and Attribution

[^1]: From [stackoverflow](https://stackoverflow.com/a/60846265/11502856)
Data derived from **HG002** (Genome in a Bottle Consortium).
Please cite **GIAB** and relevant tools when reusing or redistributing these datasets.
Binary file added fastq_file/hg002_subset.fastq.gz
Binary file not shown.