From 029605b350b65c7f6f85396694323746f70e3a8a Mon Sep 17 00:00:00 2001 From: Trevor Manz Date: Thu, 7 Nov 2024 20:34:25 -0500 Subject: [PATCH 1/3] Upgrade notebook examples with inline metadata --- notebooks/getting-started.ipynb | 487 +++++++++++++++++--------------- pyproject.toml | 2 - uv.lock | 20 +- 3 files changed, 254 insertions(+), 255 deletions(-) diff --git a/notebooks/getting-started.ipynb b/notebooks/getting-started.ipynb index 8aa167c..62b1c8c 100644 --- a/notebooks/getting-started.ipynb +++ b/notebooks/getting-started.ipynb @@ -1,237 +1,256 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "8efc6d60-f207-4e54-92b0-a6070b0158b4", - "metadata": {}, - "source": [ - "# Getting Started\n", - "\n", - "In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and (b) two aligned embeddings of _different_ data.\n", - "\n", - "The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents a individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression." 
- ] + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python" + }, + "language_info": { + "name": "python", + "version": "3.11.10", + "codemirror_mode": { + "name": "ipython", + "version": 3 + } + }, + "widgets": {} }, - { - "cell_type": "code", - "execution_count": null, - "id": "47c31bea-24b3-4d16-a69a-a3ad3a746234", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "from cev.widgets import Embedding, EmbeddingComparisonWidget" - ] - }, - { - "cell_type": "markdown", - "id": "dea71d70-e467-49af-9165-6e278f953977", - "metadata": {}, - "source": [ - "The notebook requires downloading the three embeddings from data of from [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w):\n", - "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/)\n", - "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", - "- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", - "\n", - "All three embeddings are annotated with [Ozette's FAUST method](https://doi.org/10.1016/j.patter.2021.100372)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dbf802bc-f709-4163-9b49-8fa5f6ce59ab", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# download the data\n", - "!curl -sL https://figshare.com/ndownloader/articles/23063615/versions/1 -o data.zip\n", - "!unzip data.zip -d data" - ] - }, - { - "cell_type": "markdown", - "id": "e62390d2-1242-49a8-9780-be976d39fa42", - "metadata": { - "tags": [] - }, - "source": [ - "## Comparing Two Embeddings of the same Data\n", - "\n", - "In the first example, we are going to use `cev` to compare two different embeddings methods that were run on the very same data (the tissue sample): standard UMAP and annotation transformation UMAP.\n", - "\n", - "Different embedding methods can produce very different embedding spaces and it's often hard to assess the difference wholelistically. `cev` enables us to quantify two properties based on shared point labels:\n", - "\n", - "1. Confusion: the degree to which two or more labels are visually intermixed\n", - "2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings\n", - "\n", - "Visualized as a heatmap, these two property can quickly guide us to point clusters that are better or less resolved in either one of the two embeddings. It can also help us find compositional changes between the two embeddings." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7874813c-810f-40e5-92ab-91f228046a5e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tissue_umap_embedding = Embedding.from_ozette(\n", - " df=pd.read_parquet(\"./data/mair-2022-tissue-138-umap.pq\")\n", - ")\n", - "tissue_ozette_embedding = Embedding.from_ozette(\n", - " df=pd.read_parquet(\"./data/mair-2022-tissue-138-ozette.pq\")\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c3d7e114-9fd3-4785-bdca-e3f4bbf37df8", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "umap_vs_ozette = EmbeddingComparisonWidget(\n", - " tissue_umap_embedding,\n", - " tissue_ozette_embedding,\n", - " titles=[\"Standard UMAP (Tissue)\", \"Annotation-Transformed UMAP (Tissue)\"],\n", - " metric=\"confusion\",\n", - " selection=\"synced\",\n", - " auto_zoom=True,\n", - " row_height=320,\n", - ")\n", - "umap_vs_ozette" - ] - }, - { - "cell_type": "markdown", - "id": "a516d65a-351b-4365-a267-704cd93a9c0e", - "metadata": {}, - "source": [ - "In this example, we can see that the point labels are much more intermixed in the standard UMAP embedding compared to the annotation transformation UMAP. This not surprising as the standard UMAP embedding is not optimized for Flow cytometry data in any way and is thus only resolving broad cell phenotypes based on a few markers. You can see this by holding down `SHIFT` and clicking on `CD8` under _Markers_, which reduces the label resolution and shows that under a reduced label resolution, the confusion is much lower in the standard UMAP embedding.\n", - "\n", - "When selecting _Neighborhood_ from the _Metric_ drop down menu, we switch to the neighborhood composition difference quantification. When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhood remain unchanged. 
When we gradually add more markers, we can see how the the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.\n", - "\n", - "To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection or you can programmatically select points by their label via the `select()`. For instance, the next call will select all CD4+ T cells." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba7a378f-4212-4953-be5b-7a273f8bc75e", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "umap_vs_ozette.select([\"CD3+\", \"CD4+\", \"CD8-\"])" - ] - }, - { - "cell_type": "markdown", - "id": "3c439e4d-0679-4e64-a1c7-4be93cbbe039", - "metadata": {}, - "source": [ - "## Size Differences Between _Non-Responder_ and _Responder_\n", - "\n", - "Instead of comparing identical data, let's take a look at two transformed and aligned embeddings: tissue vs tumor. The embeddings are both annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric aids in identifying potential shifts in phenotype abundance, providing a comprehensive and visually intuitive method for analyzing complex cytometry data. Remember, our metric should be used as a exploratory tool guide exploration and quickly surface potentially interesting phenotypes, but robust statical methods must be applied to confirm whether any abundance differences exist." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "180f0945-d97c-4261-aa67-5368e3b560ad", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tumor_ozette_embedding = Embedding.from_ozette(\n", - " df=pd.read_parquet(\"./data/mair-2022-tumor-006-ozette.pq\")\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0f99361b-6e96-4a6d-ad65-0533c23bece7", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tissue_vs_tumor = EmbeddingComparisonWidget(\n", - " tissue_ozette_embedding,\n", - " tumor_ozette_embedding,\n", - " titles=[\"Tissue\", \"Tumor\"],\n", - " metric=\"abundance\",\n", - " selection=\"phenotype\",\n", - " auto_zoom=True,\n", - " row_height=320,\n", - ")\n", - "\n", - "tissue_vs_tumor" - ] - }, - { - "cell_type": "markdown", - "id": "6d632c95-dff8-4b90-b763-f3055c4e8047", - "metadata": { - "tags": [] - }, - "source": [ - "The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) compared to `tumor` (i.e., the relative abundance is lower on the right)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2f7ebd73-32e7-48ed-8575-8d14d2edc73f", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "tissue_vs_tumor.select(\n", - " \"CD4-CD8+CD3+CD45RA+CD27+CD19-CD103-CD28-CD69+PD1+HLADR-GranzymeB-CD25-ICOS-TCRgd-CD38-CD127-Tim3-\"\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eefac753-7920-4c87-99ef-d155f1ec5114", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - 
"nbformat": 4, - "nbformat_minor": 5 + "nbformat": 4, + "nbformat_minor": 5, + "cells": [ + { + "cell_type": "code", + "id": "b640237c", + "metadata": { + "jupyter": { + "source_hidden": true, + "outputs_hidden": null + } + }, + "execution_count": null, + "source": [ + "# /// script\n", + "# requires-python = \"==3.12\"\n", + "# dependencies = [\n", + "# \"cev\",\n", + "# \"pooch==1.8.2\",\n", + "# \"pyarrow\",\n", + "# ]\n", + "#\n", + "# [tool.uv.sources]\n", + "# cev = { path = \"../\" }\n", + "# ///" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "8efc6d60-f207-4e54-92b0-a6070b0158b4", + "metadata": {}, + "source": [ + "# Getting Started\n", + "\n", + "In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and (b) two aligned embeddings of _different_ data.\n", + "\n", + "The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents a individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression." 
+ ], + "attachments": null + }, + { + "cell_type": "code", + "id": "47c31bea-24b3-4d16-a69a-a3ad3a746234", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "import zipfile\n", + "import pandas as pd\n", + "import pooch\n", + "\n", + "from cev.widgets import Embedding, EmbeddingComparisonWidget" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "dea71d70-e467-49af-9165-6e278f953977", + "metadata": {}, + "source": [ + "The notebook requires downloading the three embeddings from data of from [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w):\n", + "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/)\n", + "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", + "- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", + "\n", + "All three embeddings are annotated with [Ozette's FAUST method](https://doi.org/10.1016/j.patter.2021.100372)." 
+ ], + "attachments": null + }, + { + "cell_type": "code", + "id": "dbf802bc-f709-4163-9b49-8fa5f6ce59ab", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "archive = pooch.retrieve(\n", + " url=\"https://figshare.com/ndownloader/articles/23063615/versions/1\",\n", + " path=pooch.os_cache(\"cev\"),\n", + " fname=\"data.zip\",\n", + " known_hash=None,\n", + ")" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "e62390d2-1242-49a8-9780-be976d39fa42", + "metadata": { + "tags": [] + }, + "source": [ + "## Comparing Two Embeddings of the same Data\n", + "\n", + "In the first example, we are going to use `cev` to compare two different embeddings methods that were run on the very same data (the tissue sample): standard UMAP and annotation transformation UMAP.\n", + "\n", + "Different embedding methods can produce very different embedding spaces and it's often hard to assess the difference wholelistically. `cev` enables us to quantify two properties based on shared point labels:\n", + "\n", + "1. Confusion: the degree to which two or more labels are visually intermixed\n", + "2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings\n", + "\n", + "Visualized as a heatmap, these two property can quickly guide us to point clusters that are better or less resolved in either one of the two embeddings. It can also help us find compositional changes between the two embeddings." 
+ ], + "attachments": null + }, + { + "cell_type": "code", + "id": "7874813c-810f-40e5-92ab-91f228046a5e", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "with zipfile.ZipFile(archive, \"r\") as z:\n", + " with z.open(\"mair-2022-tissue-138-umap.pq\") as f:\n", + " tissue_umap_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", + " with z.open(\"mair-2022-tissue-138-ozette.pq\") as f:\n", + " tissue_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", + " with z.open(\"mair-2022-tumor-006-ozette.pq\") as f:\n", + " tumor_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)" + ], + "outputs": [] + }, + { + "cell_type": "code", + "id": "c3d7e114-9fd3-4785-bdca-e3f4bbf37df8", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "umap_vs_ozette = EmbeddingComparisonWidget(\n", + " tissue_umap_embedding,\n", + " tissue_ozette_embedding,\n", + " titles=[\"Standard UMAP (Tissue)\", \"Annotation-Transformed UMAP (Tissue)\"],\n", + " metric=\"confusion\",\n", + " selection=\"synced\",\n", + " auto_zoom=True,\n", + " row_height=320,\n", + ")\n", + "umap_vs_ozette" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "a516d65a-351b-4365-a267-704cd93a9c0e", + "metadata": {}, + "source": [ + "In this example, we can see that the point labels are much more intermixed in the standard UMAP embedding compared to the annotation transformation UMAP. This not surprising as the standard UMAP embedding is not optimized for Flow cytometry data in any way and is thus only resolving broad cell phenotypes based on a few markers. 
You can see this by holding down `SHIFT` and clicking on `CD8` under _Markers_, which reduces the label resolution and shows that under a reduced label resolution, the confusion is much lower in the standard UMAP embedding.\n", + "\n", + "When selecting _Neighborhood_ from the _Metric_ drop down menu, we switch to the neighborhood composition difference quantification. When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhood remain unchanged. When we gradually add more markers, we can see how the the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.\n", + "\n", + "To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection or you can programmatically select points by their label via the `select()`. For instance, the next call will select all CD4+ T cells." + ], + "attachments": null + }, + { + "cell_type": "code", + "id": "ba7a378f-4212-4953-be5b-7a273f8bc75e", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "umap_vs_ozette.select([\"CD3+\", \"CD4+\", \"CD8-\"])" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "3c439e4d-0679-4e64-a1c7-4be93cbbe039", + "metadata": {}, + "source": [ + "## Size Differences Between _Non-Responder_ and _Responder_\n", + "\n", + "Instead of comparing identical data, let's take a look at two transformed and aligned embeddings: tissue vs tumor. The embeddings are both annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric aids in identifying potential shifts in phenotype abundance, providing a comprehensive and visually intuitive method for analyzing complex cytometry data. 
Remember, our metric should be used as a exploratory tool guide exploration and quickly surface potentially interesting phenotypes, but robust statical methods must be applied to confirm whether any abundance differences exist." + ], + "attachments": null + }, + { + "cell_type": "code", + "id": "0f99361b-6e96-4a6d-ad65-0533c23bece7", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "tissue_vs_tumor = EmbeddingComparisonWidget(\n", + " tissue_ozette_embedding,\n", + " tumor_ozette_embedding,\n", + " titles=[\"Tissue\", \"Tumor\"],\n", + " metric=\"abundance\",\n", + " selection=\"phenotype\",\n", + " auto_zoom=True,\n", + " row_height=320,\n", + ")\n", + "\n", + "tissue_vs_tumor" + ], + "outputs": [] + }, + { + "cell_type": "markdown", + "id": "6d632c95-dff8-4b90-b763-f3055c4e8047", + "metadata": { + "tags": [] + }, + "source": [ + "The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) compared to `tumor` (i.e., the relative abundance is lower on the right)" + ], + "attachments": null + }, + { + "cell_type": "code", + "id": "2f7ebd73-32e7-48ed-8575-8d14d2edc73f", + "metadata": { + "tags": [] + }, + "execution_count": null, + "source": [ + "tissue_vs_tumor.select(\n", + " \"CD4-CD8+CD3+CD45RA+CD27+CD19-CD103-CD28-CD69+PD1+HLADR-GranzymeB-CD25-ICOS-TCRgd-CD38-CD127-Tim3-\"\n", + ")" + ], + "outputs": [] + }, + { + "cell_type": "code", + "id": "eefac753-7920-4c87-99ef-d155f1ec5114", + "metadata": {}, + "execution_count": null, + "source": [], + "outputs": [] + } + ] } diff --git a/pyproject.toml b/pyproject.toml index 1ec0195..b38e22a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -28,8 +28,6 @@ dependencies = [ "jupyter-scatter>=0.14.0", "pandas>=1.0,<2.0", "numpy>=1.0,<2.0", - "pyarrow", - "pooch>=1.3.0", ] dynamic = ["version"] diff --git a/uv.lock b/uv.lock index 66536cc..01379c6 100644 --- a/uv.lock +++ b/uv.lock @@ -185,7 +185,7 @@ wheels = [ [[package]] name = "cev" 
-version = "0.2.3.dev2+g917526a.d20241017" +version = "0.2.3.dev3+g74791a0.d20241108" source = { editable = "." } dependencies = [ { name = "anywidget" }, @@ -195,8 +195,6 @@ dependencies = [ { name = "jupyter-scatter" }, { name = "numpy" }, { name = "pandas" }, - { name = "pooch" }, - { name = "pyarrow" }, ] [package.optional-dependencies] @@ -224,8 +222,6 @@ requires-dist = [ { name = "matplotlib", marker = "extra == 'notebooks'" }, { name = "numpy", specifier = ">=1.0,<2.0" }, { name = "pandas", specifier = ">=1.0,<2.0" }, - { name = "pooch", specifier = ">=1.3.0" }, - { name = "pyarrow" }, { name = "pyarrow", marker = "extra == 'notebooks'" }, ] @@ -1650,20 +1646,6 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/88/5f/e351af9a41f866ac3f1fac4ca0613908d9a41741cfcf2228f4ad853b697d/pluggy-1.5.0-py3-none-any.whl", hash = "sha256:44e1ad92c8ca002de6377e165f3e0f1be63266ab4d554740532335b9d75ea669", size = 20556 }, ] -[[package]] -name = "pooch" -version = "1.8.2" -source = { registry = "https://pypi.org/simple" } -dependencies = [ - { name = "packaging" }, - { name = "platformdirs" }, - { name = "requests" }, -] -sdist = { url = "https://files.pythonhosted.org/packages/c6/77/b3d3e00c696c16cf99af81ef7b1f5fe73bd2a307abca41bd7605429fe6e5/pooch-1.8.2.tar.gz", hash = "sha256:76561f0de68a01da4df6af38e9955c4c9d1a5c90da73f7e40276a5728ec83d10", size = 59353 } -wheels = [ - { url = "https://files.pythonhosted.org/packages/a8/87/77cc11c7a9ea9fd05503def69e3d18605852cd0d4b0d3b8f15bbeb3ef1d1/pooch-1.8.2-py3-none-any.whl", hash = "sha256:3529a57096f7198778a5ceefd5ac3ef0e4d06a6ddaf9fc2d609b806f25302c47", size = 64574 }, -] - [[package]] name = "prometheus-client" version = "0.21.0" From 7936780a33ee7250f933ad26f34a61f59cd10ddf Mon Sep 17 00:00:00 2001 From: Trevor Manz Date: Thu, 7 Nov 2024 20:50:11 -0500 Subject: [PATCH 2/3] fix notebook --- .gitignore | 1 + notebooks/getting-started.ipynb | 99 +++++++++++---------------------- 2 files changed, 32 insertions(+), 68 
deletions(-) diff --git a/.gitignore b/.gitignore index 2cb6534..8bddda4 100644 --- a/.gitignore +++ b/.gitignore @@ -8,3 +8,4 @@ data/ dist/ mair/ .DS_Store +data.zip diff --git a/notebooks/getting-started.ipynb b/notebooks/getting-started.ipynb index 62b1c8c..b89fdf5 100644 --- a/notebooks/getting-started.ipynb +++ b/notebooks/getting-started.ipynb @@ -1,31 +1,11 @@ { - "metadata": { - "kernelspec": { - "name": "python3", - "language": "python" - }, - "language_info": { - "name": "python", - "version": "3.11.10", - "codemirror_mode": { - "name": "ipython", - "version": 3 - } - }, - "widgets": {} - }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "cell_type": "code", "id": "b640237c", - "metadata": { - "jupyter": { - "source_hidden": true, - "outputs_hidden": null - } - }, + "metadata": {}, "execution_count": null, "source": [ "# /// script\n", @@ -52,15 +32,12 @@ "In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and (b) two aligned embeddings of _different_ data.\n", "\n", "The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents a individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression." - ], - "attachments": null + ] }, { "cell_type": "code", "id": "47c31bea-24b3-4d16-a69a-a3ad3a746234", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "import zipfile\n", @@ -82,15 +59,12 @@ "- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", "\n", "All three embeddings are annotated with [Ozette's FAUST method](https://doi.org/10.1016/j.patter.2021.100372)." 
- ], - "attachments": null + ] }, { "cell_type": "code", "id": "dbf802bc-f709-4163-9b49-8fa5f6ce59ab", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "archive = pooch.retrieve(\n", @@ -105,9 +79,7 @@ { "cell_type": "markdown", "id": "e62390d2-1242-49a8-9780-be976d39fa42", - "metadata": { - "tags": [] - }, + "metadata": {}, "source": [ "## Comparing Two Embeddings of the same Data\n", "\n", @@ -119,15 +91,12 @@ "2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings\n", "\n", "Visualized as a heatmap, these two property can quickly guide us to point clusters that are better or less resolved in either one of the two embeddings. It can also help us find compositional changes between the two embeddings." - ], - "attachments": null + ] }, { "cell_type": "code", "id": "7874813c-810f-40e5-92ab-91f228046a5e", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "with zipfile.ZipFile(archive, \"r\") as z:\n", @@ -143,9 +112,7 @@ { "cell_type": "code", "id": "c3d7e114-9fd3-4785-bdca-e3f4bbf37df8", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "umap_vs_ozette = EmbeddingComparisonWidget(\n", @@ -171,15 +138,12 @@ "When selecting _Neighborhood_ from the _Metric_ drop down menu, we switch to the neighborhood composition difference quantification. When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhood remain unchanged. 
When we gradually add more markers, we can see how the the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.\n", "\n", "To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection or you can programmatically select points by their label via the `select()`. For instance, the next call will select all CD4+ T cells." - ], - "attachments": null + ] }, { "cell_type": "code", "id": "ba7a378f-4212-4953-be5b-7a273f8bc75e", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "umap_vs_ozette.select([\"CD3+\", \"CD4+\", \"CD8-\"])" @@ -194,15 +158,12 @@ "## Size Differences Between _Non-Responder_ and _Responder_\n", "\n", "Instead of comparing identical data, let's take a look at two transformed and aligned embeddings: tissue vs tumor. The embeddings are both annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric aids in identifying potential shifts in phenotype abundance, providing a comprehensive and visually intuitive method for analyzing complex cytometry data. Remember, our metric should be used as a exploratory tool guide exploration and quickly surface potentially interesting phenotypes, but robust statical methods must be applied to confirm whether any abundance differences exist." 
- ], - "attachments": null + ] }, { "cell_type": "code", "id": "0f99361b-6e96-4a6d-ad65-0533c23bece7", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "tissue_vs_tumor = EmbeddingComparisonWidget(\n", @@ -222,20 +183,15 @@ { "cell_type": "markdown", "id": "6d632c95-dff8-4b90-b763-f3055c4e8047", - "metadata": { - "tags": [] - }, + "metadata": {}, "source": [ "The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) compared to `tumor` (i.e., the relative abundance is lower on the right)" - ], - "attachments": null + ] }, { "cell_type": "code", "id": "2f7ebd73-32e7-48ed-8575-8d14d2edc73f", - "metadata": { - "tags": [] - }, + "metadata": {}, "execution_count": null, "source": [ "tissue_vs_tumor.select(\n", @@ -243,14 +199,21 @@ ")" ], "outputs": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3", + "language": "python" }, - { - "cell_type": "code", - "id": "eefac753-7920-4c87-99ef-d155f1ec5114", - "metadata": {}, - "execution_count": null, - "source": [], - "outputs": [] + "language_info": { + "name": "python", + "version": "3.11.10", + "codemirror_mode": { + "name": "ipython", + "version": 3 + } } - ] + } } From 647e376a0021d56d45ec07bfbadef38971afd1d4 Mon Sep 17 00:00:00 2001 From: Trevor Manz Date: Thu, 7 Nov 2024 20:50:31 -0500 Subject: [PATCH 3/3] uv fixes --- notebooks/getting-started.ipynb | 433 ++++++++++++++++---------------- 1 file changed, 217 insertions(+), 216 deletions(-) diff --git a/notebooks/getting-started.ipynb b/notebooks/getting-started.ipynb index b89fdf5..5aa7bd6 100644 --- a/notebooks/getting-started.ipynb +++ b/notebooks/getting-started.ipynb @@ -1,219 +1,220 @@ { - "nbformat": 4, - "nbformat_minor": 5, - "cells": [ - { - "cell_type": "code", - "id": "b640237c", - "metadata": {}, - "execution_count": null, - "source": [ - "# /// script\n", - "# requires-python = \"==3.12\"\n", - "# dependencies = [\n", 
- "# \"cev\",\n", - "# \"pooch==1.8.2\",\n", - "# \"pyarrow\",\n", - "# ]\n", - "#\n", - "# [tool.uv.sources]\n", - "# cev = { path = \"../\" }\n", - "# ///" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "8efc6d60-f207-4e54-92b0-a6070b0158b4", - "metadata": {}, - "source": [ - "# Getting Started\n", - "\n", - "In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and (b) two aligned embeddings of _different_ data.\n", - "\n", - "The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents a individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression." - ] - }, - { - "cell_type": "code", - "id": "47c31bea-24b3-4d16-a69a-a3ad3a746234", - "metadata": {}, - "execution_count": null, - "source": [ - "import zipfile\n", - "import pandas as pd\n", - "import pooch\n", - "\n", - "from cev.widgets import Embedding, EmbeddingComparisonWidget" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "dea71d70-e467-49af-9165-6e278f953977", - "metadata": {}, - "source": [ - "The notebook requires downloading the three embeddings from data of from [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w):\n", - "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/)\n", - "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", - "- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformd with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", - "\n", - "All three embeddings are annotated with [Ozette's FAUST 
method](https://doi.org/10.1016/j.patter.2021.100372)." - ] - }, - { - "cell_type": "code", - "id": "dbf802bc-f709-4163-9b49-8fa5f6ce59ab", - "metadata": {}, - "execution_count": null, - "source": [ - "archive = pooch.retrieve(\n", - " url=\"https://figshare.com/ndownloader/articles/23063615/versions/1\",\n", - " path=pooch.os_cache(\"cev\"),\n", - " fname=\"data.zip\",\n", - " known_hash=None,\n", - ")" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "e62390d2-1242-49a8-9780-be976d39fa42", - "metadata": {}, - "source": [ - "## Comparing Two Embeddings of the same Data\n", - "\n", - "In the first example, we are going to use `cev` to compare two different embeddings methods that were run on the very same data (the tissue sample): standard UMAP and annotation transformation UMAP.\n", - "\n", - "Different embedding methods can produce very different embedding spaces and it's often hard to assess the difference wholelistically. `cev` enables us to quantify two properties based on shared point labels:\n", - "\n", - "1. Confusion: the degree to which two or more labels are visually intermixed\n", - "2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings\n", - "\n", - "Visualized as a heatmap, these two property can quickly guide us to point clusters that are better or less resolved in either one of the two embeddings. It can also help us find compositional changes between the two embeddings." 
- ] - }, - { - "cell_type": "code", - "id": "7874813c-810f-40e5-92ab-91f228046a5e", - "metadata": {}, - "execution_count": null, - "source": [ - "with zipfile.ZipFile(archive, \"r\") as z:\n", - " with z.open(\"mair-2022-tissue-138-umap.pq\") as f:\n", - " tissue_umap_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", - " with z.open(\"mair-2022-tissue-138-ozette.pq\") as f:\n", - " tissue_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", - " with z.open(\"mair-2022-tumor-006-ozette.pq\") as f:\n", - " tumor_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)" - ], - "outputs": [] - }, - { - "cell_type": "code", - "id": "c3d7e114-9fd3-4785-bdca-e3f4bbf37df8", - "metadata": {}, - "execution_count": null, - "source": [ - "umap_vs_ozette = EmbeddingComparisonWidget(\n", - " tissue_umap_embedding,\n", - " tissue_ozette_embedding,\n", - " titles=[\"Standard UMAP (Tissue)\", \"Annotation-Transformed UMAP (Tissue)\"],\n", - " metric=\"confusion\",\n", - " selection=\"synced\",\n", - " auto_zoom=True,\n", - " row_height=320,\n", - ")\n", - "umap_vs_ozette" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "a516d65a-351b-4365-a267-704cd93a9c0e", - "metadata": {}, - "source": [ - "In this example, we can see that the point labels are much more intermixed in the standard UMAP embedding compared to the annotation transformation UMAP. This not surprising as the standard UMAP embedding is not optimized for Flow cytometry data in any way and is thus only resolving broad cell phenotypes based on a few markers. You can see this by holding down `SHIFT` and clicking on `CD8` under _Markers_, which reduces the label resolution and shows that under a reduced label resolution, the confusion is much lower in the standard UMAP embedding.\n", - "\n", - "When selecting _Neighborhood_ from the _Metric_ drop down menu, we switch to the neighborhood composition difference quantification. 
When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhood remain unchanged. When we gradually add more markers, we can see how the the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.\n", - "\n", - "To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection or you can programmatically select points by their label via the `select()`. For instance, the next call will select all CD4+ T cells." - ] - }, - { - "cell_type": "code", - "id": "ba7a378f-4212-4953-be5b-7a273f8bc75e", - "metadata": {}, - "execution_count": null, - "source": [ - "umap_vs_ozette.select([\"CD3+\", \"CD4+\", \"CD8-\"])" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "3c439e4d-0679-4e64-a1c7-4be93cbbe039", - "metadata": {}, - "source": [ - "## Size Differences Between _Non-Responder_ and _Responder_\n", - "\n", - "Instead of comparing identical data, let's take a look at two transformed and aligned embeddings: tissue vs tumor. The embeddings are both annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric aids in identifying potential shifts in phenotype abundance, providing a comprehensive and visually intuitive method for analyzing complex cytometry data. Remember, our metric should be used as a exploratory tool guide exploration and quickly surface potentially interesting phenotypes, but robust statical methods must be applied to confirm whether any abundance differences exist." 
- ] - }, - { - "cell_type": "code", - "id": "0f99361b-6e96-4a6d-ad65-0533c23bece7", - "metadata": {}, - "execution_count": null, - "source": [ - "tissue_vs_tumor = EmbeddingComparisonWidget(\n", - " tissue_ozette_embedding,\n", - " tumor_ozette_embedding,\n", - " titles=[\"Tissue\", \"Tumor\"],\n", - " metric=\"abundance\",\n", - " selection=\"phenotype\",\n", - " auto_zoom=True,\n", - " row_height=320,\n", - ")\n", - "\n", - "tissue_vs_tumor" - ], - "outputs": [] - }, - { - "cell_type": "markdown", - "id": "6d632c95-dff8-4b90-b763-f3055c4e8047", - "metadata": {}, - "source": [ - "The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) compared to `tumor` (i.e., the relative abundance is lower on the right)" - ] - }, - { - "cell_type": "code", - "id": "2f7ebd73-32e7-48ed-8575-8d14d2edc73f", - "metadata": {}, - "execution_count": null, - "source": [ - "tissue_vs_tumor.select(\n", - " \"CD4-CD8+CD3+CD45RA+CD27+CD19-CD103-CD28-CD69+PD1+HLADR-GranzymeB-CD25-ICOS-TCRgd-CD38-CD127-Tim3-\"\n", - ")" - ], - "outputs": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "name": "python3", - "language": "python" - }, - "language_info": { - "name": "python", - "version": "3.11.10", - "codemirror_mode": { - "name": "ipython", - "version": 3 - } - } + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "b640237c", + "metadata": {}, + "outputs": [], + "source": [ + "# /// script\n", + "# requires-python = \"==3.12\"\n", + "# dependencies = [\n", + "# \"cev\",\n", + "# \"pooch==1.8.2\",\n", + "# \"pyarrow\",\n", + "# ]\n", + "#\n", + "# [tool.uv.sources]\n", + "# cev = { path = \"../\" }\n", + "# ///" + ] + }, + { + "cell_type": "markdown", + "id": "8efc6d60-f207-4e54-92b0-a6070b0158b4", + "metadata": {}, + "source": [ + "# Getting Started\n", + "\n", + "In this notebook we're going to demonstrate how to use `cev` to compare (a) two _different_ embeddings of the same data and 
(b) two aligned embeddings of _different_ data.\n", + "\n", + "The embeddings we're exploring in this notebook represent single-cell surface proteomic data. In other words, each data point represents an individual cell whose surface protein expression was measured. The cells were then clustered into cellular phenotypes based on their protein expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47c31bea-24b3-4d16-a69a-a3ad3a746234", + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "\n", + "import pandas as pd\n", + "import pooch\n", + "\n", + "from cev.widgets import Embedding, EmbeddingComparisonWidget" + ] + }, + { + "cell_type": "markdown", + "id": "dea71d70-e467-49af-9165-6e278f953977", + "metadata": {}, + "source": [ + "The notebook requires downloading three embeddings of data from [Mair et al., 2022](https://www.nature.com/articles/s41586-022-04718-w):\n", + "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/)\n", + "- Tissue sample 138 (32 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformed with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", + "- Tumor sample 6 (82 MB) embedded with [UMAP](https://umap-learn.readthedocs.io/en/latest/) after being transformed with [Ozette's Annotation Transformation](https://github.com/flekschas-ozette/ismb-biovis-2022)\n", + "\n", + "All three embeddings are annotated with [Ozette's FAUST method](https://doi.org/10.1016/j.patter.2021.100372)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbf802bc-f709-4163-9b49-8fa5f6ce59ab", + "metadata": {}, + "outputs": [], + "source": [ + "archive = pooch.retrieve(\n", + " url=\"https://figshare.com/ndownloader/articles/23063615/versions/1\",\n", + " path=pooch.os_cache(\"cev\"),\n", + " fname=\"data.zip\",\n", + " known_hash=None,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e62390d2-1242-49a8-9780-be976d39fa42", + "metadata": {}, + "source": [ + "## Comparing Two Embeddings of the same Data\n", + "\n", + "In the first example, we are going to use `cev` to compare two different embedding methods that were run on the very same data (the tissue sample): standard UMAP and annotation transformation UMAP.\n", + "\n", + "Different embedding methods can produce very different embedding spaces and it's often hard to assess the difference holistically. `cev` enables us to quantify two properties based on shared point labels:\n", + "\n", + "1. Confusion: the degree to which two or more labels are visually intermixed\n", + "2. Neighborhood: the degree to which the local neighborhood of a label has changed between the two embeddings\n", + "\n", + "Visualized as a heatmap, these two properties can quickly guide us to point clusters that are better or less resolved in either one of the two embeddings. They can also help us find compositional changes between the two embeddings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7874813c-810f-40e5-92ab-91f228046a5e", + "metadata": {}, + "outputs": [], + "source": [ + "with zipfile.ZipFile(archive, \"r\") as z:\n", + " with z.open(\"mair-2022-tissue-138-umap.pq\") as f:\n", + " tissue_umap_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", + " with z.open(\"mair-2022-tissue-138-ozette.pq\") as f:\n", + " tissue_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)\n", + " with z.open(\"mair-2022-tumor-006-ozette.pq\") as f:\n", + " tumor_ozette_embedding = pd.read_parquet(f).pipe(Embedding.from_ozette)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3d7e114-9fd3-4785-bdca-e3f4bbf37df8", + "metadata": {}, + "outputs": [], + "source": [ + "umap_vs_ozette = EmbeddingComparisonWidget(\n", + " tissue_umap_embedding,\n", + " tissue_ozette_embedding,\n", + " titles=[\"Standard UMAP (Tissue)\", \"Annotation-Transformed UMAP (Tissue)\"],\n", + " metric=\"confusion\",\n", + " selection=\"synced\",\n", + " auto_zoom=True,\n", + " row_height=320,\n", + ")\n", + "umap_vs_ozette" + ] + }, + { + "cell_type": "markdown", + "id": "a516d65a-351b-4365-a267-704cd93a9c0e", + "metadata": {}, + "source": [ + "In this example, we can see that the point labels are much more intermixed in the standard UMAP embedding compared to the annotation transformation UMAP. This is not surprising as the standard UMAP embedding is not optimized for flow cytometry data in any way and is thus only resolving broad cell phenotypes based on a few markers. You can see this by holding down `SHIFT` and clicking on `CD8` under _Markers_, which reduces the label resolution and shows that under a reduced label resolution, the confusion is much lower in the standard UMAP embedding.\n", + "\n", + "When selecting _Neighborhood_ from the _Metric_ drop-down menu, we switch to the neighborhood composition difference quantification. 
When only a few markers (e.g., `CD4` and `CD8`) are active, we can see that most of the neighborhoods remain unchanged. When we gradually add more markers, we can see how the local neighborhood composition difference slowly increases, which is due to the fact that the annotation transformation spaces out all point label clusters.\n", + "\n", + "To study certain clusters or labels in detail, you can either interactively select points in the embedding via [jupyter-scatter](https://github.com/flekschas/jupyter-scatter)'s lasso selection or you can programmatically select points by their label via the `select()` method. For instance, the next call will select all CD4+ T cells." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba7a378f-4212-4953-be5b-7a273f8bc75e", + "metadata": {}, + "outputs": [], + "source": [ + "umap_vs_ozette.select([\"CD3+\", \"CD4+\", \"CD8-\"])" + ] + }, + { + "cell_type": "markdown", + "id": "3c439e4d-0679-4e64-a1c7-4be93cbbe039", + "metadata": {}, + "source": [ + "## Size Differences Between _Non-Responder_ and _Responder_\n", + "\n", + "Instead of comparing identical data, let's take a look at two transformed and aligned embeddings: tissue vs tumor. The embeddings are both annotation-transformed and aligned, ensuring low confusion and high neighborhood similarity (check to confirm!). The abundance metric aids in identifying potential shifts in phenotype abundance, providing a comprehensive and visually intuitive method for analyzing complex cytometry data. Remember, our metric should be used as an exploratory tool to guide exploration and quickly surface potentially interesting phenotypes, but robust statistical methods must be applied to confirm whether any abundance differences exist." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f99361b-6e96-4a6d-ad65-0533c23bece7", + "metadata": {}, + "outputs": [], + "source": [ + "tissue_vs_tumor = EmbeddingComparisonWidget(\n", + " tissue_ozette_embedding,\n", + " tumor_ozette_embedding,\n", + " titles=[\"Tissue\", \"Tumor\"],\n", + " metric=\"abundance\",\n", + " selection=\"phenotype\",\n", + " auto_zoom=True,\n", + " row_height=320,\n", + ")\n", + "\n", + "tissue_vs_tumor" + ] + }, + { + "cell_type": "markdown", + "id": "6d632c95-dff8-4b90-b763-f3055c4e8047", + "metadata": {}, + "source": [ + "The following **CD8+ T cells** are more abundant in `tissue` (i.e., the relative abundance is higher on the left) compared to `tumor` (i.e., the relative abundance is lower on the right)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f7ebd73-32e7-48ed-8575-8d14d2edc73f", + "metadata": {}, + "outputs": [], + "source": [ + "tissue_vs_tumor.select(\n", + " \"CD4-CD8+CD3+CD45RA+CD27+CD19-CD103-CD28-CD69+PD1+HLADR-GranzymeB-CD25-ICOS-TCRgd-CD38-CD127-Tim3-\"\n", + ")" + ] } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "name": "python", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }