Add image-text-to-image and image-text-to-video tasks
#1866
Merged
47d5221
add image-text-to-image and image-text-to-video tasks
apolinario 3aec3d2
add snippetGenerator
apolinario d8b1a33
Update packages/tasks/src/tasks/image-text-to-video/about.md
apolinario 4262e28
Update packages/tasks/src/tasks/image-text-to-video/about.md
apolinario 6293843
Update packages/tasks/src/tasks/image-text-to-video/about.md
apolinario 0a2d889
Update packages/tasks/src/tasks/image-text-to-video/data.ts
apolinario 9913ed9
change examples and data
apolinario 7dbfd4b
modify the filenames
apolinario cd0934d
Merge branch 'new-image-text-tasks' of https://github.com/huggingface…
apolinario
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| import type { ImageTextToImageInput } from "@huggingface/tasks"; | ||
| import { resolveProvider } from "../../lib/getInferenceProviderMapping.js"; | ||
| import { getProviderHelper } from "../../lib/getProviderHelper.js"; | ||
| import type { BaseArgs, Options } from "../../types.js"; | ||
| import { innerRequest } from "../../utils/request.js"; | ||
|
|
||
| export type ImageTextToImageArgs = BaseArgs & ImageTextToImageInput; | ||
|
|
||
| /** | ||
| * This task takes an image and text input and outputs a new generated image. | ||
| * Recommended model: black-forest-labs/FLUX.2-dev | ||
| */ | ||
| export async function imageTextToImage(args: ImageTextToImageArgs, options?: Options): Promise<Blob> { | ||
| const provider = await resolveProvider(args.provider, args.model, args.endpointUrl); | ||
| const providerHelper = getProviderHelper(provider, "image-text-to-image"); | ||
| const payload = await providerHelper.preparePayloadAsync(args); | ||
| const { data: res, requestContext } = await innerRequest<Blob>(payload, providerHelper, { | ||
| ...options, | ||
| task: "image-text-to-image", | ||
| }); | ||
| return providerHelper.getResponse(res, requestContext.url, requestContext.info.headers as Record<string, string>); | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| import type { ImageTextToVideoInput } from "@huggingface/tasks"; | ||
| import { resolveProvider } from "../../lib/getInferenceProviderMapping.js"; | ||
| import { getProviderHelper } from "../../lib/getProviderHelper.js"; | ||
| import type { BaseArgs, Options } from "../../types.js"; | ||
| import { innerRequest } from "../../utils/request.js"; | ||
|
|
||
| export type ImageTextToVideoArgs = BaseArgs & ImageTextToVideoInput; | ||
|
|
||
| /** | ||
| * This task takes an image and text input and outputs a generated video. | ||
| * Recommended model: Lightricks/LTX-Video | ||
| */ | ||
| export async function imageTextToVideo(args: ImageTextToVideoArgs, options?: Options): Promise<Blob> { | ||
| const provider = await resolveProvider(args.provider, args.model, args.endpointUrl); | ||
| const providerHelper = getProviderHelper(provider, "image-text-to-video"); | ||
| const payload = await providerHelper.preparePayloadAsync(args); | ||
| const { data: res, requestContext } = await innerRequest<Blob>(payload, providerHelper, { | ||
| ...options, | ||
| task: "image-text-to-video", | ||
| }); | ||
| return providerHelper.getResponse(res, requestContext.url, requestContext.info.headers as Record<string, string>); | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| ## Use Cases | ||
|
|
||
| ### Instruction-based Image Editing | ||
|
|
||
| Image-text-to-image models can be used to edit images based on natural language instructions. For example, you can provide an image of a summer landscape and the instruction "Make it winter, add snow" to generate a winter version of the same scene. | ||
|
|
||
| ### Style Transfer | ||
|
|
||
| These models can apply artistic styles or transformations to images based on text descriptions. For instance, you can transform a photo into a painting style by providing prompts like "Make it look like a Van Gogh painting" or "Convert to watercolor style." | ||
|
|
||
| ### Image Variations | ||
|
|
||
| Generate variations of an existing image by providing different text prompts. This is useful for creative workflows where you want to explore different versions of the same image with specific modifications. | ||
|
|
||
| ### Guided Image Generation | ||
|
|
||
| Use a reference image along with text prompts to guide the generation process. This allows for more controlled image generation compared to text-to-image models alone, as the reference image provides structural guidance. | ||
|
|
||
| ### Image Inpainting and Outpainting | ||
|
|
||
| Fill in missing or masked parts of an image based on text descriptions, or extend an image beyond its original boundaries with text-guided generation. | ||
|
|
||
| ## Task Variants | ||
|
|
||
| ### Instruction-based Editing | ||
|
|
||
| Models that follow natural language instructions to perform complex edits such as object removal, color changes, and compositional modifications. | ||
|
|
||
| ### Reference-guided Generation | ||
|
|
||
| Models that use a reference image to guide the generation process while incorporating text prompts to control specific attributes or modifications. | ||
|
|
||
| ### Conditional Image-to-Image | ||
|
|
||
| Models that perform specific transformations based on text conditions, such as changing weather conditions, time of day, or seasonal variations. | ||
|
|
||
| ## Inference | ||
|
|
||
| You can use the Diffusers library to interact with image-text-to-image models. | ||
|
|
||
| ```python | ||
| import torch | ||
| from diffusers import Flux2Pipeline | ||
| from diffusers.utils import load_image | ||
|
|
||
| repo_id = "black-forest-labs/FLUX.2-dev" | ||
| device = "cuda:0" | ||
| torch_dtype = torch.bfloat16 | ||
|
|
||
| pipe = Flux2Pipeline.from_pretrained( | ||
| repo_id, torch_dtype=torch_dtype | ||
| ) | ||
| pipe.enable_model_cpu_offload()  # not needed on cards with more than 80 GB of VRAM (H200, B200, etc.); call `pipe.to(device)` instead | ||
|
|
||
| prompt = "Realistic macro photograph of a hermit crab using a soda can as its shell, partially emerging from the can, captured with sharp detail and natural colors, on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean waves in the background. The can has the text `BFL Diffusers` on it and it has a color gradient that starts with #FF5733 at the top and transitions to #33FF57 at the bottom." | ||
|
|
||
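| # Optional: load a reference image and pass it through `image=` below to condition on it. | ||
| # cat_image = load_image("https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev-fp8-dynamic/resolve/main/cat.png") | ||
| image = pipe( | ||
| prompt=prompt, | ||
| # image=[cat_image],  # a list allows multi-image input | ||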
| generator=torch.Generator(device=device).manual_seed(42), | ||
| num_inference_steps=50, | ||
| guidance_scale=4, | ||
| ).images[0] | ||
|
|
||
| image.save("flux2_output.png") | ||
| ``` | ||
|
|
||
| ## Useful Resources | ||
|
|
||
| - [FLUX.2 Model Card](https://huggingface.co/black-forest-labs/FLUX.2-dev) | ||
| - [Diffusers documentation on Image-to-Image](https://huggingface.co/docs/diffusers/using-diffusers/img2img) | ||
| - [ControlNet for Conditional Image Generation](https://huggingface.co/docs/diffusers/using-diffusers/controlnet) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| import type { TaskDataCustom } from "../index.js"; | ||
|
|
||
| const taskData: TaskDataCustom = { | ||
| datasets: [], | ||
| demo: { | ||
| inputs: [ | ||
| { | ||
| filename: "image-text-to-image-input.jpeg", | ||
| type: "img", | ||
| }, | ||
| { | ||
| label: "Input", | ||
| content: "A city above clouds, pastel colors, Victorian style", | ||
| type: "text", | ||
| }, | ||
| ], | ||
| outputs: [ | ||
| { | ||
| filename: "image-text-to-image-output.png", | ||
| type: "img", | ||
| }, | ||
| ], | ||
| }, | ||
| metrics: [ | ||
| { | ||
| description: | ||
| "The Fréchet Inception Distance (FID) measures the distance between the distributions of synthetic and real samples. A lower FID score indicates that the generated images are distributed more like the real ones.", | ||
| id: "FID", | ||
| }, | ||
| { | ||
| description: | ||
| "CLIP Score measures the similarity between the generated image and the text prompt using CLIP embeddings. A higher score indicates better alignment with the text prompt.", | ||
| id: "CLIP", | ||
| }, | ||
| ], | ||
| models: [ | ||
| { | ||
| description: "A powerful model for image-text-to-image generation.", | ||
| id: "black-forest-labs/FLUX.2-dev", | ||
| }, | ||
| ], | ||
| spaces: [ | ||
| { | ||
| description: "An application for image-text-to-image generation.", | ||
| id: "black-forest-labs/FLUX.2-dev", | ||
| }, | ||
| ], | ||
| summary: | ||
| "Image-text-to-image models take an image and a text prompt as input and generate a new image based on the reference image and text instructions. These models are useful for image editing, style transfer, image variations, and guided image generation tasks.", | ||
| widgetModels: ["black-forest-labs/FLUX.2-dev"], | ||
| youtubeId: undefined, | ||
| }; | ||
|
|
||
| export default taskData; | ||
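
The CLIP metric description above says the score compares the generated image with the text prompt through CLIP embeddings. As a rough sketch of the comparison step only, the snippet below computes the cosine similarity between two embedding vectors. It assumes the embeddings were already produced by a CLIP model elsewhere; `cosineSimilarity` and the example vectors are hypothetical and not part of this PR, and published CLIP Score variants may additionally rescale or clip the value.

```typescript
// Hypothetical helper, not part of the PR: cosine similarity between two
// precomputed embedding vectors (e.g. a CLIP image and a CLIP text embedding).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Embeddings must be non-empty and have the same length.");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Embeddings must not be all zeros.");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up vectors standing in for real CLIP embeddings.
const imageEmbedding = [0.2, 0.7, 0.1];
const textEmbedding = [0.25, 0.6, 0.05];
console.log(cosineSimilarity(imageEmbedding, textEmbedding));
```

A value closer to 1 means the two embeddings point in nearly the same direction, which matches the "higher is better" reading in the metric description.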
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| /** | ||
| * Inference code generated from the JSON schema spec in ./spec | ||
| * | ||
| * Using src/scripts/inference-codegen | ||
| */ | ||
| /** | ||
| * Inputs for Image Text To Image inference. Either inputs (image) or prompt (in parameters) | ||
| * must be provided, or both. | ||
| */ | ||
| export interface ImageTextToImageInput { | ||
| /** | ||
| * The input image data as a base64-encoded string. If no `parameters` are provided, you can | ||
| * also provide the image data as a raw bytes payload. Either this or prompt must be | ||
| * provided. | ||
| */ | ||
| inputs?: Blob; | ||
| /** | ||
| * Additional inference parameters for Image Text To Image | ||
| */ | ||
| parameters?: ImageTextToImageParameters; | ||
| [property: string]: unknown; | ||
| } | ||
| /** | ||
| * Additional inference parameters for Image Text To Image | ||
| */ | ||
| export interface ImageTextToImageParameters { | ||
| /** | ||
| * For diffusion models. A higher guidance scale value encourages the model to generate | ||
| * images closely linked to the text prompt at the expense of lower image quality. | ||
| */ | ||
| guidance_scale?: number; | ||
| /** | ||
| * One prompt to guide what NOT to include in image generation. | ||
| */ | ||
| negative_prompt?: string; | ||
| /** | ||
| * For diffusion models. The number of denoising steps. More denoising steps usually lead to | ||
| * a higher quality image at the expense of slower inference. | ||
| */ | ||
| num_inference_steps?: number; | ||
| /** | ||
| * The text prompt to guide the image generation. Either this or inputs (image) must be | ||
| * provided. | ||
| */ | ||
| prompt?: string; | ||
| /** | ||
| * Seed for the random number generator. | ||
| */ | ||
| seed?: number; | ||
| /** | ||
| * The size in pixels of the output image. This parameter is only supported by some | ||
| * providers and for specific models. It will be ignored when unsupported. | ||
| */ | ||
| target_size?: TargetSize; | ||
| [property: string]: unknown; | ||
| } | ||
| /** | ||
| * The size in pixels of the output image. This parameter is only supported by some | ||
| * providers and for specific models. It will be ignored when unsupported. | ||
| */ | ||
| export interface TargetSize { | ||
| height: number; | ||
| width: number; | ||
| [property: string]: unknown; | ||
| } | ||
| /** | ||
| * Outputs of inference for the Image Text To Image task | ||
| */ | ||
| export interface ImageTextToImageOutput { | ||
| /** | ||
| * The generated image returned as raw bytes in the payload. | ||
| */ | ||
| image: unknown; | ||
| [property: string]: unknown; | ||
| } |
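
The generated types mark both `inputs` and `parameters.prompt` as optional, while the doc comments state that at least one of them must be provided. A caller could check that rule before sending a request, roughly as sketched below. `assertHasImageOrPrompt` is a hypothetical helper, not something this PR adds; the interfaces are trimmed local mirrors of the generated ones, and `Uint8Array` stands in for `Blob` so the sketch runs without DOM typings. Treating a whitespace-only prompt as missing is an assumption of this sketch.

```typescript
// Trimmed local mirrors of the generated interfaces shown above.
interface ImageTextToImageParameters {
  prompt?: string;
  negative_prompt?: string;
  guidance_scale?: number;
  num_inference_steps?: number;
  seed?: number;
  target_size?: { height: number; width: number };
  [property: string]: unknown;
}

interface ImageTextToImageInput {
  // The generated type uses Blob; Uint8Array is a stand-in for this sketch.
  inputs?: Uint8Array;
  parameters?: ImageTextToImageParameters;
  [property: string]: unknown;
}

// Hypothetical helper: enforce "either inputs (image) or prompt must be
// provided, or both" before building a request.
function assertHasImageOrPrompt(input: ImageTextToImageInput): void {
  const hasImage = input.inputs !== undefined;
  const prompt = input.parameters?.prompt;
  // Assumption of this sketch: a blank prompt counts as no prompt.
  const hasPrompt = typeof prompt === "string" && prompt.trim().length > 0;
  if (!hasImage && !hasPrompt) {
    throw new Error("Provide `inputs` (image), `parameters.prompt`, or both.");
  }
}

// Passes: a prompt alone satisfies the rule.
assertHasImageOrPrompt({ parameters: { prompt: "Make it winter, add snow" } });
```

Running the check on the client keeps an obviously invalid request from reaching a provider, although each provider may still apply its own stricter requirements.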