Commit d986614

feat(component,ai,gemini): add image generation support (#1122)
Because

- The Gemini component lacked support for image generation using the `gemini-2.5-flash-image-preview` model
- Generated images were being exposed as raw binary data arrays in the `candidates` field instead of being properly extracted to the `images` field
- Image processing was duplicated and inefficient in streaming responses, processing the same images multiple times

This commit

- Adds image generation capability to the Gemini component by extracting images from `genai.GenerateContentResponse`
- Extracts generated images from API responses and converts them to `format.Image` objects in the `Images` output field
- Cleans up raw binary `InlineData` from candidates after image extraction to prevent JSON exposure of binary data
- Optimizes streaming responses by deferring image processing to the final response only, eliminating duplicate processing
- Updates component configuration documentation to clarify the relationship between the `images` field and the `candidates` field
1 parent 1b4cd1f commit d986614
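The extract-then-clear pattern described above can be sketched in isolation. This is a minimal, self-contained illustration using mock `Part`/`Blob` types standing in for the `google.golang.org/genai` types (the real component converts the bytes to `format.Image` via the pipeline's `data` package, which is omitted here):

```go
package main

import (
	"fmt"
	"strings"
)

// Blob and Part are simplified stand-ins for genai.Blob and genai.Part.
type Blob struct {
	MIMEType string
	Data     []byte
}

type Part struct {
	Text       string
	InlineData *Blob
}

// extractImages collects image payloads from parts and nils out the raw
// InlineData afterwards, so binary bytes are not serialized to JSON later.
// Non-image InlineData (e.g. audio) is left untouched.
func extractImages(parts []*Part) [][]byte {
	var images [][]byte
	for _, p := range parts {
		if p != nil && p.InlineData != nil && strings.Contains(strings.ToLower(p.InlineData.MIMEType), "image") {
			images = append(images, p.InlineData.Data)
			p.InlineData = nil // prevent raw binary exposure in JSON output
		}
	}
	return images
}

func main() {
	parts := []*Part{
		{Text: "here is an image:"},
		{InlineData: &Blob{MIMEType: "image/png", Data: []byte{0x89, 0x50}}},
		{InlineData: &Blob{MIMEType: "audio/wav", Data: []byte("wav")}},
	}
	imgs := extractImages(parts)
	fmt.Println(len(imgs))           // 1
	fmt.Println(parts[1].InlineData) // <nil>
	fmt.Println(parts[2].InlineData != nil)
}
```

The same cleanup step is what keeps the `candidates` field JSON-safe after extraction.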

File tree

4 files changed: +214 −13 lines

pkg/component/ai/gemini/v0/config/tasks.yaml

Lines changed: 25 additions & 11 deletions
@@ -1556,19 +1556,22 @@ TASK_CHAT:
           ID of the model to use.
           The value is one of the following:
           `gemini-2.5-pro`: Optimized for enhanced thinking and reasoning, multimodal understanding, advanced coding, and more.
-          `gemini-2.5-flash`: Optimized for Adaptive thinking, cost efficiency.
-          `gemini-2.0-flash-lite`: Optimized for Most cost-efficient model supporting high throughput.
+          `gemini-2.5-flash`: Optimized for adaptive thinking, cost efficiency.
+          `gemini-2.5-flash-lite`: Optimized for most cost-efficient model supporting high throughput.
+          `gemini-2.5-flash-image-preview`: Optimized for precise, conversational image generation and editing.
         type: string
         enum:
           - gemini-2.5-pro
           - gemini-2.5-flash
-          - gemini-2.0-flash-lite
+          - gemini-2.5-flash-lite
+          - gemini-2.5-flash-image-preview
         default: gemini-2.5-flash
         instillCredentialMap:
           values:
             - gemini-2.5-pro
             - gemini-2.5-flash
-            - gemini-2.0-flash-lite
+            - gemini-2.5-flash-lite
+            - gemini-2.5-flash-image-preview
           targets:
             - setup.api-key
       stream:
@@ -1709,20 +1712,31 @@ TASK_CHAT:
         title: Texts
         description: >-
           Simplified text output extracted from candidates. Each string represents the concatenated text content from the corresponding candidate's parts,
-          including thought processes when `include-thoughts` is enabled. This field provides easy access to the generated text without needing to traverse
+          including thought processes when `include-thoughts` is enabled. This field provides easy access to the generated text without needing to traverse
           the candidate structure. Updated in real-time during streaming.
         type: array
         items:
           type: string
-      usage:
+      images:
         uiOrder: 1
+        title: Images
+        description: >-
+          Images output extracted and converted from candidates. This field provides easy access to the generated images as base64-encoded strings.
+          The original binary data is removed from the candidates field to prevent raw binary exposure in JSON output. This field is only available when
+          the model supports image generation.
+        type: array
+        items:
+          title: Image
+          type: image/webp
+      usage:
+        uiOrder: 2
         title: Usage
         description: >-
           Token usage statistics: prompt tokens, completion tokens, total tokens, etc.
         type: object
         additionalProperties: true
       candidates:
-        uiOrder: 2
+        uiOrder: 3
         title: Candidates
         description: >-
           Complete candidate objects from the model containing rich metadata and structured content. Each candidate includes safety ratings, finish reason,
@@ -1732,18 +1746,18 @@ TASK_CHAT:
         items:
           $ref: "#/$defs/candidate"
       usage-metadata:
-        uiOrder: 3
+        uiOrder: 4
        $ref: "#/$defs/usage-metadata"
       prompt-feedback:
-        uiOrder: 4
+        uiOrder: 5
         $ref: "#/$defs/prompt-feedback"
       model-version:
-        uiOrder: 5
+        uiOrder: 6
         title: Model Version
         description: The model version used to generate the response.
         type: string
       response-id:
-        uiOrder: 6
+        uiOrder: 7
         title: Response ID
         description: Identifier for this response.
         type: string
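With the schema above, a `TASK_CHAT` output from the image-generation model would look roughly like the following. All values here are illustrative, not produced by the actual component:

```yaml
texts:
  - "Here is the image you asked for."
images:
  - "UklGRh4AAABXRUJQ..."   # base64-encoded image string (illustrative)
usage:
  total-tokens: 1290
candidates: []               # InlineData already stripped, so no raw binary appears here
model-version: "gemini-2.5-flash-image-preview"
```

The key point is that `images` carries the extracted, encoded images while `candidates` retains only metadata and text parts.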

pkg/component/ai/gemini/v0/io.go

Lines changed: 3 additions & 2 deletions
@@ -59,8 +59,9 @@ func (t TaskChatInput) GetSystemInstruction() *genai.Content { return t.SystemIn
 // TaskChatOutput is the output for the TASK_CHAT task.
 type TaskChatOutput struct {
 	// Flattened chat output properties
-	Texts []string       `instill:"texts"`
-	Usage map[string]any `instill:"usage"`
+	Texts  []string       `instill:"texts"`
+	Images []format.Image `instill:"images"`
+	Usage  map[string]any `instill:"usage"`
 
 	// Use genai types directly with instill tags
 	Candidates []*genai.Candidate `instill:"candidates"`

pkg/component/ai/gemini/v0/task_chat.go

Lines changed: 34 additions & 0 deletions
@@ -3,10 +3,13 @@ package gemini
 import (
 	"context"
 	"fmt"
+	"strings"
 
 	"google.golang.org/genai"
 
 	"github.com/instill-ai/pipeline-backend/pkg/component/base"
+	"github.com/instill-ai/pipeline-backend/pkg/data"
+	"github.com/instill-ai/pipeline-backend/pkg/data/format"
 )
 
 func (e *execution) chat(ctx context.Context, job *base.Job) error {
@@ -260,6 +263,7 @@ func (e *execution) mergeResponseChunk(r *genai.GenerateContentResponse, finalRe
 func (e *execution) buildStreamOutput(texts []string, finalResp *genai.GenerateContentResponse) TaskChatOutput {
 	streamOutput := TaskChatOutput{
 		Texts:         texts,
+		Images:        []format.Image{},
 		Usage:         map[string]any{},
 		Candidates:    []*genai.Candidate{},
 		UsageMetadata: nil,
@@ -281,6 +285,10 @@ func (e *execution) buildStreamOutput(texts []string, finalResp *genai.GenerateC
 		streamOutput.ResponseID = &ri
 	}
 
+	// Note: Image extraction and InlineData cleanup is deferred to renderFinal()
+	// to avoid processing the same images multiple times during streaming.
+	// Streaming responses will have empty Images array until the final response.
+
 	// Build usage map from UsageMetadata if available
 	if finalResp.UsageMetadata != nil {
 		streamOutput.Usage = buildUsageMap(finalResp.UsageMetadata)
@@ -348,6 +356,7 @@ func buildUsageMap(metadata *genai.GenerateContentResponseUsageMetadata) map[str
 func renderFinal(resp *genai.GenerateContentResponse, texts []string) TaskChatOutput {
 	out := TaskChatOutput{
 		Texts:         []string{},
+		Images:        []format.Image{},
 		Usage:         map[string]any{},
 		Candidates:    []*genai.Candidate{},
 		UsageMetadata: nil,
@@ -385,6 +394,31 @@ func renderFinal(resp *genai.GenerateContentResponse, texts []string) TaskChatOu
 		}
 		out.Texts = acc
 	}
+
+	// Extract generated images from candidates and clean up InlineData to prevent raw binary exposure
+	if len(resp.Candidates) > 0 {
+		images := make([]format.Image, 0)
+		for _, c := range resp.Candidates {
+			if c.Content != nil {
+				for _, p := range c.Content.Parts {
+					if p != nil && p.InlineData != nil && strings.Contains(strings.ToLower(p.InlineData.MIMEType), "image") {
+						// Convert blob data to format.Image using the standard data package approach
+						// Normalize MIME type and use the existing NewImageFromBytes function
+						normalizedMimeType := strings.ToLower(strings.TrimSpace(strings.Split(p.InlineData.MIMEType, ";")[0]))
+						img, err := data.NewImageFromBytes(p.InlineData.Data, normalizedMimeType, "", true)
+						if err == nil {
+							images = append(images, img)
+						}
+						// Clean up InlineData to prevent raw binary data from being exposed in JSON output
+						// The binary data is already extracted and converted to format.Image above
+						p.InlineData = nil
+					}
+				}
+			}
+		}
+		out.Images = images
+	}
+
 	if resp.UsageMetadata != nil {
 		out.Usage = buildUsageMap(resp.UsageMetadata)
 	}

pkg/component/ai/gemini/v0/task_chat_test.go

Lines changed: 152 additions & 0 deletions
@@ -5,6 +5,7 @@ import (
 	"encoding/base64"
 	"fmt"
 	"os"
+	"strings"
 	"testing"
 	"time"
@@ -1339,3 +1340,154 @@ func TestChatPerformanceOptimizations(t *testing.T) {
 		c.Check(videoTimeout > documentTimeout, qt.IsTrue)
 	})
 }
+
+func TestImageGeneration(t *testing.T) {
+	t.Parallel()
+
+	t.Run("image MIME type detection", func(t *testing.T) {
+		c := qt.New(t)
+
+		// Test the standard approach used in the codebase
+		// Test valid image MIME types
+		c.Check(strings.Contains(strings.ToLower("image/png"), "image"), qt.Equals, true)
+		c.Check(strings.Contains(strings.ToLower("image/jpeg"), "image"), qt.Equals, true)
+		c.Check(strings.Contains(strings.ToLower("image/gif"), "image"), qt.Equals, true)
+		c.Check(strings.Contains(strings.ToLower("image/webp"), "image"), qt.Equals, true)
+		c.Check(strings.Contains(strings.ToLower("IMAGE/PNG"), "image"), qt.Equals, true) // Case insensitive
+
+		// Test non-image MIME types
+		c.Check(strings.Contains(strings.ToLower("text/plain"), "image"), qt.Equals, false)
+		c.Check(strings.Contains(strings.ToLower("application/json"), "image"), qt.Equals, false)
+		c.Check(strings.Contains(strings.ToLower("video/mp4"), "image"), qt.Equals, false)
+		c.Check(strings.Contains(strings.ToLower("audio/wav"), "image"), qt.Equals, false)
+	})
+
+	t.Run("renderFinal with generated images", func(t *testing.T) {
+		c := qt.New(t)
+
+		// Create a mock response with generated images
+		// Use a simple 1x1 PNG image data
+		pngData := []byte{
+			0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, // PNG signature
+			0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52, // IHDR chunk
+			0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, // 1x1 dimensions
+			0x08, 0x02, 0x00, 0x00, 0x00, 0x90, 0x77, 0x53, // bit depth, color type, etc.
+			0xDE, 0x00, 0x00, 0x00, 0x0C, 0x49, 0x44, 0x41, // IDAT chunk
+			0x54, 0x08, 0x99, 0x01, 0x01, 0x01, 0x00, 0x00, // pixel data
+			0xFE, 0xFF, 0x00, 0x00, 0x02, 0x00, 0x01, 0xE5, // checksum
+			0x27, 0xDE, 0xFC, 0x00, 0x00, 0x00, 0x00, 0x49, // IEND chunk
+			0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82,
+		}
+
+		resp := &genai.GenerateContentResponse{
+			Candidates: []*genai.Candidate{
+				{
+					Content: &genai.Content{
+						Parts: []*genai.Part{
+							{Text: "Here's your generated image:"},
+							{InlineData: &genai.Blob{MIMEType: "image/png", Data: pngData}},
+						},
+					},
+				},
+			},
+		}
+
+		result := renderFinal(resp, nil)
+
+		// Check that text was extracted
+		c.Check(result.Texts, qt.HasLen, 1)
+		c.Check(result.Texts[0], qt.Equals, "Here's your generated image:")
+
+		// Check that images were extracted
+		c.Check(result.Images, qt.HasLen, 1)
+		c.Check(result.Images[0].ContentType().String(), qt.Equals, "image/png")
+	})
+
+	t.Run("buildStreamOutput with generated images", func(t *testing.T) {
+		c := qt.New(t)
+
+		// Mock execution for the method receiver
+		e := &execution{}
+
+		// Use a simple 1x1 PNG image data
+		pngData := []byte{
+			0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, // PNG signature
+			0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52, // IHDR chunk
+			0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, // 1x1 dimensions
+			0x08, 0x02, 0x00, 0x00, 0x00, 0x90, 0x77, 0x53, // bit depth, color type, etc.
+			0xDE, 0x00, 0x00, 0x00, 0x0C, 0x49, 0x44, 0x41, // IDAT chunk
+			0x54, 0x08, 0x99, 0x01, 0x01, 0x01, 0x00, 0x00, // pixel data
+			0xFE, 0xFF, 0x00, 0x00, 0x02, 0x00, 0x01, 0xE5, // checksum
+			0x27, 0xDE, 0xFC, 0x00, 0x00, 0x00, 0x00, 0x49, // IEND chunk
+			0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82,
+		}
+
+		texts := []string{"Generated image:"}
+		finalResp := &genai.GenerateContentResponse{
+			Candidates: []*genai.Candidate{
+				{
+					Content: &genai.Content{
+						Parts: []*genai.Part{
+							{Text: "Generated image:"},
+							{InlineData: &genai.Blob{MIMEType: "image/png", Data: pngData}},
+						},
+					},
+				},
+			},
+		}
+
+		result := e.buildStreamOutput(texts, finalResp)
+
+		// Check that texts are preserved
+		c.Check(result.Texts, qt.DeepEquals, texts)
+
+		// Check that images are NOT extracted during streaming (deferred to renderFinal)
+		c.Check(result.Images, qt.HasLen, 0)
+	})
+
+	t.Run("renderFinal with mixed content", func(t *testing.T) {
+		c := qt.New(t)
+
+		// Use a simple 1x1 PNG image data
+		pngData := []byte{
+			0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, // PNG signature
+			0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52, // IHDR chunk
+			0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01, // 1x1 dimensions
+			0x08, 0x02, 0x00, 0x00, 0x00, 0x90, 0x77, 0x53, // bit depth, color type, etc.
+			0xDE, 0x00, 0x00, 0x00, 0x0C, 0x49, 0x44, 0x41, // IDAT chunk
+			0x54, 0x08, 0x99, 0x01, 0x01, 0x01, 0x00, 0x00, // pixel data
+			0xFE, 0xFF, 0x00, 0x00, 0x02, 0x00, 0x01, 0xE5, // checksum
+			0x27, 0xDE, 0xFC, 0x00, 0x00, 0x00, 0x00, 0x49, // IEND chunk
+			0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82,
+		}
+
+		// Create a response with text, images, and non-image content
+		resp := &genai.GenerateContentResponse{
+			Candidates: []*genai.Candidate{
+				{
+					Content: &genai.Content{
+						Parts: []*genai.Part{
+							{Text: "Here are your images: "},
+							{InlineData: &genai.Blob{MIMEType: "image/png", Data: pngData}},
+							{Text: " and "},
+							{InlineData: &genai.Blob{MIMEType: "image/png", Data: pngData}}, // Use PNG data with PNG MIME for valid image
+							{Text: " Done!"},
+							{InlineData: &genai.Blob{MIMEType: "audio/wav", Data: []byte("audio-data")}}, // Non-image, should be ignored
+						},
+					},
+				},
+			},
+		}
+
+		result := renderFinal(resp, nil)
+
+		// Check that all text parts were concatenated
+		c.Check(result.Texts, qt.HasLen, 1)
+		c.Check(result.Texts[0], qt.Equals, "Here are your images: and Done!")
+
+		// Check that only image parts were extracted (audio should be ignored)
+		c.Check(result.Images, qt.HasLen, 2)
+		c.Check(result.Images[0].ContentType().String(), qt.Equals, "image/png")
+		c.Check(result.Images[1].ContentType().String(), qt.Equals, "image/png")
+	})
+}
