@jjoyce0510 jjoyce0510 commented Nov 4, 2025

Introducing Documents in DataHub (Context)

This PR introduces a new Document entity to DataHub, enabling users to create, manage, and organize first-party knowledge base content directly within the platform. Documents can be hierarchically organized, linked to data assets, and managed through a complete lifecycle including draft/publish workflows.

Core Data Models

Introduces comprehensive metadata models for the Document entity in DataHub:

Entity Definition

  • New document entity with key aspect documentKey and search capabilities
  • Full support for standard DataHub aspects: ownership, domains, tags, glossary terms, structured properties, institutional memory

Core Aspects (PDL Models)

  • DocumentKey - Unique identifier for documents
  • DocumentInfo - Primary aspect containing:
    • Title and text contents
    • Document status (PUBLISHED/UNPUBLISHED)
    • Source information (distinguishes first-party vs third-party ingested documents)
    • Audit stamps (created/lastModified with actor and timestamp)
    • Hierarchical parent-child relationships
    • Related assets (datasets, dashboards, etc.) and related documents
    • Draft workflow support via draftOf field
  • DocumentContents - Text content storage
  • DocumentStatus & DocumentState - Publication state management
  • DocumentSource - Tracking external sources for third-party integrations
  • ParentDocument, RelatedAsset, RelatedDocument - Relationship models
  • DraftOf - Draft-to-published document linking
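To make the shape of these aspects concrete, here is a minimal Python sketch of DocumentInfo as plain dataclasses. The real models are written in PDL; every class and field name below is an assumption inferred from the bullet list above, not the schema shipped in this PR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DocumentState(Enum):
    PUBLISHED = "PUBLISHED"
    UNPUBLISHED = "UNPUBLISHED"


@dataclass
class AuditStamp:
    actor: str  # URN of the acting user
    time: int   # epoch milliseconds


@dataclass
class DocumentInfo:
    # All field names are hypothetical, inferred from the aspect list.
    title: str
    text: str
    state: DocumentState
    created: AuditStamp
    last_modified: AuditStamp
    parent_document: Optional[str] = None  # URN of the parent, if any
    related_assets: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    draft_of: Optional[str] = None  # URN of the published document

    def is_draft(self) -> bool:
        # A document counts as a draft when it points at a published one.
        return self.draft_of is not None
```

In this sketch a draft is simply a document whose draft_of field is set, which mirrors the "Draft workflow support via draftOf field" bullet.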

GraphQL APIs

Comprehensive GraphQL API surface in knowledge.graphql:

Mutations

  1. createDocument - Create new documents with content, relationships, and hierarchy

    • Supports custom IDs or auto-generated UUIDs
    • Can create as draft or published
    • Automatic ownership assignment to creator
  2. updateDocumentContents - Update document text and title

  3. updateDocumentRelatedEntities - Manage relationships to assets and other documents

  4. moveDocument - Relocate documents within the hierarchy

  5. deleteDocument - Remove documents and their references

  6. updateDocumentStatus - Toggle between PUBLISHED/UNPUBLISHED states

  7. mergeDraft - Merge draft content into published document with optional draft deletion
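The create and merge semantics described above can be sketched with a small in-memory store. This is only an illustration of the behavior, not the resolver code from this PR: the class, method, and parameter names are hypothetical, and the choice to create drafts as UNPUBLISHED by default is an assumption.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    urn: str
    title: str
    text: str
    state: str                      # "PUBLISHED" or "UNPUBLISHED"
    owners: List[str] = field(default_factory=list)
    draft_of: Optional[str] = None  # URN of the published document, if a draft


class InMemoryDocs:
    """Toy stand-in for the mutation semantics (hypothetical API)."""

    def __init__(self) -> None:
        self.docs: Dict[str, Doc] = {}

    def create_document(self, actor: str, title: str, text: str,
                        doc_id: Optional[str] = None,
                        state: str = "UNPUBLISHED",
                        draft_of: Optional[str] = None) -> str:
        # Use the custom ID if supplied, otherwise an auto-generated UUID.
        urn = f"urn:li:document:{doc_id or uuid.uuid4()}"
        if urn in self.docs:
            raise ValueError(f"document already exists: {urn}")
        # The creator is assigned as owner automatically.
        self.docs[urn] = Doc(urn, title, text, state, [actor], draft_of)
        return urn

    def merge_draft(self, draft_urn: str, delete_draft: bool = True) -> str:
        draft = self.docs[draft_urn]
        if draft.draft_of is None:
            raise ValueError("not a draft")
        target = self.docs[draft.draft_of]
        # Copy the draft's content onto the published document.
        target.title, target.text = draft.title, draft.text
        if delete_draft:
            del self.docs[draft_urn]
        return target.urn
```

Passing delete_draft=False keeps the draft around after the merge, matching the "optional draft deletion" behavior of mergeDraft.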

Queries

  1. document(urn) - Fetch document by URN with full metadata
  2. searchDocuments - Hybrid semantic search with rich filtering:
    • Semantic query support
    • Filter by parent document (hierarchical browsing)
    • Filter by types, domains, states
    • Option to include/exclude drafts
    • Faceted search support
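The filtering options listed for searchDocuments can be illustrated with a plain in-memory filter. This sketch does not model semantic ranking or facets: a case-insensitive substring match stands in for the semantic query, and all names, as well as the default of excluding drafts, are assumptions rather than the actual GraphQL input fields.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class DocRow:
    urn: str
    title: str
    state: str
    parent: Optional[str] = None
    draft_of: Optional[str] = None


def search_documents(rows: Iterable[DocRow], query: str = "",
                     parent: Optional[str] = None,
                     states: Optional[Set[str]] = None,
                     include_drafts: bool = False) -> List[DocRow]:
    """Apply each filter in turn; a row must pass all of them."""
    out = []
    for row in rows:
        if query and query.lower() not in row.title.lower():
            continue  # substring match stands in for semantic ranking
        if parent is not None and row.parent != parent:
            continue  # hierarchical browsing: direct children only
        if states is not None and row.state not in states:
            continue
        if not include_drafts and row.draft_of is not None:
            continue
        out.append(row)
    return out
```

Filtering by parent returns only direct children, which is enough to browse the hierarchy one level at a time.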

Special Features

  • drafts field - Lists all draft versions of a published document
  • changeHistory field - Chronological audit log of document modifications, including content changes, parent changes (moves), relationship changes, and state changes

Authorization & Privileges

New Platform Privilege

  • MANAGE_DOCUMENTS - Platform-level privilege for managing all documents

Entity-Level Privileges

Documents support standard DataHub entity privileges:

  • VIEW_ENTITY_PAGE / GET_ENTITY - View document
  • EDIT_ENTITY_DOCS / EDIT_ENTITY - Edit document content
  • CREATE_ENTITY - Create documents
  • EDIT_ENTITY_OWNERS - Manage ownership
  • EDIT_ENTITY_DOMAINS - Assign domains
  • SHARE_ENTITY - Share documents
  • EDIT_ENTITY_PROPERTIES - Edit structured properties

Authorization Logic

  • canCreateDocument() - Requires CREATE_ENTITY for documents or MANAGE_DOCUMENTS
  • canEditDocument() - Requires EDIT_ENTITY_DOCS, EDIT_ENTITY, or MANAGE_DOCUMENTS
  • canGetDocument() - Requires VIEW_ENTITY_PAGE or MANAGE_DOCUMENTS
  • canDeleteDocument() - Requires delete authorization or MANAGE_DOCUMENTS
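The rules above reduce to "any one of the listed privileges is sufficient". A minimal sketch, assuming the caller's effective privileges have already been resolved to a flat set of names (the real checks run against DataHub's policy engine and a specific resource); DELETE_ENTITY is an assumed name for the "delete authorization" mentioned above:

```python
from typing import Set

MANAGE_DOCUMENTS = "MANAGE_DOCUMENTS"


def _has_any(granted: Set[str], *required: str) -> bool:
    # Any single matching privilege is sufficient.
    return any(privilege in granted for privilege in required)


def can_create_document(granted: Set[str]) -> bool:
    return _has_any(granted, "CREATE_ENTITY", MANAGE_DOCUMENTS)


def can_edit_document(granted: Set[str]) -> bool:
    return _has_any(granted, "EDIT_ENTITY_DOCS", "EDIT_ENTITY", MANAGE_DOCUMENTS)


def can_get_document(granted: Set[str]) -> bool:
    return _has_any(granted, "VIEW_ENTITY_PAGE", MANAGE_DOCUMENTS)


def can_delete_document(granted: Set[str]) -> bool:
    # "DELETE_ENTITY" is an assumption; the description only says
    # "delete authorization".
    return _has_any(granted, "DELETE_ENTITY", MANAGE_DOCUMENTS)
```

With this shape, MANAGE_DOCUMENTS acts as an override that satisfies every check on its own.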

Backend Services

DocumentService

Complete service layer implementation in metadata-service/services:

  • CRUD operations with validation
  • Draft workflow management (create, merge, track)
  • Hierarchical structure management (move operations)
  • Relationship management (assets and documents)
  • Ownership management
  • State transition handling
  • Full audit trail via lastModified timestamps
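One piece of validation a move operation typically needs is a guard against cycles in the hierarchy. The sketch below shows that check over a simple child-to-parent map; whether DocumentService performs exactly this check is an assumption, and the function name and signature are hypothetical.

```python
from typing import Dict, Optional


def move_document(parents: Dict[str, Optional[str]], urn: str,
                  new_parent: Optional[str]) -> None:
    """Re-parent `urn`, rejecting moves that would create a cycle.

    `parents` maps each document URN to its parent URN (None for roots).
    """
    if urn not in parents:
        raise KeyError(urn)
    if new_parent is not None and new_parent not in parents:
        raise KeyError(new_parent)
    # Walk up from the new parent; reaching `urn` means the move would
    # place the document underneath itself.
    cursor = new_parent
    while cursor is not None:
        if cursor == urn:
            raise ValueError("cannot move a document under itself or a descendant")
        cursor = parents[cursor]
    parents[urn] = new_parent
```

Passing new_parent=None moves the document to the top level.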

Timeline Support

  • DocumentInfoChangeEventGenerator - Generates change events for audit history
  • Tracks all modifications to document aspects
  • Integrates with DataHub's timeline service
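Conceptually, a change event generator compares two versions of an aspect and emits one event per difference. The following is a minimal sketch of that idea, not the DocumentInfoChangeEventGenerator implementation: the field names, category labels, and event shape are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ChangeEvent:
    category: str
    description: str


# Maps a (hypothetical) DocumentInfo field to a change category.
_CATEGORIES = {
    "title": "CONTENT",
    "text": "CONTENT",
    "parent": "PARENT",
    "related_assets": "RELATIONSHIP",
    "related_documents": "RELATIONSHIP",
    "state": "STATE",
}


def diff_document_info(old: Dict[str, Any], new: Dict[str, Any]) -> List[ChangeEvent]:
    """Emit one event for every tracked field whose value changed."""
    events = []
    for name, category in _CATEGORIES.items():
        before, after = old.get(name), new.get(name)
        if before != after:
            events.append(ChangeEvent(category, f"{name}: {before!r} -> {after!r}"))
    return events
```

Running the diff over consecutive versions of the aspect, ordered by timestamp, yields the kind of chronological log that the changeHistory field exposes.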

Factory Beans

  • DocumentServiceFactory - Spring factory for service instantiation
  • Integration with GraphQL engine

Test Coverage

Smoke Tests

  • document_test.py (410 lines) - End-to-end document lifecycle tests
  • document_draft_test.py (326 lines) - Draft creation, merging, and workflows
  • document_change_history_test.py (281 lines) - Timeline and change tracking

Unit Tests

  • DocumentServiceTest.java (486 lines) - Service layer business logic
  • GraphQL resolver tests for all mutations and queries
  • DocumentMapperTest.java - Type mapping validation
  • DocumentInfoChangeEventGeneratorTest.java - Timeline event generation

Key Features & Use Cases

  1. Knowledge Base Management - Create and organize internal documentation, FAQs, tutorials, and runbooks
  2. Asset Documentation - Link documents to data assets for enriched context
  3. Draft Workflows - Work on document updates without publishing immediately
  4. Hierarchical Organization - Structure documents in parent-child relationships
  5. Semantic Search - Find relevant documents through hybrid search
  6. Change Tracking - Full audit history of all document modifications
  7. Third-Party Integration Ready - Source field supports ingesting external docs (Confluence, Notion, etc.)

This PR lays the foundation for DataHub to become a central knowledge hub, combining first-party documentation with data asset management in a unified platform.

Coming in a follow-up PR:

  • Add browse paths for docs, enabling us to replicate hierarchical structures from other systems.
  • Add the "container" story for docs. One option is to define a parent container type as a Dataset entity (e.g. Dataset = Collection of Documents), which itself sits within a container.
  • Models for document-level lineage, and UI support for creating document-level lineage links.
  • Support Document Tags, Glossary Terms, and inclusion in Data Products

Status

Ready for review.

@github-actions github-actions bot added the product, devops, and smoke_test labels Nov 4, 2025
@jjoyce0510 jjoyce0510 marked this pull request as ready for review November 5, 2025 22:06
@datahub-cyborg datahub-cyborg bot added the needs-review label Nov 5, 2025
@abedatahub abedatahub self-requested a review November 6, 2025 16:48