YAML Reference

Every flow is defined in a single flow.yaml file. This page documents every field, with types and defaults verified against source.


Top-Level Structure

A flow YAML has three top-level sections: metadata, parameters (optional), and blocks.

metadata:
  # Flow identity and configuration (see below)

parameters:
  # Optional runtime parameter definitions (see below)

blocks:
  # Ordered list of block definitions (see below)

The FlowValidator (source: src/sdg_hub/core/flow/validation.py) enforces:

  • blocks is required and must be a non-empty list.
  • Each block must have block_type and block_config keys.
  • Each block_config must contain block_name.
  • metadata, if present, must be a dict with a non-empty name string.
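The four rules above can be restated as a small standalone check. This is a minimal sketch, not the actual FlowValidator: the function name check_flow_structure and the error wording are hypothetical, and the input is assumed to be the already-parsed YAML as a plain dict.

```python
from typing import Any


def check_flow_structure(flow: dict[str, Any]) -> list[str]:
    """Return structural errors; an empty list means all four rules pass.

    Hypothetical helper that mirrors the documented rules, not the
    real FlowValidator implementation.
    """
    errors: list[str] = []

    # Rule 1: blocks is required and must be a non-empty list.
    blocks = flow.get("blocks")
    if not isinstance(blocks, list) or not blocks:
        errors.append("'blocks' is required and must be a non-empty list")
        blocks = []

    for i, block in enumerate(blocks):
        if not isinstance(block, dict):
            errors.append(f"block {i}: must be a mapping")
            continue
        # Rule 2: each block needs block_type and block_config.
        for key in ("block_type", "block_config"):
            if key not in block:
                errors.append(f"block {i}: missing '{key}'")
        # Rule 3: each block_config needs block_name.
        config = block.get("block_config")
        if isinstance(config, dict) and "block_name" not in config:
            errors.append(f"block {i}: block_config is missing 'block_name'")

    # Rule 4: metadata, if present, is a dict with a non-empty name string.
    if "metadata" in flow:
        metadata = flow["metadata"]
        name = metadata.get("name") if isinstance(metadata, dict) else None
        if not isinstance(name, str) or not name.strip():
            errors.append("'metadata' must be a dict with a non-empty 'name'")

    return errors
```

Collecting all errors in a list, rather than raising on the first one, lets a flow author fix every structural problem in one pass.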

metadata Section

Source: src/sdg_hub/core/flow/metadata.py -- class FlowMetadata

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | str | Yes | -- | Human-readable flow name. Minimum 1 character. |
| id | str | No | auto-generated | Unique identifier. Must be lowercase, alphanumeric with hyphens, no leading/trailing hyphens. Auto-generated from name if omitted. |
| description | str | No | "" | What the flow does. |
| version | str | No | "1.0.0" | Semantic version matching ^\d+\.\d+\.\d+(-[a-zA-Z0-9.-]+)?$. |
| author | str | No | "" | Author or contributor name. |
| license | str | No | "Apache-2.0" | License identifier. |
| tags | list[str] | No | [] | Tags for categorization. Automatically lowercased. |
| recommended_models | RecommendedModels | No | None | Model recommendations (see below). |
| dataset_requirements | DatasetRequirements | No | None | Input dataset validation rules (see below). |
| output_columns | list[str] | No | None | Columns to keep in final output. Original input columns are always preserved. When set, intermediate columns are dropped during and after execution. Must be non-empty if specified; omit entirely to keep all columns. |
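The field constraints in this table can be checked outside the library. The following is a sketch under stated assumptions: the version pattern is copied from the table, while ID_RE is only one possible reading of the id rule (lowercase alphanumerics and hyphens, no leading or trailing hyphen). The function names are hypothetical and this is not the FlowMetadata implementation.

```python
import re

# Version pattern copied verbatim from the metadata table.
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+(-[a-zA-Z0-9.-]+)?$")

# Assumed encoding of the id rule: lowercase letters, digits, and hyphens,
# with no hyphen at either end. The real validator may differ.
ID_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


def check_metadata_fields(metadata: dict) -> list[str]:
    """Return violations of the documented metadata constraints."""
    errors = []
    name = metadata.get("name")
    if not isinstance(name, str) or len(name) < 1:
        errors.append("name must be a string with at least 1 character")
    if "id" in metadata and not ID_RE.fullmatch(metadata["id"]):
        errors.append("id must be lowercase alphanumeric with inner hyphens only")
    # version defaults to "1.0.0" when omitted, as in the table.
    if not VERSION_RE.match(metadata.get("version", "1.0.0")):
        errors.append("version must look like 1.2.3 or 1.2.3-suffix")
    if "output_columns" in metadata and not metadata["output_columns"]:
        errors.append("output_columns must be non-empty if specified")
    return errors


def normalize_tags(tags: list[str]) -> list[str]:
    """Lowercase tags, matching the 'automatically lowercased' note."""
    return [tag.lower() for tag in tags]
```

For example, an id such as green-clay-812 passes the assumed pattern, while -Bad- fails on both the uppercase letter and the leading and trailing hyphens.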

recommended_models

Source: src/sdg_hub/core/flow/metadata.py -- class RecommendedModels

recommended_models:
  default: "openai/gpt-oss-120b"
  compatible:
    - "meta-llama/Llama-3.3-70B-Instruct"
    - "microsoft/phi-4"
  experimental:
    - "gpt-4o"

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| default | str | Yes | -- | Primary recommended model. Cannot be empty. |
| compatible | list[str] | No | [] | Models known to work well. |
| experimental | list[str] | No | [] | Models not extensively tested. |

Model selection priority: default first, then compatible in order, then experimental in order.
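The priority order can be illustrated with a short sketch. Both function names and the available argument (the set of models a caller can actually reach) are hypothetical; the library's own selection method is not shown on this page.

```python
from typing import Optional


def candidate_models(recommended: dict) -> list[str]:
    """List models in priority order: default, compatible, experimental."""
    return [
        recommended["default"],
        *recommended.get("compatible", []),
        *recommended.get("experimental", []),
    ]


def pick_model(recommended: dict, available: set[str]) -> Optional[str]:
    """Return the highest-priority recommended model that is available."""
    for model in candidate_models(recommended):
        if model in available:
            return model
    return None  # nothing in the recommendation list is available


recommended = {
    "default": "openai/gpt-oss-120b",
    "compatible": ["meta-llama/Llama-3.3-70B-Instruct", "microsoft/phi-4"],
    "experimental": ["gpt-4o"],
}
```

With only microsoft/phi-4 and gpt-4o available, pick_model returns microsoft/phi-4, because compatible models outrank experimental ones.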

dataset_requirements

Source: src/sdg_hub/core/flow/metadata.py -- class DatasetRequirements

dataset_requirements:
  required_columns:
    - "document"
    - "document_outline"
  optional_columns:
    - "metadata"
  min_samples: 1
  max_samples: 10000
  column_types:
    document: "string"
  description: "Input documents for processing"

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| required_columns | list[str] | No | [] | Columns that must be present. Flow fails if missing. |
| optional_columns | list[str] | No | [] | Columns that can enhance performance. |
| min_samples | int | No | 1 | Minimum row count. Must be >= 1. |
| max_samples | int | No | None | Maximum row count. Must be >= min_samples if set. |
| column_types | dict[str, str] | No | {} | Expected types for columns (documentation only). |
| description | str | No | "" | Human-readable description of requirements. |
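To make these rules concrete, here is a minimal sketch of a dataset check driven by the fields above. The function check_dataset and its arguments are hypothetical, and column_types is deliberately ignored because the table marks it as documentation only.

```python
def check_dataset(columns: list[str], num_rows: int, requirements: dict) -> list[str]:
    """Compare a dataset's columns and row count against dataset_requirements.

    Hypothetical helper; optional_columns and column_types are not enforced.
    """
    errors = []

    missing = [c for c in requirements.get("required_columns", []) if c not in columns]
    if missing:
        errors.append(f"missing required columns: {missing}")

    min_samples = requirements.get("min_samples", 1)  # documented default
    max_samples = requirements.get("max_samples")     # None means no upper bound
    if num_rows < min_samples:
        errors.append(f"dataset has {num_rows} rows, fewer than min_samples={min_samples}")
    if max_samples is not None and num_rows > max_samples:
        errors.append(f"dataset has {num_rows} rows, more than max_samples={max_samples}")

    return errors


requirements = {
    "required_columns": ["document", "document_outline"],
    "optional_columns": ["metadata"],
    "min_samples": 1,
    "max_samples": 10000,
}
```

Calling check_dataset(["document"], 0, requirements) reports two problems: the missing document_outline column and a row count below min_samples.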

parameters Section

Optional. Defines flow-level parameters that can be overridden at runtime. Each parameter is a key-value pair where the key is the parameter name.

parameters:
  temperature:
    type: "float"
    default: 0.7
    description: "Sampling temperature for generation"
    required: false

  max_tokens:
    type: "integer"
    default: 2048
    description: "Maximum token count"

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| default | any | Yes | Default value for the parameter. |
| type | str | No | Type hint (e.g., "string", "float", "integer"). |
| description | str | No | Human-readable description. |
| required | bool | No | Whether the parameter must be provided at runtime. |
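A sketch of how defaults and runtime overrides might combine is shown below. It is illustrative only: resolve_parameters is a hypothetical name, and the sketch assumes that required: true means the caller must pass a value explicitly even though a default is declared. The page does not spell out that interaction, so treat it as an assumption.

```python
from typing import Any, Optional


def resolve_parameters(definitions: dict, overrides: Optional[dict] = None) -> dict[str, Any]:
    """Merge runtime overrides over declared defaults (hypothetical helper)."""
    overrides = overrides or {}
    resolved: dict[str, Any] = {}
    for name, spec in definitions.items():
        if name in overrides:
            resolved[name] = overrides[name]
        elif spec.get("required", False):
            # Assumption: a required parameter is never filled from its default.
            raise ValueError(f"parameter '{name}' is required but was not provided")
        else:
            resolved[name] = spec.get("default")
    return resolved


definitions = {
    "temperature": {"type": "float", "default": 0.7, "required": False},
    "max_tokens": {"type": "integer", "default": 2048},
}
```

With no overrides, the two example parameters resolve to 0.7 and 2048; passing {"temperature": 0.2} changes only the temperature.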

blocks Section

An ordered list of block definitions. Each entry specifies a block type and its configuration.

blocks:
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_summary_prompt"
      input_cols:
        - "text"
      output_cols: "summary_prompt"
      prompt_config_path: "prompts/summarize.yaml"

  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate_summary"
      input_cols: "summary_prompt"
      output_cols: "raw_summary"
      max_tokens: 1024
      temperature: 0.3
      async_mode: true

Block entry fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| block_type | str | Yes | Class name of the block (must exist in BlockRegistry). Valid values: AgentBlock, AgentResponseExtractorBlock, ColumnValueFilterBlock, DuplicateColumnsBlock, IndexBasedMapperBlock, JSONParserBlock, JSONStructureBlock, LLMChatBlock, LLMResponseExtractorBlock, MCPAgentBlock, MeltColumnsBlock, PromptBuilderBlock, RegexParserBlock, RenameColumnsBlock, RowMultiplierBlock, SamplerBlock, TagParserBlock, TextConcatBlock, UniformColumnValueSetter. |
| block_config | dict | Yes | Configuration passed to the block constructor. |
| block_config.block_name | str | Yes | Unique name within the flow. |

The block_config contents depend on the block type. Common fields include input_cols, output_cols, and block-specific parameters. See the Block documentation for per-block configuration.
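Two of the constraints on block entries, a known block_type and a unique block_name, can be checked before a flow is loaded. The sketch below assumes the valid values listed in the table and does not consult the real BlockRegistry; check_block_entries is a hypothetical name.

```python
# Block type names copied from the block entry fields table; the real
# BlockRegistry is the authority and may contain more or different entries.
KNOWN_BLOCK_TYPES = {
    "AgentBlock", "AgentResponseExtractorBlock", "ColumnValueFilterBlock",
    "DuplicateColumnsBlock", "IndexBasedMapperBlock", "JSONParserBlock",
    "JSONStructureBlock", "LLMChatBlock", "LLMResponseExtractorBlock",
    "MCPAgentBlock", "MeltColumnsBlock", "PromptBuilderBlock",
    "RegexParserBlock", "RenameColumnsBlock", "RowMultiplierBlock",
    "SamplerBlock", "TagParserBlock", "TextConcatBlock",
    "UniformColumnValueSetter",
}


def check_block_entries(blocks: list[dict]) -> list[str]:
    """Flag unknown block types and duplicate block names."""
    errors = []
    seen_names = set()
    for i, block in enumerate(blocks):
        block_type = block.get("block_type")
        if block_type not in KNOWN_BLOCK_TYPES:
            errors.append(f"block {i}: unknown block_type {block_type!r}")
        name = block.get("block_config", {}).get("block_name")
        if name in seen_names:
            errors.append(f"block {i}: duplicate block_name {name!r}")
        seen_names.add(name)
    return errors
```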

Path resolution

Path fields in block_config (config_path, config_paths, prompt_config_path) are resolved relative to the directory containing the flow.yaml file. For example, if flow.yaml is at flows/my_flow/flow.yaml and a block specifies prompt_config_path: prompts/summary.yaml, it resolves to flows/my_flow/prompts/summary.yaml.
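The resolution rule amounts to joining each path field onto the directory that holds flow.yaml. Here is a minimal sketch using pathlib. The helper name resolve_block_paths is hypothetical, and treating config_paths as a list of strings is an assumption; the library may accept other shapes.

```python
from pathlib import Path

# Path-valued fields named in the paragraph above.
PATH_FIELDS = ("config_path", "config_paths", "prompt_config_path")


def resolve_block_paths(block_config: dict, flow_yaml_path: str) -> dict:
    """Return a copy of block_config with path fields made relative to flow.yaml's directory."""
    base_dir = Path(flow_yaml_path).parent
    resolved = dict(block_config)
    for field in PATH_FIELDS:
        value = resolved.get(field)
        if isinstance(value, str):
            resolved[field] = (base_dir / value).as_posix()
        elif isinstance(value, list):  # assumed shape for config_paths
            resolved[field] = [(base_dir / item).as_posix() for item in value]
    return resolved


config = {"block_name": "build_summary_prompt", "prompt_config_path": "prompts/summary.yaml"}
resolved = resolve_block_paths(config, "flows/my_flow/flow.yaml")
```

Here resolved["prompt_config_path"] is flows/my_flow/prompts/summary.yaml, matching the example in the text, and the original config dict is left unchanged.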


Complete Annotated Example

This is the Structured Text Insights Extraction Flow, taken from src/sdg_hub/flows/text_analysis/structured_insights/flow.yaml:

metadata:
  id: green-clay-812
  name: "Structured Text Insights Extraction Flow"
  description: >-
    Multi-step pipeline for extracting structured insights from text including
    summary, keywords, entities, and sentiment analysis combined into a JSON output
  version: "1.0.0"
  author: "SDG Hub Contributors"
  recommended_models:
    default: "openai/gpt-oss-120b"
    compatible:
      - "meta-llama/Llama-3.3-70B-Instruct"
      - "microsoft/phi-4"
      - "mistralai/Mixtral-8x7B-Instruct-v0.1"
    experimental:
      - "gpt-4o"
  tags:
    - "text-analysis"
    - "summarization"
    - "nlp"
    - "structured-output"
    - "insights"
    - "sentiment-analysis"
    - "entity-extraction"
    - "keyword-extraction"
  license: "Apache-2.0"
  dataset_requirements:
    required_columns:
      - "text"
    description: >-
      Input dataset should contain text content for analysis. Each text should be
      substantial enough for meaningful analysis (minimum 50 words recommended).

blocks:
  # Step 1: Build a prompt for summary extraction
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_summary_prompt"
      input_cols:
        - "text"
      output_cols: "summary_prompt"
      prompt_config_path: "prompts/summarize.yaml"

  # Step 2: Generate the summary via LLM
  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate_summary"
      input_cols: "summary_prompt"
      output_cols: "raw_summary"
      max_tokens: 1024
      temperature: 0.3
      async_mode: true

  # Step 3: Extract the assistant message content
  - block_type: "LLMResponseExtractorBlock"
    block_config:
      block_name: "extract_summary"
      input_cols: "raw_summary"
      extract_content: true
      expand_lists: true

  # Step 4: Parse the summary from tagged output
  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse_summary"
      input_cols: "extract_summary_content"
      output_cols: "summary"
      start_tags:
        - "[SUMMARY]"
      end_tags:
        - "[/SUMMARY]"

  # Steps 5-8: Same pattern for keywords extraction
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_keywords_prompt"
      input_cols:
        - "text"
      output_cols: "keywords_prompt"
      prompt_config_path: "prompts/extract_keywords.yaml"
  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate_keywords"
      input_cols: "keywords_prompt"
      output_cols: "raw_keywords"
      max_tokens: 512
      temperature: 0.3
      async_mode: true
  - block_type: "LLMResponseExtractorBlock"
    block_config:
      block_name: "extract_keywords"
      input_cols: "raw_keywords"
      extract_content: true
      expand_lists: true
  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse_keywords"
      input_cols: "extract_keywords_content"
      output_cols: "keywords"
      start_tags:
        - "[KEYWORDS]"
      end_tags:
        - "[/KEYWORDS]"

  # Steps 9-12: Entities extraction (same pattern)
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_entities_prompt"
      input_cols:
        - "text"
      output_cols: "entities_prompt"
      prompt_config_path: "prompts/extract_entities.yaml"
  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate_entities"
      input_cols: "entities_prompt"
      output_cols: "raw_entities"
      max_tokens: 1024
      temperature: 0.3
      async_mode: true
  - block_type: "LLMResponseExtractorBlock"
    block_config:
      block_name: "extract_entities"
      input_cols: "raw_entities"
      extract_content: true
      expand_lists: true
  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse_entities"
      input_cols: "extract_entities_content"
      output_cols: "entities"
      start_tags:
        - "[ENTITIES]"
      end_tags:
        - "[/ENTITIES]"

  # Steps 13-16: Sentiment analysis (same pattern)
  - block_type: "PromptBuilderBlock"
    block_config:
      block_name: "build_sentiment_prompt"
      input_cols:
        - "text"
      output_cols: "sentiment_prompt"
      prompt_config_path: "prompts/analyze_sentiment.yaml"
  - block_type: "LLMChatBlock"
    block_config:
      block_name: "generate_sentiment"
      input_cols: "sentiment_prompt"
      output_cols: "raw_sentiment"
      max_tokens: 256
      temperature: 0.1
      async_mode: true
  - block_type: "LLMResponseExtractorBlock"
    block_config:
      block_name: "extract_sentiment"
      input_cols: "raw_sentiment"
      extract_content: true
      expand_lists: true
  - block_type: "TagParserBlock"
    block_config:
      block_name: "parse_sentiment"
      input_cols: "extract_sentiment_content"
      output_cols: "sentiment"
      start_tags:
        - "[SENTIMENT]"
      end_tags:
        - "[/SENTIMENT]"

  # Step 17: Combine all analyses into a JSON structure
  - block_type: "JSONStructureBlock"
    block_config:
      block_name: "create_structured_insights"
      input_cols:
        - "summary"
        - "keywords"
        - "entities"
        - "sentiment"
      output_cols:
        - "structured_insights"
      ensure_json_serializable: true

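This flow repeats one four-block pattern for each analysis (summary, keywords, entities, sentiment): PromptBuilderBlock builds the messages, LLMChatBlock generates the response, LLMResponseExtractorBlock extracts the assistant content, and TagParserBlock parses the tagged region. A final JSONStructureBlock combines the four parsed columns into structured_insights.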
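Because the same four blocks recur for each analysis, the block entries can be generated rather than written by hand. The helper below is a hypothetical sketch, not part of sdg_hub: it reproduces the naming used in the example (build_<name>_prompt, generate_<name>, extract_<name>, parse_<name>) and assumes, as the example suggests, that the parser reads a column named extract_<name>_content.

```python
def analysis_blocks(name: str, prompt_file: str, tag: str,
                    max_tokens: int = 1024, temperature: float = 0.3) -> list[dict]:
    """Build the four block entries for one analysis step (hypothetical helper)."""
    return [
        {"block_type": "PromptBuilderBlock",
         "block_config": {"block_name": f"build_{name}_prompt",
                          "input_cols": ["text"],
                          "output_cols": f"{name}_prompt",
                          "prompt_config_path": prompt_file}},
        {"block_type": "LLMChatBlock",
         "block_config": {"block_name": f"generate_{name}",
                          "input_cols": f"{name}_prompt",
                          "output_cols": f"raw_{name}",
                          "max_tokens": max_tokens,
                          "temperature": temperature,
                          "async_mode": True}},
        {"block_type": "LLMResponseExtractorBlock",
         "block_config": {"block_name": f"extract_{name}",
                          "input_cols": f"raw_{name}",
                          "extract_content": True,
                          "expand_lists": True}},
        {"block_type": "TagParserBlock",
         "block_config": {"block_name": f"parse_{name}",
                          # Column name mirrors the example; assumed convention.
                          "input_cols": f"extract_{name}_content",
                          "output_cols": name,
                          "start_tags": [f"[{tag}]"],
                          "end_tags": [f"[/{tag}]"]}},
    ]


summary_blocks = analysis_blocks("summary", "prompts/summarize.yaml", "SUMMARY")
```

The resulting dicts correspond to steps 1 through 4 of the example and could be serialized to YAML with any YAML library. The sentiment step would pass max_tokens=256 and temperature=0.1 to match the values shown above.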