Parsing Blocks

Parsing blocks extract structured data from text output, typically the text content produced by LLM blocks. This page covers three blocks: TagParserBlock for XML/HTML tag extraction, RegexParserBlock for regex pattern extraction, and JSONParserBlock for JSON parsing and field expansion.

All parsing blocks operate on pandas DataFrames. They take a single input column of text and produce one or more output columns of extracted values.


TagParserBlock

Parses text content using start/end tags. This is the recommended approach for extracting structured fields from LLM output that uses XML-style or custom delimiters.

Configuration

Parameter            Type                 Default   Description
block_name           str                  required  Unique identifier for this block instance
input_cols           str | list[str]      required  Single input column containing text to parse
output_cols          list[str]            required  Output column names, one per tag pair
start_tags           list[str]            required  Start tags for extraction
end_tags             list[str]            required  End tags for extraction
parser_cleanup_tags  Optional[list[str]]  None      Tags to remove from extracted content

The number of start/end tag pairs must equal the number of output columns. Exactly one input column is required. Both start_tags and end_tags accept a single string (auto-wrapped into a list) or a list of strings.
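
The constraints above can be sketched in plain Python. This is only an illustration of the rule, not the block's actual validation code; the helper names normalize_tags and check_tag_config are hypothetical and do not exist in sdg_hub.

```python
def normalize_tags(tags):
    """Wrap a single tag string into a one-element list; copy lists unchanged."""
    return [tags] if isinstance(tags, str) else list(tags)


def check_tag_config(start_tags, end_tags, output_cols):
    """Return normalized (start, end) lists, or raise ValueError on a count mismatch.

    Hypothetical helper: it mirrors the documented rule that the number of
    start/end tag pairs must equal the number of output columns.
    """
    start = normalize_tags(start_tags)
    end = normalize_tags(end_tags)
    if not (len(start) == len(end) == len(output_cols)):
        raise ValueError(
            f"expected {len(output_cols)} tag pairs, got "
            f"{len(start)} start tags and {len(end)} end tags"
        )
    return start, end


start, end = check_tag_config("<answer>", "</answer>", ["answer"])
print(start, end)  # ['<answer>'] ['</answer>']
```

Passing two tag pairs with a single output column would raise ValueError in this sketch; the exact exception type the real block raises is not stated here.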

When the input column contains a list of strings instead of a single string, the block processes each list item and aggregates extracted values into lists.
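
A rough sketch of the list-handling idea, using only the standard library. It assumes that values from all list items are collected into one flat list and that surrounding whitespace is stripped; the real block may aggregate differently. The function names extract_tag and extract_from_cell are hypothetical.

```python
import re


def extract_tag(text, start_tag, end_tag):
    """Return every substring found between start_tag and end_tag."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]


def extract_from_cell(cell, start_tag, end_tag):
    """Handle a cell that is either one string or a list of strings."""
    if isinstance(cell, str):
        return extract_tag(cell, start_tag, end_tag)
    collected = []
    for item in cell:
        # Assumption: matches from every list item end up in one flat list.
        collected.extend(extract_tag(item, start_tag, end_tag))
    return collected


cell = ["<answer>Paris</answer>", "no tags here", "<answer>Rome</answer>"]
print(extract_from_cell(cell, "<answer>", "</answer>"))  # ['Paris', 'Rome']
```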

Python Example

from sdg_hub.core.blocks import TagParserBlock
import pandas as pd

parser = TagParserBlock(
    block_name="extract_qa",
    input_cols="llm_content",
    output_cols=["question", "answer"],
    start_tags=["<question>", "<answer>"],
    end_tags=["</question>", "</answer>"],
)

dataset = pd.DataFrame({
    "llm_content": [
        "<question>What is Python?</question>\n<answer>A programming language.</answer>",
        "<question>What is AI?</question>\n<answer>Artificial intelligence.</answer>",
    ]
})

result = parser.generate(dataset)
# result["question"] -> ["What is Python?", "What is AI?"]
# result["answer"]   -> ["A programming language.", "Artificial intelligence."]

Cleanup Tags

Remove unwanted markup from extracted content:

from sdg_hub.core.blocks import TagParserBlock

parser = TagParserBlock(
    block_name="clean_extract",
    input_cols="llm_content",
    output_cols=["answer"],
    start_tags=["<answer>"],
    end_tags=["</answer>"],
    parser_cleanup_tags=["```", "**", "###"],
)
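
To make the effect of parser_cleanup_tags concrete, here is a minimal sketch that treats each cleanup tag as a literal substring to delete. Both the literal-substring behavior and the final whitespace strip are assumptions about the block, and strip_cleanup_tags is a hypothetical helper.

```python
def strip_cleanup_tags(value, cleanup_tags):
    """Delete every occurrence of each cleanup tag, then trim whitespace."""
    for tag in cleanup_tags:
        value = value.replace(tag, "")
    return value.strip()


raw = "### **Paris** is the capital.\n```"
print(strip_cleanup_tags(raw, ["```", "**", "###"]))  # Paris is the capital.
```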

Multiple Matches

When the text contains multiple occurrences of a tag pair, each match becomes a separate row in the output:

from sdg_hub.core.blocks import TagParserBlock
import pandas as pd

parser = TagParserBlock(
    block_name="extract_items",
    input_cols="llm_content",
    output_cols=["item"],
    start_tags=["<item>"],
    end_tags=["</item>"],
)

dataset = pd.DataFrame({
    "llm_content": [
        "<item>First</item> <item>Second</item> <item>Third</item>"
    ]
})

result = parser.generate(dataset)
# Produces 3 rows, one for each <item>
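
The row expansion can be pictured with plain dictionaries standing in for DataFrame rows. This sketch assumes the other columns of the source row are repeated on every output row, which the text above does not state; explode_matches and the doc_id column are hypothetical.

```python
import re


def explode_matches(rows, input_col, output_col, start_tag, end_tag):
    """Emit one output row per tag match, copying the source row each time."""
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    expanded = []
    for row in rows:
        for value in re.findall(pattern, row[input_col], flags=re.DOTALL):
            # Assumption: non-parsed columns are duplicated onto each new row.
            expanded.append({**row, output_col: value.strip()})
    return expanded


rows = [{"doc_id": 1,
         "llm_content": "<item>First</item> <item>Second</item> <item>Third</item>"}]
expanded = explode_matches(rows, "llm_content", "item", "<item>", "</item>")
print([r["item"] for r in expanded])  # ['First', 'Second', 'Third']
```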

YAML Example

blocks:
  - block_type: "TagParserBlock"
    block_config:
      block_name: "extract_qa"
      input_cols: "llm_content"
      output_cols:
        - "question"
        - "answer"
      start_tags:
        - "<question>"
        - "<answer>"
      end_tags:
        - "</question>"
        - "</answer>"
      parser_cleanup_tags:
        - "```"

RegexParserBlock

Parses text content using regex patterns with capture groups. Use this when extraction patterns do not follow a simple tag structure.

Configuration

Parameter            Type                 Default   Description
block_name           str                  required  Unique identifier for this block instance
input_cols           str | list[str]      required  Single input column containing text to parse
output_cols          list[str]            required  Output column names, one per capture group
parsing_pattern      str                  required  Regex pattern with capture groups
parser_cleanup_tags  Optional[list[str]]  None      Tags to remove from extracted content

Exactly one input column is required. The regex is applied with re.DOTALL, so . also matches newlines. Capture groups map to output columns in order: the first group fills the first output column, the second group fills the second, and so on. A pattern with a single capture group therefore needs only one output column.
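
The effect of re.DOTALL and the group-to-column mapping can be checked with the standard re module alone. The snippet below does not use RegexParserBlock; the sample text and the zip-based mapping are illustrative assumptions.

```python
import re

text = "Score: 8\nReason: Clear and accurate.\nIt also covers edge cases."
pattern = r"Score:\s*(\d+)\s*Reason:\s*(.+)"

with_dotall = re.search(pattern, text, flags=re.DOTALL)
without_dotall = re.search(pattern, text)

# Pair each capture group with the output column at the same position.
output_cols = ["score", "reason"]
row = dict(zip(output_cols, with_dotall.groups()))

print(row["score"])             # 8
print(repr(row["reason"]))      # 'Clear and accurate.\nIt also covers edge cases.'
print(without_dotall.group(2))  # Clear and accurate.
```

Because . crosses line breaks under re.DOTALL, a greedy (.+) runs to the end of the text. The examples on this page use a lazy (.+?) followed by (?:\n|$) to stop at the first line break instead.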

When the input column contains a list of strings, the block processes each item and aggregates results.

Python Example

from sdg_hub.core.blocks import RegexParserBlock
import pandas as pd

parser = RegexParserBlock(
    block_name="extract_answer",
    input_cols="llm_content",
    output_cols=["answer"],
    parsing_pattern=r"Answer:\s*(.+?)(?:\n|$)",
)

dataset = pd.DataFrame({
    "llm_content": [
        "Reasoning: AI is broad.\nAnswer: Artificial Intelligence is a field of CS.\n",
        "Let me explain.\nAnswer: Machine learning enables pattern recognition.\n",
    ]
})

result = parser.generate(dataset)
# result["answer"] -> ["Artificial Intelligence is a field of CS.",
#                       "Machine learning enables pattern recognition."]

Input / Output:

llm_content (input)                                                           answer (output)
Reasoning: AI is broad.\nAnswer: Artificial Intelligence is a field of CS.\n  Artificial Intelligence is a field of CS.
Let me explain.\nAnswer: Machine learning enables pattern recognition.\n      Machine learning enables pattern recognition.

Multiple Capture Groups

from sdg_hub.core.blocks import RegexParserBlock
import pandas as pd

parser = RegexParserBlock(
    block_name="extract_score_reason",
    input_cols="llm_content",
    output_cols=["score", "reason"],
    parsing_pattern=r"Score:\s*(\d+)\s*Reason:\s*(.+?)(?:\n|$)",
)

dataset = pd.DataFrame({
    "llm_content": [
        "Score: 8\nReason: Clear and accurate explanation.",
    ]
})

result = parser.generate(dataset)
# result["score"]  -> ["8"]
# result["reason"] -> ["Clear and accurate explanation."]

YAML Example

blocks:
  - block_type: "RegexParserBlock"
    block_config:
      block_name: "extract_answer"
      input_cols: "llm_content"
      output_cols:
        - "answer"
      parsing_pattern: "Answer:\\s*(.+?)(?:\\n|$)"
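
The doubled backslashes are needed because, inside a YAML double-quoted scalar, \\ is the escape for one literal backslash. Python's non-raw string literals follow the same convention for \\, so the equivalence with the raw-string pattern from the Python example can be checked directly. This is a small sanity check, not part of sdg_hub.

```python
import re

# Same characters as the YAML value "Answer:\\s*(.+?)(?:\\n|$)" after unescaping.
from_yaml = "Answer:\\s*(.+?)(?:\\n|$)"
raw_pattern = r"Answer:\s*(.+?)(?:\n|$)"

print(from_yaml == raw_pattern)  # True

match = re.search(from_yaml, "Let me explain.\nAnswer: 42\n", flags=re.DOTALL)
print(match.group(1))  # 42
```

A single-quoted YAML scalar leaves backslashes untouched, so writing the pattern in single quotes with single backslashes should be an equivalent alternative.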

JSONParserBlock

Parses JSON from text and expands fields into separate columns. Handles JSON embedded in surrounding text and fixes common LLM output issues such as trailing commas.

Configuration

Parameter            Type             Default   Description
block_name           str              required  Unique identifier for this block instance
input_cols           str | list[str]  required  Single input column containing JSON text to parse
output_cols          list[str]        []        Optional list of specific JSON fields to extract. If empty, all fields are extracted.
field_prefix         str              ""        Prefix to add to extracted column names
fix_trailing_commas  bool             True      Whether to fix trailing commas in JSON (common LLM output issue)
extract_embedded     bool             True      Whether to extract JSON embedded in surrounding text
drop_input           bool             False     Whether to drop the input column after extraction

Exactly one input column is required. When extract_embedded=True, the block finds JSON by locating the first { and last } in the text. JSON arrays are wrapped into {"items": [...]}. Non-dict/non-list JSON values are wrapped into {"value": ...}.
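
The rules above can be approximated with the standard json and re modules. This is a sketch under assumptions, not the block's implementation: the order of steps (fix commas, try a direct parse, then fall back to the first { and last }), the comma-fixing regex, and the name parse_llm_json are all hypothetical.

```python
import json
import re


def parse_llm_json(text, fix_trailing_commas=True, extract_embedded=True):
    """Parse JSON from LLM text; return a dict, or None if nothing parses."""
    candidate = text.strip()
    if fix_trailing_commas:
        # Naive fix: drop a comma that directly precedes } or ].
        # It would also touch such a sequence inside a string value.
        candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        value = json.loads(candidate)
    except json.JSONDecodeError:
        if not extract_embedded:
            return None
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            value = json.loads(candidate[start:end + 1])
        except json.JSONDecodeError:
            return None
    if isinstance(value, dict):
        return value
    if isinstance(value, list):
        return {"items": value}   # arrays are wrapped
    return {"value": value}       # scalars are wrapped


print(parse_llm_json('Sure! Here is the data:\n{"key": "value",}\nHope this helps!'))
# {'key': 'value'}
print(parse_llm_json("[1, 2, 3,]"))  # {'items': [1, 2, 3]}
print(parse_llm_json("42"))          # {'value': 42}
```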

Python Example

from sdg_hub.core.blocks import JSONParserBlock
import pandas as pd

parser = JSONParserBlock(
    block_name="parse_json",
    input_cols="llm_content",
    output_cols=["topic", "summary"],
    drop_input=True,
)

dataset = pd.DataFrame({
    "llm_content": [
        'Here is the result: {"topic": "AI", "summary": "AI is transforming industries."}',
        '{"topic": "ML", "summary": "ML learns from data."}',
    ]
})

result = parser.generate(dataset)
# result["topic"]   -> ["AI", "ML"]
# result["summary"] -> ["AI is transforming industries.", "ML learns from data."]
# The "llm_content" column is dropped because drop_input=True

Extract All Fields

When output_cols is empty (or not set), all JSON fields become columns:

from sdg_hub.core.blocks import JSONParserBlock

parser = JSONParserBlock(
    block_name="parse_all",
    input_cols="llm_content",
    field_prefix="parsed_",
)

# If the JSON is {"name": "Alice", "age": 30}, the output has columns
# "parsed_name" and "parsed_age"
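
Field selection and prefixing can be sketched as a dictionary transformation. The helper expand_fields is hypothetical, and two behaviors are assumptions: that the prefix is also applied when output_cols is given, and that a requested field missing from the JSON becomes None.

```python
def expand_fields(record, output_cols=(), field_prefix=""):
    """Pick fields from a parsed JSON object and build prefixed column names."""
    # An empty output_cols means "take every field in the record".
    keys = list(output_cols) if output_cols else list(record)
    # Assumption: a requested field that is absent from the record becomes None.
    return {f"{field_prefix}{key}": record.get(key) for key in keys}


record = {"name": "Alice", "age": 30}
print(expand_fields(record, field_prefix="parsed_"))
# {'parsed_name': 'Alice', 'parsed_age': 30}
print(expand_fields(record, output_cols=["name", "email"]))
# {'name': 'Alice', 'email': None}
```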

Handling Embedded JSON

With extract_embedded=True (default), the block extracts JSON even when the LLM wraps it in explanatory text:

from sdg_hub.core.blocks import JSONParserBlock
import pandas as pd

parser = JSONParserBlock(
    block_name="embedded_json",
    input_cols="llm_content",
    extract_embedded=True,
    fix_trailing_commas=True,
)

dataset = pd.DataFrame({
    "llm_content": [
        'Sure! Here is the data:\n{"key": "value",}\nHope this helps!',
    ]
})

result = parser.generate(dataset)
# The trailing comma is fixed, and the JSON is extracted from surrounding text
# result["key"] -> ["value"]

YAML Example

blocks:
  - block_type: "JSONParserBlock"
    block_config:
      block_name: "parse_json"
      input_cols: "llm_content"
      output_cols:
        - "topic"
        - "summary"
      field_prefix: ""
      fix_trailing_commas: true
      extract_embedded: true
      drop_input: true

Choosing a Parsing Block

Block             Best For                                                     Input Format
TagParserBlock    XML-style or custom tag delimiters (<tag>...</tag>)          Structured text with consistent delimiters
RegexParserBlock  Flexible patterns, key-value extraction, line-based formats  Text with identifiable patterns but no XML tags
JSONParserBlock   JSON output from LLMs, structured data extraction            Text containing JSON objects
