Parsing Blocks
Parsing blocks extract structured data from text output, typically the text content produced by LLM blocks. This page covers three blocks: TagParserBlock for XML/HTML tag extraction, RegexParserBlock for regex pattern extraction, and JSONParserBlock for JSON parsing and field expansion.
All parsing blocks operate on pandas DataFrames. They take a single input column of text and produce one or more output columns of extracted values.
TagParserBlock
Parses text content using start/end tags. This is the recommended approach for extracting structured fields from LLM output that uses XML-style or custom delimiters.
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_name | str | required | Unique identifier for this block instance |
| input_cols | str or list[str] | required | Single input column containing text to parse |
| output_cols | list[str] | required | Output column names, one per tag pair |
| start_tags | list[str] | required | Start tags for extraction |
| end_tags | list[str] | required | End tags for extraction |
| parser_cleanup_tags | Optional[list[str]] | None | Tags to remove from extracted content |
The number of start/end tag pairs must equal the number of output columns. Exactly one input column is required. Both start_tags and end_tags accept a single string (auto-wrapped into a list) or a list of strings.
When the input column contains a list of strings instead of a single string, the block processes each list item and aggregates extracted values into lists.
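The configuration rules and the tag matching described above can be sketched in plain Python. This is a minimal illustration, not the TagParserBlock implementation: the helper names as_list, pair_tags, and extract_between are hypothetical, and details such as whitespace handling or how unterminated tags are treated are assumptions of the sketch.

```python
def as_list(tags):
    """Wrap a single string into a one-element list; copy any other sequence."""
    return [tags] if isinstance(tags, str) else list(tags)


def pair_tags(output_cols, start_tags, end_tags):
    """Return (column, start_tag, end_tag) triples, or raise if the counts differ."""
    start_tags, end_tags = as_list(start_tags), as_list(end_tags)
    if not (len(output_cols) == len(start_tags) == len(end_tags)):
        raise ValueError("need one start tag and one end tag per output column")
    return list(zip(output_cols, start_tags, end_tags))


def extract_between(text, start_tag, end_tag):
    """Collect every substring enclosed by start_tag ... end_tag, left to right."""
    if not start_tag or not end_tag:
        raise ValueError("tags must be non-empty")
    found, pos = [], 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        content_start = start + len(start_tag)
        end = text.find(end_tag, content_start)
        if end == -1:
            break  # assumption: an unterminated tag ends the scan
        found.append(text[content_start:end])
        pos = end + len(end_tag)
    return found


# A single string per tag argument is accepted and wrapped into a list.
triples = pair_tags(["answer"], "<answer>", "</answer>")
# Repeated tag pairs yield one extracted value per occurrence.
items = extract_between("<item>First</item> <item>Second</item>", "<item>", "</item>")
print(triples)
print(items)
```

The count check mirrors the rule that each output column needs exactly one start/end tag pair; a mismatch is reported before any text is parsed.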
Python Example
from sdg_hub.core.blocks import TagParserBlock
import pandas as pd

parser = TagParserBlock(
    block_name="extract_qa",
    input_cols="llm_content",
    output_cols=["question", "answer"],
    start_tags=["<question>", "<answer>"],
    end_tags=["</question>", "</answer>"],
)

dataset = pd.DataFrame({
    "llm_content": [
        "<question>What is Python?</question>\n<answer>A programming language.</answer>",
        "<question>What is AI?</question>\n<answer>Artificial intelligence.</answer>",
    ]
})

result = parser.generate(dataset)
# result["question"] -> ["What is Python?", "What is AI?"]
# result["answer"] -> ["A programming language.", "Artificial intelligence."]

Cleanup Tags
Remove unwanted markup from extracted content:
from sdg_hub.core.blocks import TagParserBlock

parser = TagParserBlock(
    block_name="clean_extract",
    input_cols="llm_content",
    output_cols=["answer"],
    start_tags=["<answer>"],
    end_tags=["</answer>"],
    parser_cleanup_tags=["```", "**", "###"],
)

Multiple Matches
When the text contains multiple occurrences of a tag pair, each match becomes a separate row in the output:
from sdg_hub.core.blocks import TagParserBlock
import pandas as pd

parser = TagParserBlock(
    block_name="extract_items",
    input_cols="llm_content",
    output_cols=["item"],
    start_tags=["<item>"],
    end_tags=["</item>"],
)

dataset = pd.DataFrame({
    "llm_content": [
        "<item>First</item> <item>Second</item> <item>Third</item>"
    ]
})

result = parser.generate(dataset)
# Produces 3 rows, one for each <item>

YAML Example
blocks:
  - block_type: "TagParserBlock"
    block_config:
      block_name: "extract_qa"
      input_cols: "llm_content"
      output_cols:
        - "question"
        - "answer"
      start_tags:
        - "<question>"
        - "<answer>"
      end_tags:
        - "</question>"
        - "</answer>"
      parser_cleanup_tags:
        - "```"

RegexParserBlock
Parses text content using regex patterns with capture groups. Use this when extraction patterns do not follow a simple tag structure.
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_name | str | required | Unique identifier for this block instance |
| input_cols | str or list[str] | required | Single input column containing text to parse |
| output_cols | list[str] | required | Output column names, one per capture group |
| parsing_pattern | str | required | Regex pattern with capture groups |
| parser_cleanup_tags | Optional[list[str]] | None | Tags to remove from extracted content |
Exactly one input column is required. The regex is applied with re.DOTALL, so . also matches newline characters. Capture groups map to output columns in order: a pattern with several capture groups needs one output column per group, and a pattern with a single capture group needs exactly one output column.
When the input column contains a list of strings, the block processes each item and aggregates results.
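The effect of re.DOTALL and the group-to-column mapping can be checked with the standard library alone. The sketch below uses re.findall purely for illustration; whether RegexParserBlock calls findall internally is not stated here, and the sample strings are made up.

```python
import re

text = "Answer: line one\nline two\nEND"
pattern = r"Answer:\s*(.+?)END"

# Without DOTALL, "." stops at the first newline, so the pattern cannot reach "END".
without_dotall = re.findall(pattern, text)
# With DOTALL, "." also matches "\n" and the capture spans both lines.
with_dotall = re.findall(pattern, text, re.DOTALL)

# With two capture groups, findall returns one tuple per match;
# each tuple position lines up with one output column.
output_cols = ["score", "reason"]
pairs = re.findall(
    r"Score:\s*(\d+)\s*Reason:\s*(.+?)(?:\n|$)",
    "Score: 8\nReason: Clear.\nScore: 3\nReason: Vague.",
    re.DOTALL,
)
rows = [dict(zip(output_cols, groups)) for groups in pairs]

print(without_dotall)
print(with_dotall)
print(rows)
```

Note that captured values are always strings, so a numeric field such as the score still needs an explicit conversion downstream.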
Python Example
from sdg_hub.core.blocks import RegexParserBlock
import pandas as pd

parser = RegexParserBlock(
    block_name="extract_answer",
    input_cols="llm_content",
    output_cols=["answer"],
    parsing_pattern=r"Answer:\s*(.+?)(?:\n|$)",
)

dataset = pd.DataFrame({
    "llm_content": [
        "Reasoning: AI is broad.\nAnswer: Artificial Intelligence is a field of CS.\n",
        "Let me explain.\nAnswer: Machine learning enables pattern recognition.\n",
    ]
})

result = parser.generate(dataset)
# result["answer"] -> ["Artificial Intelligence is a field of CS.",
#                      "Machine learning enables pattern recognition."]

Input / Output:
| llm_content (input) | answer (output) |
|---|---|
| Reasoning: AI is broad.\nAnswer: Artificial Intelligence is a field of CS.\n | Artificial Intelligence is a field of CS. |
| Let me explain.\nAnswer: Machine learning enables pattern recognition.\n | Machine learning enables pattern recognition. |
Multiple Capture Groups
from sdg_hub.core.blocks import RegexParserBlock
import pandas as pd

parser = RegexParserBlock(
    block_name="extract_score_reason",
    input_cols="llm_content",
    output_cols=["score", "reason"],
    parsing_pattern=r"Score:\s*(\d+)\s*Reason:\s*(.+?)(?:\n|$)",
)

dataset = pd.DataFrame({
    "llm_content": [
        "Score: 8\nReason: Clear and accurate explanation.",
    ]
})

result = parser.generate(dataset)
# result["score"] -> ["8"]
# result["reason"] -> ["Clear and accurate explanation."]

YAML Example
blocks:
  - block_type: "RegexParserBlock"
    block_config:
      block_name: "extract_answer"
      input_cols: "llm_content"
      output_cols:
        - "answer"
      parsing_pattern: "Answer:\\s*(.+?)(?:\\n|$)"

JSONParserBlock
Parses JSON from text and expands fields into separate columns. Handles JSON embedded in surrounding text and fixes common LLM output issues such as trailing commas.
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| block_name | str | required | Unique identifier for this block instance |
| input_cols | str or list[str] | required | Single input column containing JSON text to parse |
| output_cols | list[str] | [] | Optional list of specific JSON fields to extract. If empty, all fields are extracted. |
| field_prefix | str | "" | Prefix to add to extracted column names |
| fix_trailing_commas | bool | True | Whether to fix trailing commas in JSON (common LLM output issue) |
| extract_embedded | bool | True | Whether to extract JSON embedded in surrounding text |
| drop_input | bool | False | Whether to drop the input column after extraction |
Exactly one input column is required. When extract_embedded=True, the block finds JSON by locating the first { and last } in the text. JSON arrays are wrapped into {"items": [...]}. Non-dict/non-list JSON values are wrapped into {"value": ...}.
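The extraction and wrapping rules above can be approximated with the standard json and re modules. This is a rough sketch under stated assumptions, not the JSONParserBlock source: the function name parse_llm_json is hypothetical, the trailing-comma regex is a naive stand-in that would also rewrite a comma followed by a closing bracket inside a string value, and unparseable input simply raises here.

```python
import json
import re


def parse_llm_json(text, fix_trailing_commas=True, extract_embedded=True):
    """Parse JSON from LLM text and always return a dict, following the rules above."""
    candidate = text
    if extract_embedded:
        # Keep everything from the first "{" through the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            candidate = text[start:end + 1]
    if fix_trailing_commas:
        # Naive fix (assumption): drop a comma that directly precedes "}" or "]".
        candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    value = json.loads(candidate)
    if isinstance(value, dict):
        return value
    if isinstance(value, list):
        return {"items": value}  # arrays are wrapped under "items"
    return {"value": value}  # scalars are wrapped under "value"


embedded = parse_llm_json('Sure! Here is the data:\n{"key": "value",}\nHope this helps!')
array = parse_llm_json("[1, 2, 3]")
scalar = parse_llm_json("42")
print(embedded, array, scalar)
```

Because the span runs from the first { to the last }, text that contains two separate JSON objects would be treated as one candidate string and fail to parse in this sketch.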
Python Example
from sdg_hub.core.blocks import JSONParserBlock
import pandas as pd

parser = JSONParserBlock(
    block_name="parse_json",
    input_cols="llm_content",
    output_cols=["topic", "summary"],
    drop_input=True,
)

dataset = pd.DataFrame({
    "llm_content": [
        'Here is the result: {"topic": "AI", "summary": "AI is transforming industries."}',
        '{"topic": "ML", "summary": "ML learns from data."}',
    ]
})

result = parser.generate(dataset)
# result["topic"] -> ["AI", "ML"]
# result["summary"] -> ["AI is transforming industries.", "ML learns from data."]
# The "llm_content" column is dropped because drop_input=True

Extract All Fields
When output_cols is empty (or not set), all JSON fields become columns:
from sdg_hub.core.blocks import JSONParserBlock

parser = JSONParserBlock(
    block_name="parse_all",
    input_cols="llm_content",
    field_prefix="parsed_",
)
# If the JSON is {"name": "Alice", "age": 30}, the output has columns
# "parsed_name" and "parsed_age"

Handling Embedded JSON
With extract_embedded=True (default), the block extracts JSON even when the LLM wraps it in explanatory text:
from sdg_hub.core.blocks import JSONParserBlock
import pandas as pd

parser = JSONParserBlock(
    block_name="embedded_json",
    input_cols="llm_content",
    extract_embedded=True,
    fix_trailing_commas=True,
)

dataset = pd.DataFrame({
    "llm_content": [
        'Sure! Here is the data:\n{"key": "value",}\nHope this helps!',
    ]
})

result = parser.generate(dataset)
# The trailing comma is fixed, and the JSON is extracted from surrounding text
# result["key"] -> ["value"]

YAML Example
blocks:
  - block_type: "JSONParserBlock"
    block_config:
      block_name: "parse_json"
      input_cols: "llm_content"
      output_cols:
        - "topic"
        - "summary"
      field_prefix: ""
      fix_trailing_commas: true
      extract_embedded: true
      drop_input: true

Choosing a Parsing Block
| Block | Best For | Input Format |
|---|---|---|
| TagParserBlock | XML-style or custom tag delimiters (<tag>...</tag>) | Structured text with consistent delimiters |
| RegexParserBlock | Flexible patterns, key-value extraction, line-based formats | Text with identifiable patterns but no XML tags |
| JSONParserBlock | JSON output from LLMs, structured data extraction | Text containing JSON objects |
Next Steps
- LLM Blocks -- language model interaction and prompt building
- Transform Blocks -- data manipulation and column operations
- Filtering Blocks -- quality control and row filtering