Teaching LLMs to Understand Code Repositories Using Synthetic Knowledge Data
Abstract
Large language models (LLMs) often struggle to answer questions grounded in complex, domain-specific codebases, particularly when deployment in private environments restricts users to smaller, open-weight models. These limitations stem from insufficient context understanding, weak reasoning capabilities, and poor alignment with repository-specific terminology and architecture. In this work, we present a synthetic data-driven framework for knowledge infusion built on sdg_hub, an open-source tool developed to generate high-quality, document-grounded training data. Our method uses teacher LLMs to transform curated technical sources, including annotated code, example notebooks, and code documentation, into synthetic question–answer pairs and reasoning traces. We fine-tune Qwen 3 family models on this data. This targeted fine-tuning significantly improves the models' factual accuracy and reasoning performance on repository-specific tasks. Our results demonstrate that small LLMs, when properly customized, can serve as capable domain experts, complementing retrieval-augmented generation (RAG) pipelines while operating in secure, cost-efficient deployments.
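To make the generation step concrete, the sketch below illustrates the core idea in plain Python rather than through sdg_hub's actual interface: a teacher model is prompted to produce question–answer pairs grounded in a chunk of documentation. The OpenAI client usage is real, but the model name, prompt wording, and the generate_qa_pairs helper are illustrative assumptions, not the paper's implementation.

```python
# Conceptual sketch (not sdg_hub's actual API): prompting a teacher LLM to
# produce document-grounded Q&A pairs from a documentation chunk.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(doc_chunk: str, n_pairs: int = 3) -> list[dict]:
    """Ask a teacher model for Q&A pairs answerable from doc_chunk alone."""
    prompt = (
        f"Read the following documentation excerpt and write {n_pairs} "
        "question-answer pairs that are fully answerable from it. "
        'Return only a JSON list of objects with "question" and "answer" keys.\n\n'
        f"{doc_chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model, an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would validate the output and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)
```

In the full framework, such pairs, together with reasoning traces, would be aggregated into the fine-tuning dataset for the student model.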