Teaching LLMs to Understand Code Repositories Using Synthetic Knowledge Data
Abstract
Large language models (LLMs) often struggle to answer questions grounded in complex, domain-specific codebases, particularly when deployment in private environments restricts users to smaller, open-weight models. These limitations stem from insufficient context understanding, weak reasoning capabilities, and poor alignment with repository-specific terminology and architecture. In this work, we present a synthetic data-driven framework for knowledge infusion built on sdg_hub, an open-source tool developed to generate high-quality, document-grounded training data. Our method uses teacher LLMs to transform curated technical sources, including annotated code, example notebooks, and code documentation, into synthetic question–answer pairs and reasoning traces. We fine-tune Qwen 3 family models on this data. This targeted fine-tuning significantly improves the models' factual accuracy and reasoning performance on repository-specific tasks. Our results demonstrate that small LLMs, when properly customized, can serve as capable domain experts, complementing retrieval-augmented generation (RAG) pipelines while operating in secure, cost-efficient deployments.
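To make the generation step concrete, the sketch below illustrates the core idea in plain Python rather than through sdg_hub's actual interface: a teacher model is prompted to produce question–answer pairs grounded in a chunk of documentation. The OpenAI client usage is real, but the model name, prompt wording, and the generate_qa_pairs helper are illustrative assumptions, not the paper's implementation.

```python
# Conceptual sketch (not sdg_hub's actual API): prompting a teacher LLM to
# produce document-grounded Q&A pairs from a documentation chunk.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(doc_chunk: str, n_pairs: int = 3) -> list[dict]:
    """Ask a teacher model for Q&A pairs answerable from doc_chunk alone."""
    prompt = (
        f"Read the following documentation excerpt and write {n_pairs} "
        "question-answer pairs that are fully answerable from it. "
        'Return only a JSON list of objects with "question" and "answer" keys.\n\n'
        f"{doc_chunk}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model, an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would validate the output and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)
```

In the full framework, such pairs, together with reasoning traces, would be aggregated into the fine-tuning dataset for the student model.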