Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. While multiple RAG benchmarks exist for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments.
In this work, we present DRAGON (Dynamic RAG benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian. DRAGON is built upon a regularly updated corpus of Russian news and public documents, and supports comprehensive evaluation of both retriever and generator components. We release all code and data to foster further research on retrieval-augmented generation and launch a public leaderboard.
Figure 1: Architecture of the DRAGON benchmark system.
Retrieval-Augmented Generation (RAG) has become a powerful instrument for enhancing the domain adaptation and factuality of large language models (LLMs) by incorporating external knowledge retrieved at inference time. This approach enables more up-to-date and grounded responses without the need for costly re-training. As RAG-based systems gain traction in applications such as open-domain question answering, customer support, scientific assistance, and enterprise search, rigorous and standardized evaluation becomes crucial for assessing their reliability, utility, and safety.
While several RAG benchmarks have been proposed for the English language, the situation for other languages, including Russian, is significantly less developed.
In this work, we focus on the evaluation of RAG systems for the Russian language. We introduce DRAGON (Dynamic RAG benchmark On News), a novel benchmark that reflects realistic usage patterns by leveraging a regularly updated knowledge base derived from current news sources. To foster transparency and community engagement, we publicly release the code, a modular evaluation suite, and a dynamic leaderboard to track progress on RAG-based systems in Russian.
Our contributions are as follows:
The benchmark is designed to evaluate retrieval-augmented generation (RAG) systems in a realistic, dynamically evolving news domain. Its architecture prioritizes modularity, automation, and reproducibility while addressing the core challenges in the RAG evaluation landscape.
The overall pipeline of the benchmark architecture is shown in Figure 1.
We maintain a dedicated set of parsers that continuously crawl a curated selection of news sources (rambler.ru, championat.com, afisha.ru, among others) on a daily basis. Newly parsed content is synchronized into cloud-based S3-like object storage. To avoid redundancy and ensure incremental updates, a scheduled automated job identifies differences with the previous dataset revision and extracts the updated segments for downstream processing.
This design ensures the benchmark reflects evolving real-world distributions and mitigates the risks of overfitting to static datasets. The pipeline further ensures that newly surfaced topics and entities from the news stream are constantly incorporated into the benchmark.
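For illustration, the increment between two corpus revisions can be computed by hashing each article's content and keeping only records unseen in the previous revision. The sketch below is a simplified stand-in for the scheduled job; the field names and hashing scheme are illustrative, not the actual pipeline code.

```python
import hashlib
import json

def article_key(article: dict) -> str:
    """Identify an article by a hash of its URL and text (illustrative scheme)."""
    payload = article["url"] + "\n" + article["text"]
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def compute_increment(previous_revision: list[dict], current_revision: list[dict]) -> list[dict]:
    """Return only the articles that are new or changed since the previous revision."""
    seen = {article_key(a) for a in previous_revision}
    return [a for a in current_revision if article_key(a) not in seen]

if __name__ == "__main__":
    prev = [{"url": "https://rambler.ru/example-1", "text": "old news item"}]
    curr = prev + [{"url": "https://championat.com/example-2", "text": "fresh news item"}]
    print(json.dumps(compute_increment(prev, curr), ensure_ascii=False, indent=2))
```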
The process of creating questions and answers from the updated increment of news data is described in detail in the Methodology section. These question-answer pairs form the core of the benchmark. Once this step is complete, we produce four Hugging Face datasets:
- public texts, in which each document carries a public_id to enable matching without exposing the true internal IDs;
- public questions, which reference texts only by public_id to obfuscate alignment and encourage retrieval;
- private answers, keyed by the internal id;
- a private mapping between each internal id and its corresponding public_id.

All datasets are versioned and uploaded to Hugging Face with incrementally updated revision numbers. This versioning mechanism enables reproducibility and provides users with stable snapshots for further experimentation.
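Because every release is pinned to a revision, a specific snapshot can be reproduced with the standard datasets API. The repository name, revision tag, and split below are placeholders; the actual identifiers are published alongside each benchmark release.

```python
from datasets import load_dataset

# Hypothetical repository id and revision tag; the real ones are published
# with each benchmark release.
REPO = "dragon-bench/public-questions"
REVISION = "v1.0.0"

# Pinning `revision` yields a stable snapshot, so results remain comparable
# even after the benchmark is refreshed with newer news data.
questions = load_dataset(REPO, revision=REVISION, split="train")
print(questions[0])
```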
To facilitate seamless evaluation, we provide a PyPI-hosted Python library, rag_bench, which covers the full user workflow: loading the public data, applying a custom RAG pipeline, and collecting results in the following form:
{
"0": {
"found_ids": [17, 69, 22, ...],
"model_answer": "На Головинских прудах"
},
...
}
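A minimal sketch of how such a results dictionary might be assembled from a user's own pipeline is shown below; `retrieve` and `generate` are stand-ins for the user's retriever and generator, and the enumeration-based keys mirror the snippet above.

```python
import json

def build_results(questions, retrieve, generate, top_k: int = 5) -> dict:
    """Collect retrieval and generation outputs in the submission format."""
    results = {}
    for idx, item in enumerate(questions):
        docs = retrieve(item["question"], top_k=top_k)   # user's retriever: returns ranked documents
        answer = generate(item["question"], docs)        # user's generator: returns an answer string
        results[str(idx)] = {
            "found_ids": [doc["public_id"] for doc in docs],
            "model_answer": answer,
        }
    return results

# Trivial stand-in components, only to show the data flow:
questions = [{"question": "Где проходил фестиваль?"}]
retrieve = lambda q, top_k: [{"public_id": 17}, {"public_id": 69}, {"public_id": 22}]
generate = lambda q, docs: "На Головинских прудах"
print(json.dumps(build_results(questions, retrieve, generate), ensure_ascii=False, indent=2))
```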
These results encode both the retrieved public_id values and the generated answers, decoupling the user's model output from any private evaluation artifacts. This separation allows secure evaluation without exposing ground-truth data.
Submitted results are then sent to the Validation Portal, a Flask-based backend with a single-page application written in Vue as its frontend, which performs secure evaluation using the private datasets. The portal maps the submitted public_id values back to internal ids and scores the retrieved documents and generated answers against the private ground truth. Importantly, users submit only their results; all ground-truth data remains internal.
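The portal-side scoring can be pictured as the simplified logic below: translate the submitted public ids back to internal ids via the private mapping, then compare them with the relevant documents and reference answers. The metric choices here (recall and exact match) are illustrative, not a specification of the actual evaluation suite.

```python
def evaluate_submission(results, public_to_internal, gold_docs, gold_answers):
    """Illustrative portal-side scoring over a submission dictionary.

    public_to_internal: public_id -> internal id (private mapping dataset)
    gold_docs:          question key -> set of relevant internal ids
    gold_answers:       question key -> reference answer string
    """
    recalls, matches = [], []
    for key, entry in results.items():
        # Map submitted public ids back to internal ids; unknown ids are ignored.
        found = {public_to_internal[pid] for pid in entry["found_ids"] if pid in public_to_internal}
        relevant = gold_docs[key]
        recalls.append(len(found & relevant) / max(len(relevant), 1))
        # A crude answer check; the real suite can use richer generation metrics.
        matches.append(entry["model_answer"].strip().lower() == gold_answers[key].strip().lower())
    n = max(len(results), 1)
    return {"retrieval_recall": sum(recalls) / n, "answer_exact_match": sum(matches) / n}
```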
A Hugging Face Gradio Space serves as the public Leaderboard. Results are committed to a version-controlled file that is automatically updated by the validation portal upon approval.
To reduce latency and improve benchmarking coverage, we support automatic evaluation for selected pre-approved models (e.g., LLaMA variants). These are computed via the same rag_bench client, but using auto_submit mode, authenticated with a privileged API token.
Figure 2: Leaderboard interface.
Given the dynamic nature of the benchmark, versioning plays a critical role in ensuring meaningful comparisons. Each evaluation result is tied to a specific dataset revision. On the leaderboard, users can view results for a particular dataset revision as well as each model's best-known score across revisions.
This hybrid strategy balances fairness (by avoiding cross-version leakage) and practicality (by reflecting each model's best-known performance).
The Data Generation pipeline (Figure 3) consists of two stages: KG Extraction and Question Generation. KG Extraction retrieves factual information from texts and preserves the most specific and recent facts in the form of a knowledge graph. The Question Generation module samples subgraphs of certain structures and uses an LLM to generate a question-answer pair for each.
Figure 3: Architecture of the Data Generation pipeline.
To enable greater control over question generation, we developed a Knowledge Graph Extraction module. Specifically, we prompt an LLM to extract (subject, relation, object) triplets. The language model then refines the extracted relations, choosing the most plausible candidates. To maintain the novelty of the resulting knowledge graph, we discard any relations found in Wikidata, operating under the assumption that facts not included in Wikidata are novel and thus unlikely to have been seen by the language model.
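This novelty filter can be thought of as a set difference over normalized triplets. The sketch below abstracts the Wikidata lookup into a pre-computed set and is not the module's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str

def normalize(t: Triplet) -> tuple[str, str, str]:
    """Case-fold the triplet so surface-form variation does not break matching."""
    return (t.subject.casefold(), t.relation.casefold(), t.obj.casefold())

def filter_novel(extracted: list[Triplet], wikidata_triplets: set[Triplet]) -> list[Triplet]:
    """Keep only triplets absent from Wikidata, treated as novel facts.

    How `wikidata_triplets` is populated (e.g. via SPARQL lookups against the
    public endpoint) is outside the scope of this sketch.
    """
    known = {normalize(t) for t in wikidata_triplets}
    return [t for t in extracted if normalize(t) not in known]
```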
The Question Generation stage starts with subgraph extraction: all possible subgraphs of particular shapes are sampled. There are four distinct types of questions, corresponding to four subgraph structures.
Each subgraph is passed to the language model, which is instructed to generate both a natural language question and the corresponding answer. The question is formulated as a coherent sentence, while the answer consists of one or more entities derived from the subgraph.
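As an illustration of this stage, the sketch below enumerates one simple subgraph shape (a two-hop chain) from a networkx graph whose edges carry a `relation` attribute and renders it into a question-generation prompt; both the shape and the prompt wording are examples, not the exact templates used by the benchmark.

```python
import networkx as nx

def two_hop_chains(graph: nx.DiGraph):
    """Yield two-hop chains (a -> b -> c) as lists of (subject, relation, object) triplets."""
    for a, b in graph.edges():
        for c in graph.successors(b):
            if c not in (a, b):
                yield [(a, graph.edges[a, b]["relation"], b),
                       (b, graph.edges[b, c]["relation"], c)]

def build_prompt(subgraph) -> str:
    """Render the sampled subgraph as facts for the LLM to turn into a QA pair."""
    facts = "\n".join(f"{s} | {r} | {o}" for s, r, o in subgraph)
    return (
        "Using only the facts below, write one natural-language question and its answer.\n"
        "The answer must be one or more entities from the facts.\n"
        f"Facts:\n{facts}"
    )

# Minimal usage example with a toy graph:
g = nx.DiGraph()
g.add_edge("Фестиваль", "Головинские пруды", relation="прошёл в")
g.add_edge("Головинские пруды", "Москва", relation="находятся в")
for sg in two_hop_chains(g):
    print(build_prompt(sg))
```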