DRAGON: Dynamic RAG benchmark On News

Fedor Chernogorskii1,2, Sergei Averkiev1, Liliya Kudraleeva2,
Zaven Martirosian1,3, Maria Tikhonova1,4,
Valentin Malykh2,5, Alena Fenogenova1,4
1SberAI 2ITMO 3MISIS 4HSE 5MWS AI

Abstract

Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. While multiple RAG benchmarks exist for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments.

In this work, we present DRAGON (Dynamic RAG benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian. DRAGON is built upon a regularly updated corpus of Russian news and public documents, and supports comprehensive evaluation of both retriever and generator components. We release all code and data to foster further research on retrieval-augmented generation and launch a public leaderboard.


Figure 1: Architecture of the DRAGON benchmark system.

Introduction

Retrieval-Augmented Generation (RAG) has become a powerful instrument for enhancing the domain adaptation and factuality of large language models (LLMs) by incorporating external knowledge retrieved at inference time. This approach enables more up-to-date and grounded responses without the need for costly re-training. As RAG-based systems gain traction in applications such as open-domain question answering, customer support, scientific assistance, and enterprise search, rigorous and standardized evaluation becomes crucial for assessing their reliability, utility, and safety.

While several RAG benchmarks have been proposed for the English language, the situation for other languages, including Russian, is significantly less developed.

In this work, we focus on the evaluation of RAG systems for the Russian language. We introduce DRAGON (Dynamic RAG Benchmark On News), a novel benchmark that reflects realistic usage patterns by leveraging a regularly updated knowledge base derived from current news sources. To foster transparency and community engagement, we publicly release the code, a modular evaluation suite, and a dynamic leaderboard to track progress on RAG-based systems in Russian.

Our contributions are as follows:

  • We propose DRAGON, the first RAG benchmark with a regularly updated knowledge base for the Russian language, designed to evaluate RAG systems in a dynamic setup.
  • We release an open-source evaluation pipeline, enabling reproducible experimentation and easy integration of new models and retrieval components.
  • We launch a public leaderboard to support reproducible and community-driven research.

System Demo

The benchmark is designed to evaluate retrieval-augmented generation (RAG) systems in a realistic, dynamically evolving news domain. Its architecture prioritizes modularity, automation, and reproducibility while addressing the core challenges in the RAG evaluation landscape.

Architecture of the Dynamic Benchmark

The full benchmark pipeline is shown in Figure 1.

Data Acquisition and Processing

We maintain a dedicated set of parsers that crawl a curated selection of news sources, including rambler.ru, championat.com, and afisha.ru, on a daily basis. Newly parsed content is synchronized to cloud-based S3-compatible object storage. To avoid redundancy and ensure incremental updates, a scheduled automated job identifies differences with the previous dataset revision and extracts the updated segments for downstream processing.
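For illustration, this increment-extraction step can be viewed as a content-hash diff between two dataset revisions. The sketch below is a simplification under assumed conventions (documents stored as JSON records with url and text fields), not the production schema:

import hashlib
import json

def content_hash(doc: dict) -> str:
    # Stable hash of the article text, used to detect changed documents.
    return hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()

def extract_increment(previous: list[dict], current: list[dict]) -> list[dict]:
    # Keep only documents that are new or whose text changed since the last revision.
    seen = {doc["url"]: content_hash(doc) for doc in previous}
    return [doc for doc in current if seen.get(doc["url"]) != content_hash(doc)]

if __name__ == "__main__":
    with open("revision_prev.json", encoding="utf-8") as f_prev, \
         open("revision_curr.json", encoding="utf-8") as f_curr:
        increment = extract_increment(json.load(f_prev), json.load(f_curr))
    print(f"{len(increment)} new or updated documents")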

This design ensures the benchmark reflects evolving real-world distributions and mitigates the risks of overfitting to static datasets. The pipeline further ensures that newly surfaced topics and entities from the news stream are constantly incorporated into the benchmark.

QA Dataset Formation

The process of creating questions and answers from each increment of news data is described in detail in the Methodology section. These question-answer pairs form the core of the benchmark. Once they are generated, we produce four Hugging Face datasets:

  • Public Texts: Contains cleaned source documents. Each item is assigned a public_id to enable matching without exposing the true internal IDs.
  • Public Questions: Contains only questions, indexed via public_id to obfuscate alignment and encourage retrieval.
  • Private Texts: Used for evaluation only; includes both the true internal id and the corresponding public_id.
  • Private QA: Provides the canonical ground truth for generative evaluation.

All datasets are versioned and uploaded to Hugging Face with incrementally updated revision numbers. This versioning mechanism allows reproducibility and provides users with stable snapshots for further experimentation.
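Because the datasets are hosted on Hugging Face, a pinned snapshot can be fetched with the standard datasets library. In the sketch below, the repository names and the revision tag are illustrative placeholders; the actual paths are resolved by the rag_bench client:

from datasets import load_dataset

# Repository names and the revision tag are placeholders for illustration.
public_texts = load_dataset("rag-bench/public-texts", revision="v1.2.0", split="train")
public_questions = load_dataset("rag-bench/public-questions", revision="v1.2.0", split="train")

# Both datasets expose public_id, which links questions to candidate documents.
print(public_texts[0]["public_id"], public_questions[0]["public_id"])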

User Experience

Client Library

To facilitate seamless evaluation, we provide a PyPI-hosted Python library, rag_bench, which offers an interface to:

  • Fetch the latest version of the public datasets by dynamically resolving the latest Hugging Face revision;
  • Inspect the baseline RAG system, which can be adapted to the user's own target system;
  • Evaluate RAG system and package results for submission;
  • Submit results via API to our evaluation portal.

The user workflow consists of loading the public data, applying a custom RAG pipeline, and collecting results in the following form:

{
    "0": {
        "found_ids": [17, 69, 22, ...],
        "model_answer": "На Головинских прудах"
    },
    ...
}

These results encode both the retrieved public_ids and the generated answers, decoupling the user's model output from any private evaluation artifacts. This separation allows secure evaluation without exposing ground-truth data.
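For illustration, a user-side loop that packages a custom pipeline's outputs into this format might look as follows. Here retrieve and generate are stand-ins for the user's own retriever and generator; they are not part of rag_bench:

import json

def build_submission(questions, retrieve, generate, top_k: int = 10) -> dict:
    # `questions` is an iterable of dicts with "public_id" and "text" fields;
    # `retrieve` returns a ranked list of public document ids for a query;
    # `generate` produces an answer string from the query and retrieved ids.
    results = {}
    for question in questions:
        found_ids = retrieve(question["text"])[:top_k]
        results[str(question["public_id"])] = {
            "found_ids": found_ids,
            "model_answer": generate(question["text"], found_ids),
        }
    return results

# The resulting dictionary can then be serialized and submitted via the client:
# json.dump(results, open("submission.json", "w", encoding="utf-8"), ensure_ascii=False)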

Validation Portal

Submitted results are then sent to the Validation Portal, a Flask-based backend with a single-page application frontend written in Vue, which performs secure evaluation using the private datasets. The portal:

  • Maps submitted public_ids to internal ids;
  • Computes retrieval metrics such as Mean Reciprocal Rank (MRR) and Hit@k;
  • Computes generative metrics;
  • Stores evaluation logs and version metadata;
  • Requires admin approval before publishing to the leaderboard, unless the results were produced via automatic submission.

Importantly, users submit only their results — all ground-truth data remains internal.
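The evaluation code itself is internal to the portal, but the two retrieval metrics are standard; a minimal reference computation looks like this:

def reciprocal_rank(found_ids: list[int], gold_ids: set[int]) -> float:
    # Reciprocal of the rank of the first relevant document, 0.0 if none is found.
    for rank, doc_id in enumerate(found_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def hit_at_k(found_ids: list[int], gold_ids: set[int], k: int) -> float:
    # 1.0 if any relevant document appears among the top-k results, else 0.0.
    return float(any(doc_id in gold_ids for doc_id in found_ids[:k]))

# Averaging over all questions yields MRR and Hit@k (toy example).
submissions = {"0": [17, 69, 22]}
gold = {"0": {69}}
mrr = sum(reciprocal_rank(ids, gold[q]) for q, ids in submissions.items()) / len(submissions)
hit5 = sum(hit_at_k(ids, gold[q], 5) for q, ids in submissions.items()) / len(submissions)
print(mrr, hit5)  # 0.5 1.0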

Leaderboard and Auto-Evaluation

A Hugging Face Gradio Space serves as the public Leaderboard. Results are committed to a version-controlled file that is automatically updated by the validation portal upon approval.

To reduce latency and improve benchmarking coverage, we support automatic evaluation for selected pre-approved models (e.g., LLaMA variants). These evaluations are run via the same rag_bench client, but in auto_submit mode, authenticated with a privileged API token.


Figure 2: Leaderboard interface.

Versioning Strategy

Given the dynamic nature of the benchmark, versioning plays a critical role in ensuring meaningful comparisons. Each evaluation result is tied to a specific dataset revision. On the leaderboard, users can:

  • View results from a single dataset version;
  • Toggle an "Actual Versions" mode to aggregate results across recent revisions;
  • Compare each model's most recent result against other models' latest submissions.

This hybrid strategy balances fairness (by avoiding cross-version leakage) and practicality (by reflecting each model's best-known performance).
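As a sketch of the "Actual Versions" aggregation, assume each leaderboard entry records a model name, a dataset revision, and a score (the field names and the lexicographic comparability of revision tags are assumptions made for illustration):

def latest_results(entries: list[dict]) -> dict[str, dict]:
    # For every model, keep only its submission on the most recent dataset revision.
    best = {}
    for entry in entries:
        current = best.get(entry["model"])
        if current is None or entry["revision"] > current["revision"]:
            best[entry["model"]] = entry
    return best

entries = [
    {"model": "baseline", "revision": "v1.1.0", "score": 0.42},
    {"model": "baseline", "revision": "v1.2.0", "score": 0.45},
    {"model": "llama-3-8b", "revision": "v1.1.0", "score": 0.51},
]
print(latest_results(entries))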

Methodology

The Data Generation pipeline (Figure 3) consists of two stages: KG Extraction and Question Generation. KG Extraction retrieves factual information from texts and preserves the most specific and most recent facts in the form of a Knowledge Graph. The Question Generation module samples subgraphs of a certain structure to generate question-answer pairs with an LLM.


Figure 3: Architecture of the Data Generation pipeline.

Knowledge Graph Extraction

To enable greater control over question generation, we developed a Knowledge Graph Extraction module. Specifically, we prompt an LLM to extract triplets (subject-relation-object). The language model then refines the extracted relations, choosing the most plausible candidates. To maintain the novelty of the resulting knowledge graph, we discard any relations found in Wikidata. We operate under the assumption that facts not included in Wikidata are novel and thus unlikely to have been seen by the language model.
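The paper leaves the Wikidata lookup itself abstract, so the sketch below reduces it to a membership test against a locally cached fact set; only the filtering logic is meant to be illustrative:

from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str

def relation_in_wikidata(triplet: Triplet, known_facts: set[tuple[str, str, str]]) -> bool:
    # Stand-in for the real check (entity linking plus property matching against Wikidata);
    # here it is reduced to membership in a pre-built fact cache.
    return (triplet.subject, triplet.relation, triplet.obj) in known_facts

def novel_triplets(extracted: list[Triplet], known_facts: set[tuple[str, str, str]]) -> list[Triplet]:
    # Discard any relation already recorded in Wikidata; the remainder is assumed novel.
    return [t for t in extracted if not relation_in_wikidata(t, known_facts)]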

Question Generation

The Question Generation stage starts with subgraph extraction: all possible subgraphs of particular shapes are sampled. There are four distinct types of questions, corresponding to the following subgraph structures:

  • Simple: a single triplet. The relation and one of its entities are used to construct the question; the remaining entity becomes the answer.
  • Set: several triplets sharing a relation and either the subject or the object. The question is generated from the shared entity and relation; the answer consists of all other entities in the subgraph.
  • Conditional: a pair of triplets intersecting at a single entity. The language model combines both facts to describe the repeated entity in the question; the repeated entity becomes the answer.
  • Multi-Hop: a pair of triplets intersecting at a single entity. The question is constructed similarly to a simple question, except that the repeated entity must not be mentioned in the question. It serves as a bridge entity, described in the question through a reference extracted from the other triplet.

Each subgraph is passed to the language model, which is instructed to generate both a natural-language question and a corresponding answer. The question is formulated as a coherent sentence, while the answer consists of one or more entities derived from the subgraph.
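A schematic of the subgraph sampling over the extracted triplets is given below; the exact shapes and filtering used in the pipeline may differ. Note that conditional and multi-hop questions share the same structural pattern (a pair of triplets with one shared entity) and differ only in how the question is phrased:

from collections import defaultdict
from itertools import combinations

Triple = tuple[str, str, str]  # (subject, relation, object)

def simple_subgraphs(triples: list[Triple]) -> list[tuple[Triple, ...]]:
    # Simple questions: every single triplet is a candidate subgraph.
    return [(t,) for t in triples]

def set_subgraphs(triples: list[Triple]) -> list[tuple[Triple, ...]]:
    # Set questions: triplets sharing a relation and either the subject or the object.
    groups: dict[tuple, list[Triple]] = defaultdict(list)
    for s, r, o in triples:
        groups[("subject", s, r)].append((s, r, o))
        groups[("object", o, r)].append((s, r, o))
    return [tuple(g) for g in groups.values() if len(g) > 1]

def pair_subgraphs(triples: list[Triple]) -> list[tuple[Triple, Triple]]:
    # Conditional and multi-hop questions: pairs of triplets sharing exactly one entity,
    # which serves as the answer (conditional) or the bridge entity (multi-hop).
    pairs = []
    for a, b in combinations(triples, 2):
        if len({a[0], a[2]} & {b[0], b[2]}) == 1:
            pairs.append((a, b))
    return pairs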

Citation

@article{dragon_news_rag_2025,
    title={DRAGON: Dynamic RAG Benchmark On News},
    author={Chernogorskii, Fedor and Averkiev, Sergei and Kudraleeva, Liliya and Martirosian, Zaven and Tikhonova, Maria and Malykh, Valentin and Fenogenova, Alena},
    year={2025},
    journal={arXiv preprint arXiv:},
    url={https://arxiv.org/abs/},
}