Bootstrapping an LLM Evaluation Data Set

LLMs, RAG

Author: Branden Collingsworth
Published: July 13, 2025

When you own proprietary content, public benchmarks provide little insight into real-world performance. This example shows how to turn your documents into an evaluation data set that is tailored to your domain.

1. Ingest the Corpus and Create Traceable Chunks

  1. Load & fingerprint

    • Give every source file a stable hash (e.g., SHA-1) and store metadata such as title, date, and owner.
  2. Chunk intelligently

    • Fixed size: 300-600 tokens is a sweet spot for GPT-class models.
    • Or semantic: split on headings, bullet blocks, or \n\n using a recursive text splitter.
    • Overlap 10-20 % so that answers spanning boundaries are still inside one chunk.
  3. Persist

    • Table structure: {doc_id, chunk_id, text, start_pos, end_pos}.
    • Drop the chunks into a vector store now; you’ll reuse the embeddings later.

Fine-grained chunks keep questions specific, support exact citation, and allow for precise error analysis.
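
A minimal sketch of the fingerprint-and-persist step, assuming the table schema above; the chunks.jsonl filename and chunk_id format are illustrative, and the offset lookup assumes the chunk text appears verbatim in the document (adjust if your splitter normalizes whitespace):

import hashlib, json


def sha1(txt: str) -> str:
    return hashlib.sha1(txt.encode()).hexdigest()


def persist_chunks(doc_text: str, chunks: list[str], out_path: str = "chunks.jsonl") -> None:
    # One stable fingerprint per source document keeps every chunk traceable.
    doc_id = sha1(doc_text)
    search_from = 0
    with open(out_path, "a", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            start = doc_text.find(chunk, search_from)
            end = start + len(chunk) if start != -1 else -1
            f.write(json.dumps({
                "doc_id": doc_id,
                "chunk_id": f"{doc_id[:8]}-{i:04d}",
                "text": chunk,
                "start_pos": start,
                "end_pos": end,
            }) + "\n")
            if start != -1:
                search_from = start  # chunks overlap, so only advance to this chunk's start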

2. Generate Synthetic Question–Answer Pairs

SYSTEM  You are an expert quiz writer. Given the chunk below, generate up to 3
        non-trivial questions answerable *only* from this chunk. Return JSON:
        [{question, answer, support_lines}], where support_lines are quoted lines.

USER    <chunk_text>
  • Store results as {chunk_id, q_id, question, answer, support}.

Synthetic QAs are fast and cheap, but the support_lines field gives you a handle to audit faithfulness.
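
A small sketch of flattening the model's JSON output into that row format; the qa_raw.jsonl filename and q_id scheme are illustrative:

import json


def store_qas(chunk_id: str, qas: list[dict], out_path: str = "qa_raw.jsonl") -> None:
    # One row per generated question, keyed back to its source chunk.
    with open(out_path, "a", encoding="utf-8") as f:
        for i, qa in enumerate(qas):
            f.write(json.dumps({
                "chunk_id": chunk_id,
                "q_id": f"{chunk_id}-q{i}",
                "question": qa["question"],
                "answer": qa["answer"],
                "support": qa["support_lines"],
            }) + "\n")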

3. Automated QA Sanity-Check

Run two filters:

  1. String & logic rules

    • Answer string must appear in support_lines.
    • Every support_line must exist in the original chunk.
  2. LLM fact-checker pass

SYSTEM  Fact-check strictly. Is the answer fully supported by the chunk?
        Respond JSON: {valid: true/false, reason: "..."}.

USER    {"chunk": <chunk_text>, "question": "...", "answer": "...", "support": "..."}

Reject anything flagged false. Tools like Ragas or TruLens wrap this in a single call if you want off-the-shelf metrics such as “faithfulness” and “answer_relevancy.”

Automated filters prune obvious hallucinations before humans ever see them.
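
One way to wire up both filters, sketched against the row format from Step 2 and the OpenAI chat API; the gpt-4o-mini model name is an assumption, and the fact-checker assumes the model returns bare JSON (add a retry or a response-format constraint in practice):

import json
from openai import OpenAI

client = OpenAI()

FACT_CHECK_PROMPT = (
    "Fact-check strictly. Is the answer fully supported by the chunk? "
    'Respond JSON: {"valid": true/false, "reason": "..."}.'
)


def passes_string_rules(row: dict, chunk_text: str) -> bool:
    lines = row["support"] if isinstance(row["support"], list) else [row["support"]]
    # Rule 1: the answer string must appear inside the quoted support lines.
    if row["answer"].lower() not in " ".join(lines).lower():
        return False
    # Rule 2: every support line must exist verbatim in the original chunk.
    return all(line in chunk_text for line in lines)


def passes_fact_check(row: dict, chunk_text: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FACT_CHECK_PROMPT},
            {"role": "user", "content": json.dumps({
                "chunk": chunk_text,
                "question": row["question"],
                "answer": row["answer"],
                "support": row["support"],
            })},
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content).get("valid", False)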

4. Curate the Gold Split

  • Keep only rows that pass both checks.
  • Save as versioned JSONL (qa_v1.jsonl) so your experiments are reproducible.
  • Aim for at least 1,000 QA pairs per domain-specific function you plan to evaluate (search, RAG, agent step, etc.); a minimal filter-and-save sketch follows below.
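
A minimal filter-and-save sketch; the rule_pass and llm_pass flags are hypothetical column names for the outcomes of the two checks in Step 3:

import json


def save_gold_split(rows: list[dict], path: str = "qa_v1.jsonl") -> None:
    # Keep only rows that passed both the rule-based and LLM fact-check filters.
    gold = [r for r in rows if r.get("rule_pass") and r.get("llm_pass")]
    with open(path, "w", encoding="utf-8") as f:
        for r in gold:
            f.write(json.dumps(r) + "\n")
    print(f"Kept {len(gold)} of {len(rows)} QA pairs -> {path}")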

5. Run the Target Model

SYSTEM  Answer the question *using only the provided context*. If not present,
        reply "NOT IN DOC".

USER    {"context": <chunk_text>, "question": <question>}

Log prediction, token counts, latency.

You can run multiple models or prompt variants; the dataset doubles as an A/B framework.
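
A sketch of one evaluation run, assuming the OpenAI chat API; the model name is illustrative, and the logged fields are the prediction, token counts, and latency mentioned above:

import json
import time

from openai import OpenAI

client = OpenAI()

ANSWER_PROMPT = (
    "Answer the question *using only the provided context*. "
    'If not present, reply "NOT IN DOC".'
)


def run_model(row: dict, chunk_text: str, model: str = "gpt-4o-mini") -> dict:
    t0 = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ANSWER_PROMPT},
            {"role": "user", "content": json.dumps(
                {"context": chunk_text, "question": row["question"]})},
        ],
        temperature=0,
    )
    return {
        "q_id": row["q_id"],
        "model": model,
        "prediction": resp.choices[0].message.content.strip(),
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.time() - t0, 3),
    }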

6. Score the Predictions

Metric | When to use | How
Exact / F1 match | Extractive answers | pred == gold, or token-level F1
Semantic similarity | Paraphrased answers | SBERT cosine > 0.8 counts as correct
Faithfulness | High-stakes use | LLM grader: “Does the answer quote or paraphrase supported text?”

Aggregate by document type, chunk length, or model version to spot regime changes.
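
Sketches of the first two metrics; the semantic check uses the sentence-transformers library with all-MiniLM-L6-v2 as one reasonable default, not a requirement:

from collections import Counter

from sentence_transformers import SentenceTransformer, util

_sbert = SentenceTransformer("all-MiniLM-L6-v2")


def token_f1(pred: str, gold: str) -> float:
    # Token-level overlap, SQuAD-style, for extractive answers.
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def semantic_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    # Cosine similarity of sentence embeddings; 0.8 is the cut-off suggested above.
    emb = _sbert.encode([pred, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() > threshold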

7. Targeted Manual Audit

Audit three buckets:

  1. A random sample of passes (2–5 %), to catch silent QA issues.
  2. All failures, to determine whether the model or the dataset is wrong.
  3. Outliers (slowest responses, longest answers), which often signal prompt mistakes.

A quick Streamlit UI that shows chunk → gold QA → model answer → grader flag with a one-click “override” is enough.
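
A bare-bones version of that UI; the qa_v1_results.jsonl filename and field names are assumptions about how gold rows, predictions, and grader flags were merged into one file:

import json

import streamlit as st

PATH = "qa_v1_results.jsonl"
rows = [json.loads(line) for line in open(PATH, encoding="utf-8")]

idx = st.number_input("Row", 0, len(rows) - 1, 0)
row = rows[idx]

st.subheader("Chunk")
st.write(row["chunk_text"])
st.subheader("Gold QA")
st.write(f"Q: {row['question']}")
st.write(f"A: {row['answer']}")
st.subheader("Model answer")
st.write(row["prediction"])
st.subheader("Grader flag")
st.write(row["grade"])

if st.button("Override grade"):
    # Flip the grader's verdict and persist the human override back to disk.
    row["grade"] = not row["grade"]
    with open(PATH, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")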

8. Iterate

  • If the gold answer is invalid, regenerate or drop.
  • If grading is unfairly strict/lenient, tweak the grader prompt and re-run.
  • If the model stumbles on certain patterns (e.g., tables, footnotes), feed more such chunks back into Step 2.

Minimal Code Example

from openai import OpenAI
import hashlib, json


def sha1(txt: str) -> str:
    # Stable fingerprint for a source document or chunk.
    return hashlib.sha1(txt.encode()).hexdigest()


def split_into_chunks(text, tokens=400, overlap=80):
    # Fixed-size chunks with overlap, using word count as a rough token proxy.
    words = text.split()
    step = tokens - overlap
    for i in range(0, len(words), step):
        yield " ".join(words[i:i + tokens])


client = OpenAI()

SYS_QA_PROMPT = (
    "You are an expert quiz writer. Given the chunk below, generate up to 3 "
    "non-trivial questions answerable *only* from this chunk. Return JSON: "
    "[{question, answer, support_lines}], where support_lines are quoted lines."
)


def make_qas(chunk):
    # Generate synthetic QA pairs for a single chunk (Step 2).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYS_QA_PROMPT},
            {"role": "user", "content": chunk}
        ],
        temperature=0.3
    )
    # Assumes the model returns bare JSON; add a retry or response_format if needed.
    return json.loads(resp.choices[0].message.content)


# iterate over docs, split, generate, validate → save gold.jsonl