Bootstrapping an LLM Evaluation Data Set
When you own proprietary content, public benchmarks provide little insight into real-world performance. This example shows how to turn your documents into an evaluation data set that is tailored to your domain.
1. Ingest the Corpus and Create Traceable Chunks
Load & fingerprint
- Give every source file a stable hash (e.g., SHA-1) and store metadata such as title, date, and owner.
Chunk intelligently
- Fixed size: 300-600 tokens is a sweet spot for GPT-class models.
- Or semantic: split on headings, bullet blocks, or `\n\n` using a recursive text splitter.
- Overlap 10-20% so that answers spanning boundaries are still inside one chunk.
Persist
- Table structure: `{doc_id, chunk_id, text, start_pos, end_pos}` (see the sketch below).
- Drop the chunks into a vector store now; you’ll reuse the embeddings later.
Fine-grained chunks keep questions specific, support exact citation, and allow for precise error analysis.
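For concreteness, here is a minimal persistence sketch using SQLite; the `chunks` table mirrors the structure above, and `persist_chunks` is an illustrative helper name rather than part of any library. Swap in whatever store you already use.

```python
import sqlite3

# Illustrative local store; columns mirror {doc_id, chunk_id, text, start_pos, end_pos}.
conn = sqlite3.connect("corpus.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        doc_id    TEXT,
        chunk_id  TEXT,
        text      TEXT,
        start_pos INTEGER,
        end_pos   INTEGER,
        PRIMARY KEY (doc_id, chunk_id)
    )
""")

def persist_chunks(doc_id: str, chunks: list[tuple[str, int, int]]) -> None:
    """chunks: list of (text, start_pos, end_pos) tuples for one document."""
    rows = [
        (doc_id, f"{doc_id}-{i:04d}", text, start, end)
        for i, (text, start, end) in enumerate(chunks)
    ]
    conn.executemany("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```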
2. Generate Synthetic Question–Answer Pairs
```
SYSTEM You are an expert quiz writer. Given the chunk below, generate up to 3
non-trivial questions answerable *only* from this chunk. Return JSON:
[{question, answer, support_lines}], where support_lines are quoted lines.
USER <chunk_text>
```
- Store results as `{chunk_id, q_id, question, answer, support}`.
Synthetic QAs are fast and cheap, but “support_lines” give you a handle to audit faithfulness.
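As a sketch of the bookkeeping (field names follow the schema above; `flatten_qas` and the `qa_raw.jsonl` filename are made up for illustration), each model response can be flattened into one JSONL row per question:

```python
import json

def flatten_qas(chunk_id: str, qas: list[dict], path: str = "qa_raw.jsonl") -> None:
    """Append one row per generated question, keyed back to its source chunk."""
    with open(path, "a", encoding="utf-8") as f:
        for i, qa in enumerate(qas):
            row = {
                "chunk_id": chunk_id,
                "q_id": f"{chunk_id}-q{i}",
                "question": qa["question"],
                "answer": qa["answer"],
                "support": qa["support_lines"],
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```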
3. Automated QA Sanity-Check
Run two filters:
String & logic rules
- Answer string must appear in `support_lines`.
- Every `support_line` must exist in the original chunk.
LLM fact-checker pass
```
SYSTEM Fact-check strictly. Is the answer fully supported by the chunk?
Respond JSON: {valid: true/false, reason: "..."}.
USER {"chunk": <chunk_text>, "question": "...", "answer": "...", "support": "..."}
```
Reject anything flagged `false`. Tools like Ragas or TruLens wrap this in a single call if you want off-the-shelf metrics such as “faithfulness” and “answer_relevancy.”
Automated filters prune obvious hallucinations before humans ever see them.
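Wired together, the two filters might look roughly like this (a sketch: the helper names are invented, and the fact-checker reuses the same OpenAI chat-completions client and `gpt-4o-mini` model as the code example at the end of this post):

```python
import json

def passes_string_rules(row: dict, chunk_text: str) -> bool:
    """Filter 1: cheap string & logic rules."""
    support = row["support"] if isinstance(row["support"], list) else [row["support"]]
    answer_in_support = any(row["answer"] in line for line in support)
    support_in_chunk = all(line in chunk_text for line in support)
    return answer_in_support and support_in_chunk

FACT_CHECK_PROMPT = (
    "Fact-check strictly. Is the answer fully supported by the chunk? "
    'Respond JSON: {"valid": true/false, "reason": "..."}.'
)

def passes_fact_check(client, row: dict, chunk_text: str) -> bool:
    """Filter 2: LLM fact-checker pass; reject anything flagged false."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FACT_CHECK_PROMPT},
            {"role": "user", "content": json.dumps({
                "chunk": chunk_text,
                "question": row["question"],
                "answer": row["answer"],
                "support": row["support"],
            })},
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content).get("valid", False)
```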
4. Curate the Gold Split
- Keep only rows that pass both checks.
- Save as versioned JSONL (`qa_v1.jsonl`) so your experiments are reproducible.
- Aim for at least 1k QA pairs per domain-specific function you plan to evaluate (search, RAG, agent step, etc.).
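A minimal sketch of the freeze step, assuming each row carries `passed_rules` / `passed_fact_check` flags from the previous step (those field names are illustrative):

```python
import json

def write_gold_split(rows: list[dict], version: str = "v1") -> str:
    """Keep only rows that passed both checks and freeze them as qa_<version>.jsonl."""
    gold = [r for r in rows if r.get("passed_rules") and r.get("passed_fact_check")]
    path = f"qa_{version}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for r in gold:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return path
```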
5. Run the Target Model
```
SYSTEM Answer the question *using only the provided context*. If not present,
reply "NOT IN DOC".
USER {"context": <chunk_text>, "question": <question>}
```
Log the prediction, token counts, and latency for each call.
You can run multiple models or prompt variants; the dataset doubles as an A/B framework.
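One way to sketch the evaluation call (the `run_model` helper is illustrative; the token counts come from the standard `usage` field on the chat-completions response):

```python
import json, time
from openai import OpenAI

client = OpenAI()

ANSWER_PROMPT = (
    "Answer the question *using only the provided context*. "
    'If not present, reply "NOT IN DOC".'
)

def run_model(model: str, chunk_text: str, question: str) -> dict:
    """One evaluation call: returns the prediction plus the logging fields."""
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ANSWER_PROMPT},
            {"role": "user", "content": json.dumps(
                {"context": chunk_text, "question": question})},
        ],
        temperature=0,
    )
    return {
        "model": model,
        "prediction": resp.choices[0].message.content,
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.time() - start, 3),
    }
```

Call it once per (model, question) pair and append each returned dict to a results JSONL.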
6. Score the Predictions
| Metric | When to use | How |
|---|---|---|
| Exact / F1 match | Extractive answers | `pred == gold` or token-level F1 |
| Semantic similarity | Paraphrased answers | SBERT cosine > 0.8 counts as correct |
| Faithfulness | High-stakes use | LLM grader: “Does answer quote or paraphrase supported text?” |
Aggregate by document type, chunk length, or model version to spot regime changes.
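For the first two rows of the table, a lightweight sketch (the SBERT part is shown commented out because it pulls in `sentence-transformers`; the 0.8 threshold is the one suggested above):

```python
def exact_match(pred: str, gold: str) -> bool:
    """Strict string match after trivial normalization."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 for extractive answers."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Semantic similarity: SBERT cosine, counting > 0.8 as correct.
# from sentence_transformers import SentenceTransformer, util
# sbert = SentenceTransformer("all-MiniLM-L6-v2")
# correct = util.cos_sim(sbert.encode(pred), sbert.encode(gold)).item() > 0.8
```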
7. Targeted Manual Audit
Audit three buckets:
- Random passes (2–5%): catch silent QA issues.
- All failures: determine whether the model or the dataset is wrong.
- Outliers (slowest responses, longest answers): these often signal prompt mistakes.
A quick Streamlit UI that shows chunk → gold QA → model answer → grader flag with a one-click “override” is enough.
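A sketch of such an audit app (file and field names are placeholders; run it with `streamlit run audit_app.py`):

```python
import json
import streamlit as st

# Placeholder results file: one row per QA with chunk, gold answer, prediction, grader flag.
rows = [json.loads(line) for line in open("qa_v1_results.jsonl", encoding="utf-8")]

idx = st.number_input("Row", min_value=0, max_value=len(rows) - 1, value=0)
row = rows[idx]

st.subheader("Chunk")
st.write(row["chunk_text"])
st.subheader("Gold QA")
st.write(f"**Q:** {row['question']}")
st.write(f"**A:** {row['answer']}")
st.subheader("Model answer")
st.write(row["prediction"])
st.subheader("Grader flag")
st.write(row["grader_flag"])

if st.button("Override grader flag"):
    rows[idx]["grader_flag"] = not row["grader_flag"]
    st.success("Flag overridden (persist it back to disk in your own code)")
```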
8. Iterate
- If the gold answer is invalid, regenerate or drop.
- If grading is unfairly strict/lenient, tweak the grader prompt and re-run.
- If the model stumbles on certain patterns (e.g., tables, footnotes), feed more such chunks back into Step 2.
Minimal Code Example
```python
from openai import OpenAI
import hashlib, json, tqdm


def sha1(txt: str) -> str:
    """Stable fingerprint for a source file."""
    return hashlib.sha1(txt.encode()).hexdigest()


def split_into_chunks(text, tokens=400, overlap=80):
    """Fixed-size word chunks with overlap so boundary-spanning answers stay intact."""
    words = text.split()
    step = tokens - overlap
    for i in range(0, len(words), step):
        yield " ".join(words[i:i + tokens])


client = OpenAI()


def make_qas(chunk):
    """Generate up to 3 QA pairs for one chunk (SYS_QA_PROMPT is the Step 2 system prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYS_QA_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0.3,
    )
    return json.loads(resp.choices[0].message.content)


# iterate over docs, split, generate, validate → save gold.jsonl
```