Bootstrapping an LLM Evaluation Data Set
When you own proprietary content, public benchmarks provide little insight into real-world performance. This example shows how to turn your documents into an evaluation data set that is tailored to your domain.
1. Ingest the Corpus and Create Traceable Chunks
Load & fingerprint
- Give every source file a stable hash (e.g., SHA-1) and store metadata such as title, date, and owner.
Chunk intelligently
- Fixed size: 300-600 tokens is a sweet spot for GPT-class models.
- Or semantic: split on headings, bullet blocks, or `\n\n` using a recursive text splitter.
- Overlap 10-20% so that answers spanning boundaries are still inside one chunk.
Persist
- Table structure: `{doc_id, chunk_id, text, start_pos, end_pos}` (see the sketch below).
- Drop the chunks into a vector store now; you’ll reuse the embeddings later.
Fine-grained chunks keep questions specific, support exact citation, and allow for precise error analysis.
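For concreteness, here is a minimal persistence sketch using SQLite; the `chunks` table mirrors the structure above, and `persist_chunks` is an illustrative helper name rather than part of any library. Swap in whatever store you already use.

```python
import sqlite3

# Illustrative local store; columns mirror {doc_id, chunk_id, text, start_pos, end_pos}.
conn = sqlite3.connect("corpus.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        doc_id    TEXT,
        chunk_id  TEXT,
        text      TEXT,
        start_pos INTEGER,
        end_pos   INTEGER,
        PRIMARY KEY (doc_id, chunk_id)
    )
""")

def persist_chunks(doc_id: str, chunks: list[tuple[str, int, int]]) -> None:
    """chunks: list of (text, start_pos, end_pos) tuples for one document."""
    rows = [
        (doc_id, f"{doc_id}-{i:04d}", text, start, end)
        for i, (text, start, end) in enumerate(chunks)
    ]
    conn.executemany("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```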
2. Generate Synthetic Question–Answer Pairs
```
SYSTEM You are an expert quiz writer. Given the chunk below, generate up to 3
non-trivial questions answerable *only* from this chunk. Return JSON:
[{question, answer, support_lines}], where support_lines are quoted lines.
USER <chunk_text>
```
- Store results as `{chunk_id, q_id, question, answer, support}`.
Synthetic QAs are fast and cheap, but “support_lines” give you a handle to audit faithfulness.
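As a sketch of the bookkeeping (field names follow the schema above; `flatten_qas` and the `qa_raw.jsonl` filename are made up for illustration), each model response can be flattened into one JSONL row per question:

```python
import json

def flatten_qas(chunk_id: str, qas: list[dict], path: str = "qa_raw.jsonl") -> None:
    """Append one row per generated question, keyed back to its source chunk."""
    with open(path, "a", encoding="utf-8") as f:
        for i, qa in enumerate(qas):
            row = {
                "chunk_id": chunk_id,
                "q_id": f"{chunk_id}-q{i}",
                "question": qa["question"],
                "answer": qa["answer"],
                "support": qa["support_lines"],
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```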
3. Automated QA Sanity-Check
Run two filters:
String & logic rules
- Answer string must appear in `support_lines`.
- Every `support_line` must exist in the original chunk.
LLM fact-checker pass
```
SYSTEM Fact-check strictly. Is the answer fully supported by the chunk?
Respond JSON: {valid: true/false, reason: "..."}.
USER {"chunk": <chunk_text>, "question": "...", "answer": "...", "support": "..."}
```
Reject anything flagged `false`. Tools like Ragas or TruLens wrap this in a single call if you want off-the-shelf metrics such as “faithfulness” and “answer_relevancy.”
Automated filters prune obvious hallucinations before humans ever see them.
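Wired together, the two filters might look roughly like this (a sketch: the helper names are invented, and the fact-checker reuses the same OpenAI chat-completions client and `gpt-4o-mini` model as the code example at the end of this post):

```python
import json

def passes_string_rules(row: dict, chunk_text: str) -> bool:
    """Filter 1: cheap string & logic rules."""
    support = row["support"] if isinstance(row["support"], list) else [row["support"]]
    answer_in_support = any(row["answer"] in line for line in support)
    support_in_chunk = all(line in chunk_text for line in support)
    return answer_in_support and support_in_chunk

FACT_CHECK_PROMPT = (
    "Fact-check strictly. Is the answer fully supported by the chunk? "
    'Respond JSON: {"valid": true/false, "reason": "..."}.'
)

def passes_fact_check(client, row: dict, chunk_text: str) -> bool:
    """Filter 2: LLM fact-checker pass; reject anything flagged false."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FACT_CHECK_PROMPT},
            {"role": "user", "content": json.dumps({
                "chunk": chunk_text,
                "question": row["question"],
                "answer": row["answer"],
                "support": row["support"],
            })},
        ],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content).get("valid", False)
```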
4. Curate the Gold Split
- Keep only rows that pass both checks.
- Save as versioned JSONL (`qa_v1.jsonl`) so your experiments are reproducible.
- Aim for at least 1k QA pairs per domain-specific function you plan to evaluate (search, RAG, agent step, etc.).
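A minimal sketch of the freeze step, assuming each row carries `passed_rules` / `passed_fact_check` flags from the previous step (those field names are illustrative):

```python
import json

def write_gold_split(rows: list[dict], version: str = "v1") -> str:
    """Keep only rows that passed both checks and freeze them as qa_<version>.jsonl."""
    gold = [r for r in rows if r.get("passed_rules") and r.get("passed_fact_check")]
    path = f"qa_{version}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for r in gold:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return path
```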
5. Run the Target Model
```
SYSTEM Answer the question *using only the provided context*. If not present,
reply "NOT IN DOC".
USER {"context": <chunk_text>, "question": <question>}
```
Log the prediction, token counts, and latency for each call.
You can run multiple models or prompt variants; the dataset doubles as an A/B framework.
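One way to sketch the evaluation call (the `run_model` helper is illustrative; the token counts come from the standard `usage` field on the chat-completions response):

```python
import json, time
from openai import OpenAI

client = OpenAI()

ANSWER_PROMPT = (
    "Answer the question *using only the provided context*. "
    'If not present, reply "NOT IN DOC".'
)

def run_model(model: str, chunk_text: str, question: str) -> dict:
    """One evaluation call: returns the prediction plus the logging fields."""
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ANSWER_PROMPT},
            {"role": "user", "content": json.dumps(
                {"context": chunk_text, "question": question})},
        ],
        temperature=0,
    )
    return {
        "model": model,
        "prediction": resp.choices[0].message.content,
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "latency_s": round(time.time() - start, 3),
    }
```

Call it once per (model, question) pair and append each returned dict to a results JSONL.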
6. Score the Predictions
| Metric | When to use | How |
|---|---|---|
| Exact / F1 match | Extractive answers | `pred == gold` or token-level F1 |
| Semantic similarity | Paraphrased answers | SBERT cosine > 0.8 counts as correct |
| Faithfulness | High-stakes use | LLM grader: “Does answer quote or paraphrase supported text?” |
Aggregate by document type, chunk length, or model version to spot regime changes.
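For the first two rows of the table, a lightweight sketch (the SBERT part is shown commented out because it pulls in `sentence-transformers`; the 0.8 threshold is the one suggested above):

```python
def exact_match(pred: str, gold: str) -> bool:
    """Strict string match after trivial normalization."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 for extractive answers."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Semantic similarity: SBERT cosine, counting > 0.8 as correct.
# from sentence_transformers import SentenceTransformer, util
# sbert = SentenceTransformer("all-MiniLM-L6-v2")
# correct = util.cos_sim(sbert.encode(pred), sbert.encode(gold)).item() > 0.8
```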
7. Targeted Manual Audit
Audit three buckets:
- Random passes (2–5%): catch silent QA issues.
- All failures: determine whether the model or the dataset is wrong.
- Outliers (slowest responses, longest answers): these often signal prompt mistakes.
A quick Streamlit UI that shows chunk → gold QA → model answer → grader flag with a one-click “override” is enough.
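A sketch of such an audit app (file and field names are placeholders; run it with `streamlit run audit_app.py`):

```python
import json
import streamlit as st

# Placeholder results file: one row per QA with chunk, gold answer, prediction, grader flag.
rows = [json.loads(line) for line in open("qa_v1_results.jsonl", encoding="utf-8")]

idx = st.number_input("Row", min_value=0, max_value=len(rows) - 1, value=0)
row = rows[idx]

st.subheader("Chunk")
st.write(row["chunk_text"])
st.subheader("Gold QA")
st.write(f"**Q:** {row['question']}")
st.write(f"**A:** {row['answer']}")
st.subheader("Model answer")
st.write(row["prediction"])
st.subheader("Grader flag")
st.write(row["grader_flag"])

if st.button("Override grader flag"):
    rows[idx]["grader_flag"] = not row["grader_flag"]
    st.success("Flag overridden (persist it back to disk in your own code)")
```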
8. Iterate
- If the gold answer is invalid, regenerate or drop.
- If grading is unfairly strict/lenient, tweak the grader prompt and re-run.
- If the model stumbles on certain patterns (e.g., tables, footnotes), feed more such chunks back into Step 2.
Minimal Code Example
```python
from openai import OpenAI
import hashlib, json, tqdm


def sha1(txt: str) -> str:
    """Stable fingerprint for a source file."""
    return hashlib.sha1(txt.encode()).hexdigest()


def split_into_chunks(text, tokens=400, overlap=80):
    """Fixed-size word chunks with overlap so boundary-spanning answers stay intact."""
    words = text.split()
    step = tokens - overlap
    for i in range(0, len(words), step):
        yield " ".join(words[i:i + tokens])


client = OpenAI()


def make_qas(chunk):
    """Generate up to 3 QA pairs for one chunk (SYS_QA_PROMPT is the Step 2 system prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYS_QA_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0.3,
    )
    return json.loads(resp.choices[0].message.content)


# iterate over docs, split, generate, validate → save gold.jsonl
```