I first came across this idea through Nick Saraev, who made a video about adapting Karpathy's autoresearch concept to improve his email engagement. His approach was simple: use the same modify-evaluate-keep/rollback loop, but instead of optimizing a neural network, optimize the text of his emails. Let the AI rewrite, score the result, keep the improvement, discard the rest, and loop.
That clicked for me immediately. The original autoresearch by Karpathy was designed for LLM training: give an AI agent a training setup and let it experiment autonomously. It modifies code, trains a model, checks if results improved, keeps the win or rolls back, and loops. You go to sleep, wake up to a log of 100 experiments. Here's his tweet explaining the concept.
Nick applied it to email copy. I thought: why not apply it to my CV?
The Problem
Writing a CV is slow. You tweak a bullet point, wonder if it's better, try another version, lose track of what worked. If you're applying to multiple roles, each one wants different emphasis. It's a manual optimization problem with no feedback signal.
What if I could automate the feedback part?
The Setup
I built a system called CV Autoresearch. It has three parts:
1. The CV (a markdown file, cv.md). This is the only thing the agent edits.
2. The evaluator (fixed, never touched). It scores the CV on four dimensions against real job descriptions:
- Keyword match (35%): are the JD's skills and tools present in the CV?
- Experience match (25%): do the seniority level and years of experience align?
- Responsibility match (25%): do the CV's duties map to the JD's requirements?
- Phrasing quality (15%): action verbs, quantification, ATS-friendliness
Each dimension is scored 0-100, then weighted into a composite. Higher is better.
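The weighting is just a weighted average. A minimal sketch, assuming the four dimension scores arrive as a dict of 0-100 values (the key names here are illustrative, not the system's real field names):

```python
# Dimension weights from the evaluator spec.
WEIGHTS = {
    "keyword_match": 0.35,
    "experience_match": 0.25,
    "responsibility_match": 0.25,
    "phrasing_quality": 0.15,
}

def composite(scores):
    """Weighted average of the four dimension scores; returns 0-100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

For scores of 80 (keywords), 70 (experience), 60 (responsibility), and 90 (phrasing), the composite works out to 74.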
3. The job descriptions (two sets). A training set of 15+ real JDs from roles I'd actually apply to (scraped from job boards). A holdout set of 5 JDs the agent never optimizes against directly. The holdout is the real test: does the CV generalize, or did it just memorize the training JDs?
The Two-Model Architecture
This is the part that makes it work well. I use two different AI providers for two different jobs:
Claude Opus is the orchestrator. It reads the CV, reads the evaluation results, decides what to change, makes the edit, commits it, and manages the keep/rollback loop. It's the researcher. It has the context, the judgment, and the autonomy to run experiments back-to-back without stopping.
Groq (running openai/gpt-oss-120b) is the evaluator. Every time the orchestrator changes the CV, Groq scores it against all 20 job descriptions across four dimensions, three passes each for stability. It's the judge.
Using different models for editing and evaluating is deliberate. If the same model wrote the CV and scored it, it would optimize for its own biases. By using Claude to write and Groq to judge, I get variety. It's like having one person write the CV and a completely different person grade it. A simulated HR department of sorts, where the writer and the reviewer never coordinate.
For per-JD tailoring (after the general loop), I use a different Groq model (qwen/qwen3-32b) to add yet another perspective.
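The scoring pass itself is simple to sketch. Here `score_once(cv_text, jd)` is a stand-in for one Groq evaluator call returning a 0-100 composite for a single job description; the function name and shape are assumptions, not the real API:

```python
from statistics import mean

def score_cv(cv_text, jds, score_once, passes=3):
    """Average `passes` evaluator runs per JD to smooth out LLM variance."""
    per_jd = {
        jd_id: mean(score_once(cv_text, jd) for _ in range(passes))
        for jd_id, jd in jds.items()
    }
    # Overall composite is the mean across all JDs, plus the per-JD breakdown.
    return mean(per_jd.values()), per_jd
```

Averaging three passes per JD is what keeps a single noisy LLM judgment from triggering a spurious keep or rollback.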
The Loop
This is the Karpathy pattern, adapted the same way Nick adapted it for emails:
LOOP:
1. Read cv.md and the results log
2. Read the evaluation output (dimensional breakdown)
3. Find the WEAKEST dimension across all JDs
4. Make ONE focused modification targeting that weakness
5. Commit the change
6. Run the evaluator (3 passes per JD, averaged for stability)
7. Compare composite score to previous best
8. If improved: KEEP the commit
9. If equal or worse: ROLLBACK (git reset)
10. GOTO 1
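The steps above can be sketched as a small driver. `modify`, `evaluate`, `commit`, and `rollback` are injected stand-ins for the agent's edit step, the Groq scoring run, `git commit`, and `git reset` respectively:

```python
def autoresearch_loop(modify, evaluate, commit, rollback, iterations=100):
    best = evaluate()                 # baseline composite score
    log = []
    for i in range(iterations):
        modify()                      # one focused change, weakest dimension first
        commit()
        score = evaluate()            # 3 passes per JD, averaged
        if score > best:              # strictly better: keep the commit
            best = score
            log.append(("keep", i, score))
        else:                         # equal or worse: roll back
            rollback()
            log.append(("rollback", i, score))
    return best, log
```

The "equal or worse means rollback" rule is the important detail: ties are discarded, so the CV only ever moves when the evaluator registers a real improvement.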
The agent runs this loop autonomously. No pausing to ask permission. Each iteration takes a few minutes (API calls to score against 20 JDs, three passes each). You set it going and walk away.
The Rules
The agent can reword bullets, add real keywords, restructure sections, quantify achievements, adjust the summary, emphasize relevant experience, reorder skills. All fair game.
What it cannot do: invent experience, fabricate companies or titles, claim certifications not earned, add skills never used. Every word in the CV must be truthful. The agent optimizes presentation, not content.
The Strategy
The agent follows a phased approach based on what the evaluator tells it:
Phase 1: Keyword injection. The evaluator reports frequently missing keywords. The agent adds terms the person genuinely has experience with. Quick wins.
Phase 2: Responsibility alignment. For JDs with low responsibility match, reword bullets to mirror the JD's language. Use the same verbs and phrases.
Phase 3: Experience emphasis. Quantify years, team sizes, budgets. Structure experience to highlight seniority level.
Phase 4: Phrasing polish. Replace weak verbs ("helped" becomes "led", "worked on" becomes "delivered"). Add metrics and percentages. Lowest weight, so saved for last.
Phase 5: Structural experiments. If stuck, try different formats. Functional vs chronological. Grouped by skill area vs by role.
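To make phase 4 concrete: the substitutions have roughly this shape. The real agent rewrites with LLM judgment rather than a lookup table; this regex map is purely illustrative:

```python
import re

# Illustrative weak-to-strong verb upgrades for the phrasing-polish phase.
VERB_UPGRADES = {
    r"\bhelped\b": "led",
    r"\bworked on\b": "delivered",
    r"\bresponsible for\b": "owned",
}

def polish(bullet):
    """Apply each weak-to-strong substitution to one CV bullet."""
    for weak, strong in VERB_UPGRADES.items():
        bullet = re.sub(weak, strong, bullet)
    return bullet
```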
Overfitting Detection
This is why the holdout set matters. If the training JD score climbs but the holdout score drops, the CV is overfitting to specific JD wording instead of improving generally.
The agent watches the gap: if training and holdout diverge by more than 15 points, it shifts strategy toward general improvements (action verbs, quantification) over JD-specific terms.
The Evaluator in Detail
Each JD evaluation runs three passes through Groq and averages the scores. This smooths out LLM variance. The evaluator also runs a "reverse CV test": given this CV, what role would a recruiter say this person is best suited for? If the answer drifts from "Digital Project Manager" to something else, the optimization went sideways.
A gap analysis identifies the top 5 weaknesses across all JDs with concrete suggestions for fixing them using existing experience only.
The final composite score weighs role match (30%), training JD average (35%), and holdout JD average (35%). The holdout weight being equal to training is intentional. Generalization matters as much as fit.
Per-Job Tailoring
After the general optimization loop finishes, there's a second tool: the tailor. You point it at a specific job description and it takes the already-optimized CV and further adapts it for that exact role. It scores before and after to confirm the tailoring helped. Outputs markdown and PDF.
I used this to generate tailored CVs for specific applications at companies like Moka, BMS, Iubenda, and Scott. The general CV is the foundation. The tailor handles the last mile.
What I Learned
The feedback loop changes everything. Without it, CV writing is guessing. With it, you know exactly which dimension is weak and by how much. The agent doesn't guess. It reads the numbers and targets the weakest point.
Using different models for writing and judging is key. Same model doing both creates a feedback loop that optimizes for the model's own preferences, not for actual quality. The Claude/Groq split simulates having different people with different perspectives.
Keyword match is the easiest win. Most CVs score low on keywords simply because they use different words for the same things. The agent catches these mismatches fast.
The holdout set keeps you honest. Without it, the agent would overfit to specific JD phrasings. The holdout is your generalization check. If both scores go up together, the improvement is real.
Truthfulness is a hard constraint, not a suggestion. The agent works within the boundary of what's real. It finds better ways to say true things. That constraint is what makes the output usable, not just optimized.