Methodology
How SPARK is built and how to read the scores.
SPARK is a rubric-based, AI-scored capacity assessment. This page is the full transparency document — what each dimension measures, what the score levels mean, how scoring works, where human review enters, and what the scores are and are not meant to support.
The six dimensions
Each dimension is anchored in published I-O psychology and cognitive science research. Construct definitions and full rubric content are versioned in the repository and referenced on every score record.
Reasoning Under Ambiguity
Making defensible analytical and decision-relevant judgments when the situation provides incomplete, conflicting, or evolving information. Anchored in Budner, Webster & Kruglanski, Kahneman, and Klein.
Learning Velocity
Acquiring and operationalizing new knowledge from unfamiliar domains quickly. Anchored in Dweck on growth mindset, Carroll's three-stratum theory of cognitive abilities, Ericsson on deliberate practice, and the transfer-of-learning literature.
Communication Architecture
Structuring complex information for a specific audience and purpose. Anchored in Grice's maxims, Sperber & Wilson on relevance, Sweller on cognitive load, and the rhetorical-genre tradition.
Creative Problem Decomposition
Breaking ill-structured problems into tractable sub-problems and surfacing non-obvious recombinations. Anchored in Newell & Simon, Wertheimer, Csikszentmihalyi, and Polya.
Judgment Calibration
Assessing confidence accurately and adjusting assertions to match what the evidence supports. Anchored in Lichtenstein & Fischhoff, Flavell on metacognition, Tetlock on forecasting, and Yates on judgment under uncertainty.
Stakeholder Navigation
Identifying stakeholder interests, anticipating responses, and designing moves that account for the landscape. Anchored in Mitchell, Agle & Wood on stakeholder salience, Allison & Zelikow on organizational politics, Pfeffer on power, and the negotiation literature (Fisher & Ury, Raiffa).
The 1-5 rubric
Each dimension scores responses on a five-level integer scale. Levels are behaviorally anchored — each level has 3-5 observable behaviors that mark a response at that level, plus boundary examples distinguishing each level from the one above. Levels are not "grades" in the GPA sense; they describe the structure of the reasoning the response demonstrates.
- Level 1. The response does not engage with the construct at all (e.g., asserts a conclusion without acknowledging the situation is ambiguous).
- Level 2. The response names the construct as a thing but does not use it to structure the analysis.
- Level 3. The response engages the construct competently within the immediate situation.
- Level 4. The response engages the construct skillfully, with explicit weighing, decision rules, or sequenced moves.
- Level 5. The response reasons about the structure of the situation itself — what kind of problem is being solved, what the situation reveals about the broader environment.
The differentiator is structural, not stylistic. Response length, vocabulary, and tone are not signals of any particular level. A structurally rich short response can earn Level 5; a structurally flat long response cannot.
How scoring works
SPARK uses an AI algorithm to score each response against a published rubric. The algorithm is selected for its calibration accuracy on reasoning tasks and is paired with a human quality-review program. The specific model and rubric version used for each score are recorded with the result, so any report can be reproduced or audited.
- The candidate submits a response to one of the six scenarios.
- The response is sent to the AI scoring service running with temperature=0.0 for deterministic output.
- The system prompt includes the full rubric for the dimension being scored: the construct definition, all five score-level anchors with their behavioral markers, the red-flag indicators, and three calibration examples at Levels 2, 4, and 5.
- The algorithm returns a structured output via tool use (not free-form text) containing: the score (1-5), a 3-4 paragraph rationale, verbatim evidence quotes from the candidate's response, and a structured report card (strengths, gaps, next steps).
- Guardrails run on the output before persistence: the score must be an integer 1-5; the evidence quotes must be verbatim substrings of the candidate's response; the report card must have at least one item in each of strengths/gaps/next-steps.
- If any guardrail fails, the algorithm is re-prompted once with a semantic-correction message. If the second response also fails, the scoring call is recorded as failed and the response is queued for human review.
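The guardrail-and-retry steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not SPARK's actual implementation: `ScoreOutput`, `check_guardrails`, `score_with_retry`, and the `call_scorer` callback are hypothetical names, and the wording of the correction message is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScoreOutput:
    """Hypothetical shape of the structured output returned by the scorer."""
    score: object            # validated below; may arrive malformed
    rationale: str
    evidence_quotes: list
    strengths: list
    gaps: list
    next_steps: list


def check_guardrails(out: ScoreOutput, response_text: str) -> list[str]:
    """Return a list of guardrail violations (empty means the output passes)."""
    errors = []
    # Score must be an integer 1-5 (bool is excluded because it subclasses int).
    if (isinstance(out.score, bool) or not isinstance(out.score, int)
            or not 1 <= out.score <= 5):
        errors.append("score must be an integer from 1 to 5")
    # Every evidence quote must appear verbatim in the candidate's response.
    for quote in out.evidence_quotes:
        if quote not in response_text:
            errors.append(f"evidence quote is not a verbatim substring: {quote!r}")
    # The report card needs at least one item in each section.
    for section in ("strengths", "gaps", "next_steps"):
        if not getattr(out, section):
            errors.append(f"report card section {section!r} is empty")
    return errors


def score_with_retry(
    response_text: str,
    call_scorer: Callable[[str, Optional[str]], ScoreOutput],
) -> dict:
    """Call the scorer, re-prompt once on guardrail failure, then give up."""
    correction = None
    errors: list[str] = []
    for attempt in (1, 2):
        out = call_scorer(response_text, correction)
        errors = check_guardrails(out, response_text)
        if not errors:
            return {"status": "scored", "output": out, "attempts": attempt}
        # Assumed format for the semantic-correction message.
        correction = "Please fix the following problems: " + "; ".join(errors)
    # Second failure: record as failed and route to human review.
    return {"status": "failed", "queue": "human_review",
            "errors": errors, "attempts": 2}
```

Passing the scorer in as a callback keeps the guardrail logic independent of any particular AI service, so the checks can be unit-tested without network calls.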
Scoring takes roughly ten seconds per task. Every score is recorded with the exact rubric version that produced it, and every score is eligible for human review under the SPARK quality program described below.
Human review
Three conditions trigger human review on a score:
- A guardrail violation that survives the semantic retry.
- A red-flag indicator detected by the model (fabricated certainty, off-topic response, demographic self-reference, harmful recommendation).
- A random sample of routine scores (currently ~5% per dimension) for ongoing reliability monitoring.
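The three triggers can be expressed as a small routing function. This is a sketch only: the function name, the flag identifiers, and the reading of "routine" as "not already flagged by the other two triggers" are assumptions made for illustration.

```python
import random

# Hypothetical identifiers for the red-flag indicators listed above.
RED_FLAGS = {
    "fabricated_certainty",
    "off_topic",
    "demographic_self_reference",
    "harmful_recommendation",
}


def review_reasons(guardrail_failed, red_flags, rng, sample_rate=0.05):
    """Return the reasons (possibly none) a score is routed to human review."""
    reasons = []
    if guardrail_failed:
        reasons.append("guardrail_failure")
    if set(red_flags) & RED_FLAGS:
        reasons.append("red_flag")
    # Routine scores (no other trigger) are sampled at roughly sample_rate.
    if not reasons and rng.random() < sample_rate:
        reasons.append("random_sample")
    return reasons


# Example: a response flagged as off-topic is always reviewed.
print(review_reasons(False, ["off_topic"], random.Random()))
```

Taking the random generator as a parameter lets the sampling be seeded in tests while remaining random in production.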
The human reviewer scores the response blind to the AI score, then the two are compared. Cohen's kappa between AI and human scores is computed on a rolling 200-score window. Target agreement is > 0.70 (substantial agreement); the system alerts on drops below 0.65.
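For readers who want to see how the agreement statistic behaves, here is a minimal sketch of unweighted Cohen's kappa over a rolling window. The class and method names are hypothetical, the choice of the unweighted form is an assumption (the ordinal 1-5 scale could also justify a weighted variant), and the "warming_up" state for a partially filled window is an illustrative design choice.

```python
from collections import Counter, deque


def cohens_kappa(pairs):
    """Unweighted Cohen's kappa for a sequence of (ai_score, human_score) pairs."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one scored pair")
    # Observed agreement: share of pairs where both raters gave the same score.
    observed = sum(ai == human for ai, human in pairs) / n
    # Chance agreement: product of each rater's marginal proportions, summed.
    ai_counts = Counter(ai for ai, _ in pairs)
    human_counts = Counter(human for _, human in pairs)
    expected = sum(ai_counts[k] * human_counts[k] for k in ai_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category; kappa is undefined,
        # so this sketch treats the case as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


class AgreementMonitor:
    """Tracks kappa over the most recent `window` AI/human score pairs."""

    def __init__(self, window=200, target=0.70, alert_below=0.65):
        self.pairs = deque(maxlen=window)
        self.target = target
        self.alert_below = alert_below

    def add(self, ai_score, human_score):
        self.pairs.append((ai_score, human_score))

    def status(self):
        if len(self.pairs) < self.pairs.maxlen:
            return "warming_up"
        kappa = cohens_kappa(self.pairs)
        if kappa < self.alert_below:
            return "alert"
        return "on_target" if kappa > self.target else "below_target"
```

Because kappa corrects for chance agreement, a window in which both raters mostly give the same common score can show high raw agreement but a modest kappa.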
Rubric versioning
Every rubric carries a semantic version (e.g., v1.0.0, v0.1.0). Active rubrics are immutable — changing the content of a rubric requires creating a new version, calibrating it, and migrating active scoring traffic to the new version with a documented effective date.
Current rubric status:
- Reasoning Under Ambiguity: v1.0.0, calibrated, in production.
- Learning Velocity, Communication Architecture, Creative Problem Decomposition, Judgment Calibration, Stakeholder Navigation: v0.1.0 interim. These rubrics are structurally complete and follow the same template as v1.0.0, but have not yet been calibrated by the I-O psychology partner against ≥500 responses from real respondents. Calibration to v1.0.0 is a Phase 3 deliverable; the calibration sample is being collected now.
Every score in your report carries the rubric version that produced it. If a rubric is re-calibrated later, your historical score remains tied to its original version and can be re-evaluated against the new version on request.
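One way to picture the immutability rule and the version stamp on each score is the sketch below. It is illustrative only: `Rubric`, `RubricRegistry`, `ScoreRecord`, and their fields are hypothetical and do not describe the actual repository schema.

```python
import re
from dataclasses import dataclass

# Accepts versions written like v1.0.0 or v0.1.0.
_VERSION = re.compile(r"^v\d+\.\d+\.\d+$")


@dataclass(frozen=True)
class Rubric:
    dimension: str
    version: str
    content: str


@dataclass(frozen=True)
class ScoreRecord:
    dimension: str
    rubric_version: str   # ties the score to the rubric that produced it
    model_id: str         # assumed field for the recorded scoring model
    score: int


class RubricRegistry:
    """Stores published rubrics; a published version can never be overwritten."""

    def __init__(self):
        self._rubrics = {}

    def publish(self, rubric: Rubric) -> None:
        if not _VERSION.match(rubric.version):
            raise ValueError(f"not a semantic version: {rubric.version!r}")
        key = (rubric.dimension, rubric.version)
        if key in self._rubrics:
            raise ValueError("version already published; create a new version")
        self._rubrics[key] = rubric

    def get(self, dimension: str, version: str) -> Rubric:
        return self._rubrics[(dimension, version)]

    def rubric_for(self, record: ScoreRecord) -> Rubric:
        """Look up the exact rubric a historical score was produced with."""
        return self.get(record.dimension, record.rubric_version)
```

Freezing the dataclasses and refusing duplicate keys means a content change can only enter the registry as a new version, which is what lets a historical score stay tied to its original rubric.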
Test integrity
SPARK is a developmental feedback instrument. A report is only useful to you if it reflects how you actually reason, which means responses have to be your own. The current assessment surface enforces this with two soft controls:
- Honor-code acknowledgement. Before any task is shown, you check a box committing not to use AI tools to draft or substantially edit your responses. The acknowledgement is timestamped and recorded with your session.
- Paste disabled on the response field. You cannot paste into the response textarea; the response has to be typed. This is a friction-based deterrent, not a perfect detector, but it blocks low-effort paste-from-AI workflows and makes the social contract explicit.
We do not currently run AI-text classifiers on submitted responses. The construct we score (structured reasoning in prose) is precisely the pattern AI models produce well, which would make classifier-based detection unreliable in both directions and unfair to careful writers. If SPARK moves toward a hiring-credential positioning, real proctoring (live sessions, multi-modal evidence, behavioral telemetry) is a Phase 3 requirement, documented in the open-decisions register (OD-21).
Research foundation
The SPARK methodology is informed by research across seven knowledge domains: psychometric theory and validity frameworks, situational judgment testing and constructed response assessment, AI scoring reliability and LLM evaluation, fairness and adverse impact analysis, the six capacity dimensions (reasoning under ambiguity, learning velocity, communication architecture, creative problem decomposition, judgment calibration, and stakeholder navigation), automated essay scoring methodology, and U.S. employment law. The full research bibliography will be linked below.
Built on 275+ peer-reviewed studies, professional standards, and legal authorities.
Bibliography PDF is being finalized. The link will appear here once the document is published.
Limitations and intended use
Use for development, not employment decisions. SPARK is positioned as developmental feedback for individuals and teams. Scores are not designed for hiring, firing, performance review, promotion, or compensation decisions. Using AI-scored assessments in employment decisions triggers EEOC UGESP, NYC Local Law 144, Illinois HB 3773, and Colorado SB 24-205 compliance obligations that SPARK is not yet built to satisfy. The five interim rubrics specifically have not been validated for employment use and will not be until the Phase 3 calibration study completes.
The construct is reasoning in writing, not real-world performance. A high SPARK score indicates the candidate can demonstrate the cognitive moves in writing under low time pressure. Whether they do so in their actual work is a different question that SPARK does not measure.
Generative AI can produce competent SPARK responses. A candidate who uses an AI tool to write their responses may score well. For the current developmental positioning, that risk is acceptable; for any future hiring-credential positioning it would not be. Test-integrity controls for the commercial version are documented in OD-21 of the repo's open-decisions register.