As skill assessment moves into a new era, the convergence of Artificial Intelligence (AI) and Psychometrics is reshaping how we evaluate human abilities. At Maki People, we leverage AI technologies combined with a variety of psychometric techniques to ensure reliability, fairness and validity in AI-powered assessments.
This article outlines our current methodology and ongoing advancements in AI-based skill assessment, providing insights into how AI-scored tests—whether in multiple-choice, written or spoken formats—can match or eventually exceed human raters in accuracy, objectivity, and fairness.
From Traditional Multiple Choice to AI-Driven Open Text Assessment
Recruitment assessment has historically relied on structured formats like multiple-choice questions, offering limited insight into a candidate's critical thinking and job-relevant abilities. These formats constrain how candidates can demonstrate their capabilities, often yielding an incomplete picture of their potential. Open-ended responses, meanwhile, have traditionally been evaluated by human raters, a process that is slow and carries inherent limitations: subjectivity, inconsistency between recruiters, time constraints, and difficulty scaling for high-volume hiring. While frameworks like Item Response Theory (IRT) and Structural Equation Modeling (SEM) ensure the robustness of assessments, they have been applied primarily to structured test formats that don't fully capture a candidate's problem-solving approach or communication style.
By incorporating speech-to-text (STT) and text-to-speech (TTS) synthesis models alongside text-based AI evaluation, we extend beyond standard assessment experiences to assess a wider range of candidate responses—including both written and spoken formats. STT models ensure accurate transcription of candidate speech into text, while TTS technology enhances automated interviewer interactions, creating a more dynamic and natural assessment experience.
In conjunction with the advent of Generative AI (Gen-AI), we are witnessing a paradigm shift in recruitment assessment. This technology enables organizations to move beyond traditional approaches in several transformative ways:
- Gen-AI utilizes large language models (LLMs) to evaluate complex, open-ended responses that demonstrate job-relevant competencies.
- Gen-AI leverages LLMs and natural language processing (NLP) to assess power skills crucial for workplace success.
- Gen-AI applies consistent scoring criteria across thousands of applications without fatigue or bias.
- Gen-AI analyzes candidate reasoning in real time, providing immediate insights to hiring teams.
- Gen-AI employs LLMs to generate content on the fly for bespoke client-specific tests.
However, this transition introduces important challenges that must be addressed:
- Are scores generated by LLMs comparable to experienced hiring manager evaluations?
- Does AI scoring introduce or amplify systematic biases across demographic groups?
- Can AI-generated content achieve psychometric equivalence to professional item writers?
To address these concerns, Maki People employs advanced statistical methods to rigorously evaluate AI-generated scores, ensuring that our recruitment assessments are not only efficient and scalable but also scientifically sound and fair for candidates.
Evaluating AI Scoring: Inter-Rater Agreement and Calibration Analysis
A crucial step in our validation process is assessing the agreement between AI-generated scores and human raters. We use Inter-Rater Reliability (IRR) metrics, such as Intraclass Correlation Coefficients (ICC) and Cohen’s Kappa, to evaluate the consistency between AI and expert scorers.
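To make the agreement check concrete, here is a minimal sketch of Cohen's kappa for categorical rubric scores, implemented from scratch in Python. The scores shown are hypothetical examples, not real candidate data:

```python
from collections import Counter

def cohen_kappa(ai_scores, human_scores):
    """Cohen's kappa: chance-corrected agreement between two raters
    assigning categorical scores to the same set of responses."""
    n = len(ai_scores)
    # Observed proportion of exact agreements
    observed = sum(a == h for a, h in zip(ai_scores, human_scores)) / n
    ai_freq = Counter(ai_scores)
    human_freq = Counter(human_scores)
    # Expected agreement if the two raters scored independently
    expected = sum(ai_freq[c] * human_freq.get(c, 0) for c in ai_freq) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical 1-4 rubric scores for ten candidate responses
ai    = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
human = [3, 2, 4, 2, 1, 2, 3, 4, 3, 3]
print(round(cohen_kappa(ai, human), 3))  # → 0.714
```

A kappa near 1 indicates near-perfect agreement after correcting for chance; in practice, thresholds for acceptable AI-human agreement are set per assessment.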
To thoroughly investigate potential discrepancies between LLMs and human scoring patterns, we employ:
- Bland-Altman analysis to identify systematic differences or biases in scoring across the range of performance levels
- Statistical analysis with demographic variables to detect if AI-human score differences are associated with specific candidate characteristics
- Regression analysis to determine if factors beyond candidate ability influence scoring discrepancies
- Error pattern analysis to identify specific types of responses where AI and human ratings tend to diverge
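The Bland-Altman step above reduces to estimating the mean AI-human score difference (systematic bias) and its 95% limits of agreement. A small illustrative sketch, using made-up scores rather than production data:

```python
from statistics import mean, stdev

def bland_altman(ai_scores, human_scores):
    """Bland-Altman summary: mean AI-human difference (systematic bias)
    and 95% limits of agreement (bias ± 1.96 SD of the differences)."""
    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    bias = mean(diffs)
    spread = stdev(diffs)
    return bias, (bias - 1.96 * spread, bias + 1.96 * spread)

# Hypothetical 0-10 essay scores from the AI model and a human rater
ai    = [7.0, 5.5, 8.0, 6.0, 9.0, 4.5, 7.5, 6.5]
human = [6.5, 5.0, 8.5, 6.0, 8.5, 5.0, 7.0, 6.5]
bias, (low, high) = bland_altman(ai, human)
print(round(bias, 3))  # → 0.125
```

A bias close to zero with narrow limits suggests no systematic over- or under-scoring by the AI; in a full analysis the differences are also plotted against the mean score to spot bias that varies by performance level.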
This comprehensive approach ensures that our AI scoring models are calibrated to align with human expert judgment while maintaining fairness and accuracy across diverse candidate populations. (Please refer to [Building Transparent AI: Why Trust Is Non-Negotiable] for more details on how we conduct evaluations of LLMs.)
Later in our validation process, we conduct further analyses to ensure the assessment system as a whole functions fairly across demographic groups; the initial AI-human comparison focuses on establishing scoring equivalence.
Evaluating Speech-to-Text (STT) Models with Expert Review
One of the critical aspects of AI-driven assessment is the accurate transcription and interpretation of spoken responses. At Maki People, we rigorously evaluate Speech-to-Text (STT) models to ensure their accuracy, consistency, and applicability across diverse candidate populations. While many off-the-shelf STT solutions claim high accuracy, their real-world performance varies significantly across accents, speech patterns, and background noise conditions.
To validate STT models, we employ a dual-layered evaluation process:
- Expert Human Review: Transcriptions generated by STT models are compared against expert- or native-speaker-transcribed speech data, using metrics such as Word Error Rate (WER), Phoneme Error Rate (PER), and Semantic Error Rate (SER). This ensures that the model correctly captures key linguistic nuances.
- Empirical Evaluation: We assess whether transcription errors systematically affect test scores, particularly for non-native speakers or individuals with specific accents or speech impairments.
- Impact on Automated Scoring: STT outputs serve as the input for LLMs that perform automated scoring. When our analysis identifies languages or dialects where STT performance is inadequate, we exclude these languages from our automated speech assessment pipeline to maintain assessment integrity.
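Of the metrics above, WER is the workhorse: the minimum number of word substitutions, deletions, and insertions needed to turn the STT transcript into the expert transcript, divided by the reference length. A minimal sketch with an invented example sentence:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

reference  = "the candidate described a phased migration plan"
transcript = "the candidate described a phase migration plan"
print(round(word_error_rate(reference, transcript), 3))  # → 0.143
```

One substituted word out of seven gives a WER of about 14%; per-language WER distributions like this feed the decision to include or exclude a language from the automated pipeline.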
Beyond speech-to-text transcription, AI-powered assessments must reliably evaluate pronunciation quality, particularly in multilingual settings. While many AI-driven pronunciation models claim to support multiple languages, their effectiveness across different linguistic contexts varies significantly.
To assess the performance of pronunciation assessment models, we conducted a structured evaluation across multiple languages. Our methodology involved:
- Native Speaker Review: Native speakers rated the accuracy of pronunciation scores, categorizing them as low accuracy, accurate, or high accuracy.
- Independent Expert Review: A second layer of evaluation was conducted by external linguistic experts, ensuring a broader perspective.
- Comparative Analysis: We compared ratings from native speakers and experts to identify systematic discrepancies and assess the reliability of the model across languages.
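One simple way to surface such discrepancies is to treat the three labels as an ordinal scale and compute the mean absolute gap between native-speaker and expert ratings per language. The sketch below uses invented review data purely for illustration:

```python
from collections import defaultdict

# Ordinal accuracy labels mapped to integers: low=0, accurate=1, high=2
LABELS = {"low accuracy": 0, "accurate": 1, "high accuracy": 2}

def mean_discrepancy_by_language(ratings):
    """Mean absolute gap between native-speaker and expert labels,
    grouped by language, to flag languages needing closer review."""
    gaps = defaultdict(list)
    for language, native, expert in ratings:
        gaps[language].append(abs(LABELS[native] - LABELS[expert]))
    return {lang: sum(g) / len(g) for lang, g in gaps.items()}

# Hypothetical review rows: (language, native rating, expert rating)
reviews = [
    ("French",  "accurate",      "accurate"),
    ("French",  "high accuracy", "accurate"),
    ("Spanish", "accurate",      "accurate"),
    ("Spanish", "low accuracy",  "accurate"),
    ("Spanish", "accurate",      "high accuracy"),
]
for lang, gap in sorted(mean_discrepancy_by_language(reviews).items()):
    print(lang, round(gap, 2))
```

Languages with a persistently large gap are candidates for deeper error analysis before the pronunciation model is trusted there.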
This evaluation underscores the importance of rigorous validation when integrating pronunciation assessment models into AI-driven evaluations. Systematic biases—especially across diverse linguistic groups—must be carefully monitored to maintain fairness and accuracy in automated speech assessments.
At Maki People, we apply a similar dual-layered evaluation approach to assess the reliability of AI-based scoring models, ensuring that both transcription quality and pronunciation evaluation are validated before deployment.
Psychometric Evaluations
Despite our innovative use of AI technologies in the recruitment and assessment process, we remain committed to psychometric rigor. We employ well-established psychometric frameworks to ensure our assessments meet the highest scientific standards:
- IRT for Item-Level Analysis: Item Response Theory focuses on the performance of individual items, ensuring that each question is psychometrically sound and contributes meaningfully to the assessment.
- SEM for Construct-Level Validation: Structural Equation Modeling evaluates the overall structure and validity of the assessment, ensuring that the test as a whole measures the intended latent constructs accurately and consistently.
- DIF for Fairness Analysis: Differential Item Functioning analysis examines whether assessment items perform equivalently across different demographic groups, detecting and addressing potential sources of bias to ensure equitable evaluation of all candidates regardless of background.
- Psychometric Rigor: Together, IRT, SEM, and DIF analysis ensure that our AI-powered assessments satisfy rigorous psychometric criteria for reliability, validity, and fairness, maintaining the scientific integrity expected in high-stakes assessment contexts.
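As a concrete anchor for the IRT framework above, here is the standard two-parameter logistic (2PL) item response function, which models the probability that a candidate with ability theta answers an item correctly given its discrimination and difficulty. The parameter values are illustrative, not fitted to any Maki People test:

```python
import math

def irt_2pl(theta, a, b):
    """2PL item response function: probability of a correct answer
    for ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A discriminating item (a=1.5) of average difficulty (b=0.0):
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(irt_2pl(theta, a=1.5, b=0.0), 3))
```

Higher discrimination steepens the curve around the item's difficulty, which is exactly what item-level analysis checks: items that fail to separate low from high ability are revised or dropped.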
This comprehensive approach integrates cutting-edge AI with established psychometric methodologies, ensuring that our assessment system delivers both innovation and scientific credibility. By leveraging these complementary frameworks, we maintain assessment quality while harnessing the efficiency and scalability advantages of AI-driven evaluation.
One such example of our psychometric evaluation process is the Problem Solving (Advanced) Test, a widely used assessment designed to measure critical thinking, logical reasoning, and analytical skills—all essential for evaluating complex information and making data-driven decisions.
To ensure scientific rigor and fairness, we applied:
- Item Response Theory (IRT) to assess item difficulty and discrimination, ensuring the test effectively differentiates between high- and low-performing candidates. The test demonstrated high internal consistency (0.87) and strong psychometric properties.
- Differential Item Functioning (DIF) analysis to verify that items perform equally across different demographic groups (gender, age, ethnicity). No significant bias was found, reinforcing the test’s equitable nature.
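One common DIF statistic of the kind referenced above is the Mantel-Haenszel common odds ratio, computed across ability strata so that group differences in overall ability don't masquerade as item bias. A minimal sketch with invented counts (not data from the Problem Solving test):

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability strata.
    Each stratum is (ref_correct, ref_wrong, focal_correct, focal_wrong).
    Values near 1.0 suggest the item functions similarly for both groups."""
    num = sum(rc * fw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    den = sum(fc * rw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    return num / den

# Hypothetical counts in three ability strata (low, mid, high),
# each comparing a reference group with a focal group
strata = [
    (20, 30, 18, 32),
    (35, 15, 33, 17),
    (45,  5, 44,  6),
]
print(round(mantel_haenszel_odds_ratio(strata), 3))  # → 1.199
```

An odds ratio close to 1 within each stratum, as here, is consistent with an item that is equally difficult for both groups at the same ability level.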
This is just one example of how we rigorously evaluate our assessments, ensuring they are scientifically robust, fair, and predictive of real-world performance, while leveraging the efficiency and scalability of AI-driven evaluation.
Beyond Scoring: LLMs for Content Generation in Assessment
While our focus thus far has been on AI-driven scoring, Large Language Models offer tremendous value in assessment content generation that extends far beyond evaluation capabilities. At Maki People, we leverage LLMs to revolutionize the entire assessment creation process, ensuring content that is diverse and relevant prior to psychometric testing.
The benefits of LLM-powered content generation in assessment include:
- Rapid creation of diverse assessment items across various domains, skill levels, and formats
- Difficulty calibration through carefully crafted questions and problems, allowing us to target specific ability levels
- Domain-specific scenario generation that simulates real-world workplace challenges with authentic context
- Culturally sensitive and globally relevant content that minimizes geographic or cultural biases
- Automatic distractor generation for multiple-choice assessments with plausible yet clearly incorrect options
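To illustrate the distractor-generation idea, here is a sketch of how a prompt for an LLM might be assembled. The wording and structure are hypothetical, not Maki People's production prompt, and no particular LLM API is assumed:

```python
def distractor_prompt(stem, correct_answer, n_distractors=3):
    """Builds an illustrative LLM prompt asking for plausible-but-wrong
    options for a multiple-choice item."""
    return (
        f"Question: {stem}\n"
        f"Correct answer: {correct_answer}\n"
        f"Write {n_distractors} distractors that reflect common "
        "misconceptions, match the correct answer's length and style, "
        "and are unambiguously incorrect. Return one distractor per line."
    )

print(distractor_prompt(
    "Which measure of central tendency is robust to outliers?",
    "The median"))
```

Constraining length, style, and misconception-plausibility in the prompt is what keeps generated distractors from being trivially eliminable, which would otherwise inflate item easiness.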
Our research demonstrates that LLM-generated assessment content, when properly validated using psychometric techniques, achieves comparable or superior item discrimination and reliability metrics compared to traditionally developed assessments. Furthermore, the ability to rapidly iterate and refine content allows for continuous improvement of assessment quality.
The integration of human expertise with LLM capabilities creates a powerful synergy—psychometricians can focus on validation and refinement rather than initial content creation, dramatically accelerating the assessment development lifecycle while maintaining scientific rigor.
The Future of AI-Driven Assessments: Ongoing Research and Development
Our work is far from complete. As AI continues to evolve, so do our methods for refining and validating automated scoring algorithms. Key areas of ongoing research include:
- Adapting Gen-AI models for multilingual skill assessments to ensure fairness across linguistic groups
- Developing adaptive testing mechanisms where the algorithm dynamically selects test items based on real-time performance
- Expanding beyond textual analysis into conversational speech and multimodal assessments, where AI evaluates not just written responses but also spoken communication in a conversational-like manner
- Investigating Gen-AI's ability to provide real-time, personalized feedback to enhance learning outcomes
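The adaptive-testing direction above typically rests on maximum-information item selection: at each step, administer the unasked item whose Fisher information is highest at the candidate's current ability estimate. A minimal sketch under the 2PL model, with an invented item bank:

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def next_item(theta, items):
    """Select the (discrimination, difficulty) pair with maximum
    Fisher information I(theta) = a^2 * p * (1 - p) at the current
    ability estimate theta."""
    def info(item):
        a, b = item
        p = p_2pl(theta, a, b)
        return a * a * p * (1.0 - p)
    return max(items, key=info)

# Hypothetical item bank as (discrimination, difficulty) pairs
bank = [(1.2, -1.0), (1.5, 0.0), (0.8, 0.5), (1.4, 1.2)]
print(next_item(0.2, bank))  # → (1.5, 0.0)
```

For a candidate currently estimated near average ability, the highly discriminating average-difficulty item is the most informative choice, which is why adaptive tests converge on an ability estimate with far fewer items than fixed forms.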
At Maki People, we are pioneering a new paradigm for skill assessment—one where AI scoring is not just automated but scientifically validated and psychometrically rigorous. Our combination of AI and psychometric modeling ensures that skill assessments are fair, accurate, and scalable, helping employers, educators, and candidates make data-driven decisions in an evolving workforce.