Science
March 8, 2025

Building Transparent AI

At Maki People, we believe AI should empower, not obscure. In an era where AI-driven assessments shape careers and businesses, trust is not just an ethical stance—it’s a business imperative. Here’s why transparent AI matters and how we are actively working towards achieving it, ensuring companies and candidates can make informed, fair, and confident decisions.

AI You Can Trust: The Business and Candidate Case for Transparency

Why it matters: Transparent AI isn’t just an ethical choice; it’s a competitive advantage. Businesses need to trust the tools they use to assess talent, and candidates deserve clarity on how decisions are made. At Maki People, we are committed to developing AI that provides insights into decision-making, ensuring hiring teams can rely on data-driven evaluations while candidates receive accurate, objective, and fair scores that reflect their true potential.

Our progress:

  • Evaluating AI predictions against expert human ratings to ensure alignment and accuracy
  • Conducting internal audits to refine decision-making transparency
  • Continuously evaluating and selecting optimal foundation models to ensure reliable and accurate scoring procedures
  • Working towards explainable AI models that will provide clearer reasoning behind candidate evaluations

Ensuring Scoring Accuracy

Our commitment to precision in candidate evaluation drives everything we do. We've transitioned from traditional single-choice questions to more insightful open-ended formats, conducting extensive research to identify the most accurate AI technologies for understanding candidate responses.

Our research team has rigorously tested multiple semantic analysis approaches, including BERT-based models, MiniLM-L6, MPNet, BGE embeddings, and OpenAI's GPT-4 models, to identify the optimal technologies for accurately assessing candidates' open-ended responses. This comprehensive testing has revealed clear performance differences between technologies, guiding our implementation decisions to maximize assessment accuracy.

For example, the sentence-embedding approach using models like Multilingual MPNet shows significant improvements over traditional word-based methods, achieving up to 67% accuracy with a natural distribution of similarity scores that better discriminates between response qualities. Unlike word embeddings, which produce artificially high similarities across almost all responses, sentence embeddings show a more realistic evaluation pattern, allowing for more meaningful differentiation.

For even greater accuracy, our testing of OpenAI's GPT models demonstrated impressive performance, with 81-82% accuracy and excellent precision (95%) for identifying incorrect answers. This represents a substantial improvement over baseline approaches. Our technical implementation includes robust error handling, automated retries, and progress tracking to ensure reliable results regardless of the model used.
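To make the embedding comparison concrete, here is a minimal sketch of how a sentence-embedding approach can score an open-ended response against a reference answer. The vectors, threshold, and function names below are illustrative assumptions, not our production pipeline; a real system would obtain embeddings from a sentence encoder such as a multilingual MPNet model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_response(response_vec: np.ndarray, reference_vec: np.ndarray,
                   threshold: float = 0.75) -> dict:
    """Mark a response as correct when its embedding is close enough
    to the reference answer's embedding (threshold is illustrative)."""
    sim = cosine_similarity(response_vec, reference_vec)
    return {"similarity": sim, "correct": sim >= threshold}

# Toy 4-dimensional "embeddings"; a real pipeline would produce
# several hundred dimensions from a sentence encoder.
reference = np.array([0.9, 0.1, 0.3, 0.2])
on_topic  = np.array([0.8, 0.2, 0.4, 0.1])
off_topic = np.array([0.1, 0.9, 0.1, 0.8])

print(score_response(on_topic, reference))   # high similarity
print(score_response(off_topic, reference))  # low similarity
```

Because whole-sentence vectors capture meaning rather than word overlap, paraphrased correct answers score high while fluent but off-topic answers score low, which is the discrimination gain described above.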

The technical advantages of our approach are clear: significantly higher precision compared to traditional methods, more accurate identification of truly similar responses, and assessment results that better align with expert human judgment. Our implementation provides flexibility to balance accuracy, speed, and cost based on specific assessment needs, ensuring you receive the most precise evaluation of candidate capabilities possible.

We continue to refine our models through ongoing research and validation, maintaining our commitment to scientific rigor in AI assessment. This means more meaningful insights and fairer evaluations for all candidates, regardless of background or response style, delivered through technology that continues to evolve with the latest advancements in AI.

On top of the model evaluations we conduct for Situational Judgment Tests, we also extensively test language proficiency assessment capabilities across LLMs with varying parameters to ensure evaluation robustness. Our comparative analysis of seven OpenAI models, Claude models, and DeepSeek across multiple prompts achieved up to a 90% exact match rate in CEFR level predictions (A1-C2). This rigorous testing across different use cases ensures our assessment technology delivers consistent accuracy regardless of the skill being evaluated.
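The exact match rate behind a figure like 90% is simple to compute: the fraction of responses where the model's predicted CEFR level equals the expert-assigned level. The sketch below uses made-up labels, not our evaluation set, and adds an adjacent-match variant as a common companion metric.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def exact_match_rate(predicted: list[str], gold: list[str]) -> float:
    """Fraction of responses where the predicted CEFR level exactly
    matches the expert-assigned level."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def adjacent_match_rate(predicted: list[str], gold: list[str]) -> float:
    """Looser metric: prediction within one CEFR level of the gold label."""
    idx = {lvl: i for i, lvl in enumerate(CEFR_LEVELS)}
    hits = sum(abs(idx[p] - idx[g]) <= 1 for p, g in zip(predicted, gold))
    return hits / len(gold)

# Illustrative labels only.
gold = ["B1", "B2", "C1", "A2", "B2"]
pred = ["B1", "B2", "B2", "A2", "B2"]
print(exact_match_rate(pred, gold))     # 4 of 5 exact
print(adjacent_match_rate(pred, gold))  # the miss is off by one level
```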

Beyond Accuracy: Striving for Fair and Bias-Minimized AI

Why it matters: AI-driven hiring should level the playing field, not reinforce biases. While accuracy is crucial, fairness is equally important. Our research has evaluated the reliability, validity, and fairness of LLMs' assessments, comparing AI scores to human raters across soft skills like result orientation, team orientation, and adaptability. The findings demonstrated strong alignment between AI and human evaluation, with intraclass correlation coefficients showing good to excellent agreement. Bland-Altman analyses confirmed minimal systematic bias, demonstrating that we actively test our models to ensure assessments measure potential. Where statistical differences between demographic groups did appear, they were small (typically less than one point) rather than practically meaningful.

Our recent validation study examined how our AI evaluates three critical soft skills: result orientation, team orientation, and adaptability. The study compared AI evaluations against multiple qualified industrial-organizational psychologists using a structured scoring grid.

Key Performance Metrics

  • Strong Agreement with Human Experts:
    • Result Orientation: 91% agreement between AI and human expert consensus
    • Team Orientation: 92% agreement between AI and human expert consensus
    • Adaptability: 89% agreement between AI and human expert consensus
  • Strong Correlation with Human Evaluators:
    • Result Orientation: r = 0.86
    • Team Orientation: r = 0.78
    • Adaptability: r = 0.73

Demographic Fairness Testing

We rigorously evaluated the LLMs for potential bias across demographic groups:

  • Gender Analysis: Statistical testing found no significant scoring differences between male and female candidates
  • Age Evaluation: No significant differences in scores between younger and older candidates
  • Systematic Bias Assessment: Bland-Altman analyses indicated minimal systematic bias in AI evaluations
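A Bland-Altman analysis summarizes paired AI and human scores by the mean difference (systematic bias) and 95% limits of agreement. The sketch below uses invented paired scores, not our study data:

```python
import statistics

def bland_altman(ai_scores: list[float], human_scores: list[float]) -> dict:
    """Bland-Altman statistics for paired scores: mean difference
    (systematic bias) and 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return {"bias": bias,
            "loa_low": bias - 1.96 * sd,
            "loa_high": bias + 1.96 * sd}

# Illustrative paired scores on a 1-5 scale.
ai    = [3.1, 4.0, 2.2, 4.8, 3.9, 3.0]
human = [3.0, 4.2, 2.0, 5.0, 4.0, 3.1]
stats = bland_altman(ai, human)
print(stats)  # a bias near zero indicates minimal systematic offset
```

A bias close to zero with narrow limits of agreement is what "minimal systematic bias" means operationally: the AI neither consistently over-scores nor under-scores relative to human raters.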

Our Continuous Improvement Process

We remain committed to fairness and transparency by:

  • Expanding our training data to represent diverse candidate pools
  • Iteratively improving our scoring methods
  • Including more demographic groups in our fairness testing
  • Validating assessments grounded in established statistical principles and psychometric techniques

This transparent approach ensures that candidates are evaluated based on their true abilities, not irrelevant factors, while allowing organizations to benefit from the efficiency and scalability of AI-assisted evaluation.

Compliance and Ethics: Staying Ahead of AI Regulations

Why it matters: Regulatory frameworks around AI in hiring are evolving fast. Companies that fail to adopt transparent AI risk compliance issues and legal consequences. At Maki People, we are proactively aligning with global AI ethics standards and collaborating with independent auditing companies to continuously improve our compliance practices.

Our commitment to fairness has been assessed through an independent audit by Holistic AI, a respected AI governance platform. Their comprehensive assessment of several of our tests delivered encouraging results: all demographic groups evaluated were treated fairly in the hiring process. When applying the standard four-fifths rule measurement—a key metric for identifying potential hiring discrimination—they found no concerning differences in selection rates between different demographic categories. This held true whether examining individual characteristics or intersectional combinations of characteristics, confirming our assessments meet rigorous fairness standards. (Please refer to this page for more information)
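The four-fifths rule itself is a simple computation: every group's selection rate must be at least 80% of the highest group's rate. The sketch below uses hypothetical applicant counts, not audit data:

```python
def selection_rate(selected: int, applicants: int) -> float:
    """Share of a group's applicants who were selected."""
    return selected / applicants

def four_fifths_check(rates: dict[str, float]) -> dict:
    """Four-fifths (adverse impact) rule: each group's selection rate
    should be at least 80% of the highest group's rate."""
    best = max(rates.values())
    ratios = {group: r / best for group, r in rates.items()}
    return {"ratios": ratios,
            "passes": all(r >= 0.8 for r in ratios.values())}

# Hypothetical counts for two demographic groups.
rates = {
    "group_a": selection_rate(45, 100),  # 0.45
    "group_b": selection_rate(40, 100),  # 0.40
}
result = four_fifths_check(rates)
print(result["passes"])  # 0.40 / 0.45 ~ 0.89, above the 0.8 threshold
```

In an actual audit the same ratio is computed for every demographic category and, as noted above, for intersectional combinations of categories.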

Our ongoing efforts:

  • Subjecting our AI assessment models to independent audits and continuing to enhance our fairness metrics, ensuring our technology remains unbiased and compliant with evolving regulations as we scale our platform
  • Achieved ISO 27001 certification, demonstrating our commitment to information security and regulatory compliance
  • Aligning with the EU AI Act, GDPR, EEOC guidelines, and emerging AI hiring regulatory frameworks
  • Enhancing internal AI documentation to facilitate compliance audits
  • Regularly reviewing AI ethics best practices to ensure responsible development
  • Strengthening candidate privacy and data protection protocols

Implementing Human-in-the-Loop Validation Systems

Why it matters: While AI offers scalability and consistency, human expertise remains essential for maintaining assessment quality. Our research shows that combining AI capabilities with human oversight creates a more robust evaluation system than either approach alone.

Our technical approach:

  • Structured Validation Protocol: We've implemented a systematic review process in which human experts evaluate a significant sample (~10%) of AI-scored assessments to ensure ongoing accuracy and identify scoring patterns that require calibration.
  • Edge Case Detection: Our monitoring system automatically flags unusual scoring patterns and edge cases for expert review, improving overall system accuracy by identifying cases where AI confidence is lowest
  • Continuous Learning Framework: We maintain a feedback loop in which human reviewers' corrections are documented and inform subsequent refinements
  • Audit Documentation Infrastructure: Our proprietary tracking system maintains comprehensive records of all assessment decisions, enabling detailed fairness analyses across demographic factors and providing complete transparency for compliance requirements

This human-in-the-loop approach complements our AI technology, ensuring that while we leverage automation for efficiency, human judgment remains the foundation of our assessment methodology.

Final Thoughts: The Future of AI in Hiring Starts with Trust

At Maki People, transparency isn’t just a feature—it’s a journey. We are continuously improving our AI models to enhance explainability, fairness, and compliance while maintaining human oversight. By prioritizing responsible AI development, we empower companies to hire with confidence and integrity while ensuring candidates gain valuable insights into their potential for career growth. In a world where AI is reshaping recruitment, trust is the currency of success.
