Can Machines Govern Machines? What Emerging Research Means for Enterprise Compliance
Jun 1, 2026

Introduction
The rapid growth of artificial intelligence and data privacy laws has raised an important question for businesses today: can AI systems help govern and monitor other AI systems? A 2025 research paper by Superset Labs, titled “Can We Trust AI to Govern AI?”, explored this idea by testing ten leading large language models (LLMs) from companies such as OpenAI, Anthropic, Google DeepMind, Meta, and DeepSeek. These models were evaluated using four globally recognised privacy and AI governance certification exams conducted by the International Association of Privacy Professionals (IAPP): CIPP/US, CIPM, CIPT, and AIGP.
The results were remarkable. Advanced models like Gemini 2.5 Pro and GPT-5 not only passed these certification exams but scored above the minimum marks usually required for human professionals. This becomes highly relevant as organisations across the world face growing compliance obligations under laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), the EU AI Act, and India’s Digital Personal Data Protection Act (DPDPA). The study suggests that AI systems are becoming capable enough to assist organisations in handling complex compliance and governance responsibilities.
At the same time, the research highlights an important reality: passing exams does not automatically mean AI can independently manage legal and compliance decisions in real-world situations. While these tools may support tasks like compliance monitoring, policy drafting, documentation, and risk assessment, there are still concerns about accuracy, accountability, bias, and over-reliance on automated systems. For enterprise compliance leaders and privacy professionals, the real challenge is not deciding whether to use AI but understanding how to integrate it responsibly while ensuring that human oversight remains at the centre of decision-making.
The Research Landscape: LLMs and Privacy Governance Exams
Before interpreting the results, it is worth understanding what was actually tested. The IAPP certifications used in the study represent the recognised gold standard for privacy and AI governance expertise:
CIPP/US - Covers U.S. privacy law, including HIPAA, CCPA, GLBA, COPPA, and FTC enforcement frameworks. It is the foundational credential for privacy attorneys and compliance officers operating in U.S.-regulated environments.
CIPM - Focuses on the operationalisation of privacy within organisations: program governance, risk management, breach response, and vendor oversight.
CIPT - Targets privacy engineers and technologists. It examines de-identification, encryption, access control, privacy-by-design, and emerging privacy technologies.
AIGP - The newest credential, addressing AI ethics, bias mitigation, AI risk management, regulatory compliance (including the EU AI Act and the NIST AI RMF), and responsible AI governance.
Each exam contains between 90 and 100 multiple-choice questions. The IAPP uses a weighted scoring scale out of 500, where 300 is a passing score, roughly equivalent to answering 66-83 percent of questions correctly. The research team tested all ten models using official IAPP sample exams in a zero-shot, closed-book setting, meaning the models relied solely on their pre-trained knowledge with no access to reference materials.
This method was designed to keep the testing fair and realistic. The AI models were tested in the same way as human candidates, which makes the results a trustworthy way to compare AI performance with human professional standards.
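The scoring arithmetic above can be made concrete with a small sketch. It sorts a raw percentage into three bands using the 66 percent lower bound quoted above and the 83.3 percent 'definite pass' line used later in Table 1. The function name `pass_band` and the band labels are illustrative assumptions, and this is not the IAPP's own scaled-score conversion.

```python
# Illustrative sketch only: band labels and function name are assumptions.
LOWER_BOUND = 66.0  # lower end of the raw-score range associated with a 300/500 pass
UPPER_BOUND = 83.3  # above this, the article treats a result as a "definite pass"


def pass_band(percent_correct: float) -> str:
    """Sort a raw percentage of correct answers into a rough pass band."""
    if not 0.0 <= percent_correct <= 100.0:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct > UPPER_BOUND:
        return "definite pass"
    if percent_correct >= LOWER_BOUND:
        return "possible pass"
    return "likely fail"


if __name__ == "__main__":
    for score in (93.3, 80.0, 63.3):
        print(f"{score:.1f}% -> {pass_band(score)}")
```

A score that falls in the middle band may or may not reach 300 on the scaled score, which is why the article treats only results above the upper bound as unambiguous.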
What the Benchmarks Reveal: Key Findings
The aggregate results paint a clear picture: most frontier LLMs have already surpassed the threshold for professional human certification in privacy law, programme management, technical privacy, and AI governance. The table below summarises the scores on each of the four exams, along with the aggregate:
Model | CIPP/US | CIPM | CIPT | AIGP | Aggregate |
Gemini 2.5 Pro | 90.0% | 92.2% | 92.2% | 93.9% | 92.1% |
GPT-5 (OpenAI) | 93.3% | 90.0% | 90.0% | 91.9% | 91.3% |
DeepSeek-R1 | 86.7% | 90.0% | 92.2% | 91.9% | 90.2% |
Claude 3.7 Sonnet | 91.1% | 91.1% | 80.0% | 91.9% | 88.1% |
Gemini 1.5 Pro | 87.8% | 87.8% | 86.7% | 92.9% | 88.9% |
GPT-5-Mini (OpenAI) | 87.8% | 88.9% | 85.6% | 92.9% | 88.9% |
Gemma-3-27B-IT | 84.4% | 90.0% | 84.4% | 90.9% | 87.5% |
Meta-LLaMA-3-70B | 81.1% | 84.4% | 83.3% | 87.9% | 84.3% |
Claude 3.5 Haiku | 80.0% | 82.2% | 83.3% | 84.8% | 82.7% |
Meta-LLaMA-3-8B | 63.3% | 57.8% | 66.7% | 72.7% | 65.3% |
Table 1: Aggregate model performance across all four IAPP certification exams. Scores above 83.3% represent a 'definite pass' against the human threshold.
Eight of ten models exceed the 83.3% 'definite pass' threshold on aggregate score. The pattern is consistent: larger, more capable proprietary models lead, while smaller open-weight models trail, though even open-source alternatives like Gemma-3-27B-IT perform remarkably well relative to their size and cost.
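Readers who want to check threshold counts and per-exam averages against Table 1 can do so with a short script. The sketch below transcribes the table's figures. The column order (CIPP/US, CIPM, CIPT, AIGP, then the published aggregate) and the name of the seventh row (Gemma-3-27B-IT, matched on its 87.5% aggregate) are assumptions inferred from the surrounding text, and the helper names are hypothetical.

```python
# Figures transcribed from Table 1. Column order is an assumption:
# (CIPP/US, CIPM, CIPT, AIGP, published aggregate).
THRESHOLD = 83.3
COLUMNS = ("CIPP/US", "CIPM", "CIPT", "AIGP", "Aggregate")

SCORES = {
    "Gemini 2.5 Pro": (90.0, 92.2, 92.2, 93.9, 92.1),
    "GPT-5": (93.3, 90.0, 90.0, 91.9, 91.3),
    "DeepSeek-R1": (86.7, 90.0, 92.2, 91.9, 90.2),
    "Claude 3.7 Sonnet": (91.1, 91.1, 80.0, 91.9, 88.1),
    "Gemini 1.5 Pro": (87.8, 87.8, 86.7, 92.9, 88.9),
    "GPT-5-Mini": (87.8, 88.9, 85.6, 92.9, 88.9),
    "Gemma-3-27B-IT": (84.4, 90.0, 84.4, 90.9, 87.5),  # row name is an assumption
    "Meta-LLaMA-3-70B": (81.1, 84.4, 83.3, 87.9, 84.3),
    "Claude 3.5 Haiku": (80.0, 82.2, 83.3, 84.8, 82.7),
    "Meta-LLaMA-3-8B": (63.3, 57.8, 66.7, 72.7, 65.3),
}


def count_above(column: int, threshold: float = THRESHOLD) -> int:
    """Number of models scoring strictly above the threshold in one column."""
    return sum(1 for row in SCORES.values() if row[column] > threshold)


def column_mean(column: int) -> float:
    """Unweighted mean of one column across all models, to one decimal place."""
    values = [row[column] for row in SCORES.values()]
    return round(sum(values) / len(values), 1)


if __name__ == "__main__":
    for index, name in enumerate(COLUMNS):
        print(f"{name}: {count_above(index)} of {len(SCORES)} above "
              f"{THRESHOLD}%, mean {column_mean(index)}%")
```

The means computed here are simple unweighted averages, so they need not match figures the study derived with a different weighting.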
Where LLMs Excel and Where They Fall Short
Strengths: Legal Knowledge and AI Governance
The research shows that LLMs are performing at a level close to human experts in several areas. In the CIPP/US exam, which focuses on U.S. privacy laws, all advanced AI models scored above the passing mark, with GPT-5 achieving an impressive 93.3%. The models performed especially well in topics like government and court access to private-sector data and workplace privacy, mainly because these areas are based on well-established legal principles that are widely available in training data.
The AIGP exam, which covers AI ethics, the EU AI Act, the NIST AI RMF, bias mitigation, and incident response, produced the highest aggregate scores of any exam (average: 89% across all models). This is an important signal for enterprises: frontier LLMs have meaningfully internalised the principles of responsible AI governance that regulators are increasingly codifying into law.
Weakness: Emerging and Technical Privacy Areas
The CIPT exam showed that AI models still struggle in some advanced privacy areas. While they performed well on basic privacy principles and privacy by design, they scored lower on topics related to emerging privacy technologies and privacy-enhancing tools like differential privacy and federated learning. The study also found that Anthropic models, despite being known for strong coding abilities, scored lower than some other leading models on this exam. This highlights that technical privacy expertise requires not just coding skills, but also a strong understanding of law, ethics, and privacy governance.
The CIPM Gap: Why Programme Management Lags
The most important finding for enterprise compliance leaders may be the one that receives the least attention: the structural weakness of LLMs in privacy programme management. The correlation matrix in the study reveals that CIPM scores have a Pearson correlation of just 0.24 with CIPT scores and 0.5 with AIGP scores, the lowest cross-exam correlations in the study. By contrast, CIPT and AIGP scores correlate at 0.91, and CIPP/US and AIGP at 0.93.
This shows that AI models that perform well in legal and technical privacy topics may still struggle with privacy programme management. The CIPM exam focuses on areas like governance frameworks, vendor risk management, breach response, and handling privacy operations across organisations, all of which require practical judgment and real-world decision-making. Because of this, LLMs should not be seen as direct replacements for experienced privacy managers, especially in complex compliance situations involving third-party risks or regulatory pressure. The researchers suggest that these gaps could narrow if AI models were trained on management-focused case studies and standards like ISO/IEC 27701. For companies using AI-based compliance tools, an important question is whether the model has been specifically trained for privacy management tasks or relies only on general training data.
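For teams that want to run the same kind of cross-exam comparison on their own evaluation data, the Pearson coefficient is straightforward to compute. The sketch below is a plain implementation of the standard formula. The two score lists are made-up placeholders, not figures from the study, and the function name `pearson` is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-model scores on two exams (placeholders only).
    exam_a = [88.0, 91.0, 79.0, 85.0, 93.0]
    exam_b = [84.0, 80.0, 86.0, 90.0, 82.0]
    print(f"r = {pearson(exam_a, exam_b):.2f}")
```

A coefficient near 1 means models that do well on one exam tend to do well on the other. A value close to 0 means performance on one exam says little about the other, which is the pattern the study reports between CIPM and CIPT.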
Implications for Enterprise Compliance Programmes
AI as a Force Multiplier, Not a Replacement
The research shows that top AI models now have knowledge levels similar to certified privacy professionals in many areas. However, the authors also make it clear that passing an exam is not the same as handling the full responsibilities of a real compliance professional. Important skills like practical judgment, managing stakeholders, understanding business context, and dealing with regulators cannot be measured through multiple-choice tests.
This means AI should be seen as a support tool, not a replacement for human experts. LLMs can help by drafting privacy policies, creating compliance checklists, identifying possible gaps in regulations, and answering basic employee questions about data privacy. But they should not make major legal or compliance decisions on their own, such as deciding whether a data breach must be reported, advising on cross-border data transfers, or building an organisation’s privacy governance framework.
Regulatory Risk: What Compliance Leaders Must Watch
Laws like the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), India’s Digital Personal Data Protection Act (DPDPA), and the EU AI Act make organisations responsible for compliance, not the AI tools they use. This means that if an AI-generated compliance suggestion turns out to be wrong, the legal responsibility still lies with the company and its privacy or AI governance team.
Because of this, organisations need to carefully manage several risks while using LLMs in compliance work. AI models can sometimes give confident but incorrect legal answers, especially in complex or fast-changing areas of privacy law. Another challenge is that privacy regulations are evolving quickly, and some models may not have updated knowledge of recent laws, enforcement actions, or regulatory guidance. The study also focused mainly on U.S.-based certifications, so good performance in U.S. privacy law does not automatically mean strong understanding of European or other international privacy frameworks. In addition, even high-performing models may still have weaknesses in specific topics, which means companies cannot rely only on overall scores while making important compliance decisions.
Open-Source vs. Proprietary: The Governance Calculus
One of the most practical findings from the research was related to open-weight AI models. Google DeepMind’s Gemma-3-27B-IT, an open-source model, scored 87.5% overall and passed all four certification exams. This is important for organisations that cannot use third-party AI tools because of data privacy or data localisation requirements.
The study suggests that companies may still be able to use open-source AI models for tasks like legal research, answering compliance questions, and drafting privacy policies without losing much accuracy. However, unlike third-party AI services, organisations using open-weight models must handle responsibilities like model updates, security management, and performance monitoring on their own.
Practical Guidance for Privacy and Compliance Leaders
Use LLMs for High-Volume, Well-Defined Tasks
The areas where LLMs demonstrate the most reliable competence (legal knowledge, AI governance principles, and foundational privacy frameworks) align well with high-volume, lower-stakes compliance tasks. Automating first-draft privacy notices, generating data subject access request (DSAR) response templates, mapping data flows to applicable regulatory obligations, and conducting initial vendor due diligence questionnaires are all strong candidates for LLM-assisted automation.
Build Human Review into Every High-Stakes Workflow
For any compliance output that is externally communicated, regulatory-facing, or legally consequential, LLM-generated content should undergo mandatory human review by a qualified privacy professional. This is not merely a best practice; it is increasingly a regulatory expectation under the accountability principle articulated in GDPR Article 5(2) and operationalised through Data Protection Impact Assessment (DPIA) requirements.
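One way to make this rule hard to bypass is to encode it as a release gate in the workflow tooling. The sketch below is a minimal illustration under that assumption. The class `ComplianceOutput`, the functions `requires_human_review` and `release`, and the status strings are all hypothetical names, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComplianceOutput:
    """An LLM-generated draft plus the three risk flags named above."""
    title: str
    externally_communicated: bool = False
    regulatory_facing: bool = False
    legally_consequential: bool = False


def requires_human_review(output: ComplianceOutput) -> bool:
    """Any one of the three flags is enough to require review."""
    return (
        output.externally_communicated
        or output.regulatory_facing
        or output.legally_consequential
    )


def release(output: ComplianceOutput, reviewer: Optional[str] = None) -> str:
    """Hold flagged outputs until a named reviewer has signed off."""
    if requires_human_review(output) and reviewer is None:
        return f"HOLD: '{output.title}' needs review by a qualified privacy professional"
    signed_off = f"reviewed by {reviewer}" if reviewer else "no review required"
    return f"RELEASE: '{output.title}' ({signed_off})"


if __name__ == "__main__":
    notice = ComplianceOutput("Breach notification draft", regulatory_facing=True)
    checklist = ComplianceOutput("Internal onboarding checklist")
    print(release(notice))
    print(release(notice, reviewer="DPO"))
    print(release(checklist))
```

Recording the reviewer's name alongside the release decision also leaves a simple audit trail, which supports the accountability principle the paragraph refers to.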
Benchmark Your AI Tools Against Professional Standards
The study's methodology, which uses IAPP certification exams as benchmarks, offers a replicable framework for enterprise AI procurement decisions. Before deploying any LLM-based compliance tool, organisations should test the model's performance against domain-specific questions drawn from the relevant regulatory frameworks governing their operations. Generic marketing claims about AI capability are not a substitute for targeted, domain-specific evaluation.
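A domain-specific evaluation of this kind does not need heavy tooling. The sketch below shows one possible shape for a small multiple-choice harness that reports accuracy per domain, so that weak areas are not hidden behind an overall score. Everything in it is an assumption for illustration: the `Question` structure, the `evaluate` function, the placeholder questions, and the stub `always_a`, which stands in for a real call to the tool under test.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Tuple


class Question(NamedTuple):
    domain: str                 # e.g. a regulatory framework or task area
    prompt: str                 # question text shown to the model
    options: Tuple[str, ...]    # answer choices, labelled A, B, C, ...
    answer: str                 # label of the correct option


def evaluate(
    ask_model: Callable[[str, Tuple[str, ...]], str],
    questions: List[Question],
) -> Dict[str, float]:
    """Return the percentage of correct answers for each domain."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question in questions:
        total[question.domain] += 1
        reply = ask_model(question.prompt, question.options).strip().upper()
        if reply == question.answer:
            correct[question.domain] += 1
    return {domain: 100.0 * correct[domain] / total[domain] for domain in total}


def always_a(prompt: str, options: Tuple[str, ...]) -> str:
    """Stub model that always answers 'A'; replace with a real model call."""
    return "A"


if __name__ == "__main__":
    # Placeholder questions; a real set would be written by domain experts.
    sample = [
        Question("breach response", "Placeholder question 1", ("A", "B"), "A"),
        Question("breach response", "Placeholder question 2", ("A", "B"), "B"),
        Question("vendor risk", "Placeholder question 3", ("A", "B"), "A"),
        Question("vendor risk", "Placeholder question 4", ("A", "B"), "A"),
    ]
    for domain, accuracy in evaluate(always_a, sample).items():
        print(f"{domain}: {accuracy:.1f}%")
```

Keeping the model behind a plain callable means the same question set can be run against several shortlisted tools and the per-domain results compared side by side.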
Prioritise CIPM-Equivalent Capability Assessment
Given the research finding that CIPM-equivalent knowledge (privacy programme management) is the most differentiated and least correlated with general model capability, compliance leaders should specifically test shortlisted LLM tools on programme management scenarios: breach response decision trees, vendor risk tiering methodologies, data retention governance, and privacy operational lifecycle questions.
Develop an AI Governance Policy for Compliance AI Tools
Enterprises using AI tools to support compliance functions should themselves apply AI governance standards to those tools. This means documenting the model's intended use cases, known limitations, human oversight mechanisms, performance monitoring cadences, and update procedures. Under the EU AI Act's risk classification framework, AI systems used in legal compliance contexts may warrant elevated governance treatment.
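The documentation items listed above can be kept as a structured record rather than free text, which makes gaps easy to detect. The following is a minimal sketch under that assumption. The class `ComplianceToolRecord`, its field names, and the helper `missing_entries` are hypothetical and simply mirror the items named in the paragraph.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ComplianceToolRecord:
    """Governance documentation for one AI tool used in compliance work."""
    tool_name: str
    intended_use_cases: List[str] = field(default_factory=list)
    known_limitations: List[str] = field(default_factory=list)
    human_oversight: str = ""      # who reviews outputs, and when
    monitoring_cadence: str = ""   # e.g. "quarterly re-test"
    update_procedure: str = ""     # how model or version changes are approved


def missing_entries(record: ComplianceToolRecord) -> List[str]:
    """Names of documentation fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


if __name__ == "__main__":
    record = ComplianceToolRecord(
        tool_name="Policy drafting assistant",
        intended_use_cases=["first-draft privacy notices"],
        human_oversight="Privacy counsel reviews every external draft",
        monitoring_cadence="quarterly re-test",
    )
    print("Still undocumented:", missing_entries(record))
```

A check like this could run before a tool is approved for use, so that a deployment cannot proceed while any documentation field is blank.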
From Research to Reality: How GoTrust Is Operationalising AI-Driven Compliance
DPO Copilot: A Living Privacy Agent - Launched in April 2025, DPO Copilot takes an adaptive AI approach. Its five core capabilities (data discovery, consent lifecycle, DSR handling, breach preparedness, and regulatory policy updates) map directly to the CIPM programme-management gap identified in the study.
AI Governance: Bridging the AIGP Knowledge Gap in Practice - GoTrust's ISO/IEC 42001 module is positioned as the operational layer that converts LLMs' strong theoretical AI governance knowledge into enterprise infrastructure: DSPM, bias monitoring, audit dashboards, and multi-framework compliance.
Compliance Automation: Closing the Programme Management Gap - GoTrust's automation platform ties directly to the CIPM correlation finding (r = 0.24) and is designed to fill the gap that frontier models cannot bridge.
Data Discovery, DSPM, and Vendor Risk - These capabilities address a key LLM limitation: a model can answer governance questions but cannot see inside an organisation's data environment.
Conclusion
The Superset Labs benchmark study arrives at a moment when enterprises are simultaneously under pressure to scale compliance operations and to demonstrate responsible AI governance. Its findings confirm that frontier LLMs have crossed the threshold of professional-grade knowledge in privacy law, technical privacy, and AI governance. They have not yet crossed the threshold of reliable, context-sensitive programme management judgment.
For privacy officers, compliance leads, and AI governance professionals, the practical takeaway is nuanced. LLMs are a powerful force multiplier for high-volume, well-defined compliance tasks: drafting, mapping, classification, and question-answering. They are not yet reliable autonomous agents for high-stakes, context-dependent compliance decisions that require organisational judgment, stakeholder accountability, and regulatory relationship management.
The enterprises best positioned to benefit from AI-assisted compliance are those that approach LLM deployment with the same rigour they apply to any other data processing activity: with a clear purpose, defined human oversight, documented accountability, and regular performance review. That, in essence, is privacy by design applied to the tools of privacy itself.
As the regulatory landscape continues to evolve, and as the AI models that underpin compliance tooling continue to improve, the question for enterprise leaders is not whether to engage with AI governing AI. It is how to do so responsibly, at scale, and in alignment with the accountability principles that define trustworthy data practice.




