K-Anonymity, L-Diversity, and T-Closeness: Which One Actually Protects Your Dataset?
Apr 30, 2026
Introduction
Protecting individual privacy has never been more critical. Organisations across domains, from healthcare and public administration to commercial analytics and OTT platforms, routinely share data for research and business purposes: medical and financial records, behavioural analytics, recommendations and viewing histories. Each of these releases carries a real risk that individuals can be traced back from supposedly de-identified data and exposed to harm.
Privacy-preserving techniques are the primary defence against this re-identification risk. Three statistical privacy models, K-Anonymity, L-Diversity and T-Closeness, together form a framework for quantifying and limiting individual privacy risk in tabular datasets.
This blog covers how each model works, its operational limitations, and how organisations should choose among them in light of the current regulatory landscape under frameworks such as the GDPR and India’s Digital Personal Data Protection Act, 2023 (DPDPA).
Understanding K-Anonymity
In 2002, Latanya Sweeney introduced K-Anonymity, now regarded as a landmark in privacy engineering. A dataset is k-anonymous when each individual's record cannot be distinguished from at least k-1 other records based on the quasi-identifiers in the dataset. Quasi-identifiers are attributes such as age, gender and ZIP code that do not identify a person on their own but can do so in combination.
To achieve k-anonymity, two primary techniques are used: generalisation, which replaces specific values with broader categories (for example, replacing the exact age "34" with the range "30–39"), and suppression, which removes certain values entirely (replacing them with an asterisk or a null value). The higher the value of k, the stronger the privacy protection, but the greater the potential loss of data utility.
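As a rough sketch of how these two operations look in practice, the snippet below bins ages into decade ranges and masks a ZIP code column with pandas; the column names, binning scheme and sample records are illustrative assumptions rather than a prescribed recipe.

```python
import pandas as pd

def generalise_age(df: pd.DataFrame, column: str = "age") -> pd.DataFrame:
    """Generalisation: replace exact ages with 10-year ranges, e.g. 34 -> '30-39'."""
    out = df.copy()
    lower = (out[column] // 10) * 10
    out[column] = lower.astype(str) + "-" + (lower + 9).astype(str)
    return out

def suppress_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Suppression: remove a quasi-identifier entirely by masking it with '*'."""
    out = df.copy()
    out[column] = "*"
    return out

# Hypothetical patient records used only to illustrate the transformations.
patients = pd.DataFrame({
    "age": [34, 37, 52, 58],
    "zip_code": ["110001", "110002", "110001", "110003"],
    "disease": ["Flu", "Flu", "Diabetes", "Asthma"],
})

released = suppress_column(generalise_age(patients), "zip_code")
print(released)  # ages appear as ranges, zip codes as '*'
```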
Consider a hospital patient dataset with attributes such as Age, ZIP Code, Gender, and Disease. If k = 4, every combination of Age, ZIP Code, and Gender must appear in at least four records before the dataset can be released. This makes it harder for an attacker to single out a specific individual from the published data.
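A minimal way to verify this property is to check that no combination of quasi-identifier values occurs fewer than k times. The sketch below assumes a pandas DataFrame and the column names from the example above.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every quasi-identifier combination appears in at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical usage for the hospital example (column names assumed):
# is_k_anonymous(patients, ["age", "zip_code", "gender"], k=4)
```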
From a regulatory standpoint, k-anonymity is often considered a baseline anonymisation technique. Once a dataset carries no realistic scope of re-identification, it is treated as anonymised and falls outside the scope of data protection obligations under the GDPR and the DPDPA.
The Operational Limitations of K-Anonymity
Despite providing a useful baseline, K-Anonymity has serious operational limitations when used in isolation.
The homogeneity attack occurs when all records within a k-anonymous group share the same value for a sensitive attribute. Imagine a 4-anonymous dataset in which every patient in a group sharing the same age range and PIN code has been diagnosed with the same disease. Even though an attacker cannot pinpoint which record is Paul's, they can conclude with certainty that Paul has that disease, since every record in his group shares it.
The background knowledge attack is equally dangerous. If an attacker knows something about a particular individual, for example that they are at low risk of a certain condition, they can rule out some of the few possible values in the k-anonymous group and infer the sensitive attribute with high confidence.
At a practical level, the more quasi-identifiers a dataset has, the harder it is to achieve even moderate values of k without significant data loss: each added attribute makes it harder to anonymise records while preserving the dataset's usefulness. This is known as the "curse of dimensionality."
L-Diversity: Plugging the Gaps K-Anonymity Left Open
To address these operational limitations of K-Anonymity, L-Diversity was proposed by Machanavajjhala et al. in 2006. Its core principle: each equivalence class (a group of records with identical quasi-identifier values) must contain at least l "well-represented" values for the sensitive attribute. In other words, the group you belong to must also be sufficiently diverse in the sensitive information it reveals.
There are three interpretations of "well-represented" in the L-diversity framework (a short verification sketch follows the list):
Distinct L-Diversity: It is the simplest form, where each group must contain at least l distinct sensitive values. This prevents homogeneity but does not account for skewed distributions.
Entropy L-Diversity: A stricter version that requires the entropy of the sensitive attribute distribution within each group to be at least log(l). This ensures that no single value dominates the group.
Recursive (c, l)-Diversity: A balanced approach that ensures the most frequent sensitive value does not appear too disproportionately relative to the rest. It offers more flexibility than entropy l-diversity and can preserve more data utility.
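As a rough illustration, the sketch below checks the distinct and entropy variants over the equivalence classes of a pandas DataFrame; the function and column parameters are assumptions for this example, not a standard API.

```python
import numpy as np
import pandas as pd

def is_distinct_l_diverse(df: pd.DataFrame, quasi_identifiers: list,
                          sensitive: str, l: int) -> bool:
    """Each equivalence class must contain at least l distinct sensitive values."""
    distinct_counts = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool((distinct_counts >= l).all())

def is_entropy_l_diverse(df: pd.DataFrame, quasi_identifiers: list,
                         sensitive: str, l: int) -> bool:
    """Each class must have sensitive-value entropy of at least log(l)."""
    for _, group in df.groupby(quasi_identifiers):
        p = group[sensitive].value_counts(normalize=True).to_numpy()
        if -np.sum(p * np.log(p)) < np.log(l):
            return False
    return True
```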
L-diversity meaningfully strengthens privacy by ensuring that even if an attacker knows which group a person belongs to, they cannot infer the sensitive attribute with high confidence. For datasets containing sensitive information such as medical diagnoses, financial information, or HR records, l-diversity is a significant improvement.
However, l-diversity is not without its own operational limitations. It remains vulnerable to two important attacks: the skewness attack, where the distribution of sensitive values within a group is very different from the overall dataset distribution (so an adversary with population knowledge can still make inferences), and the similarity attack, where all values in a group are semantically close, for example, all diseases in a group are types of cancer, which leaks sensitive information even when the values are technically distinct.
T-Closeness: The Distribution-Aware Standard
T-Closeness was introduced by Ninghui Li et al. in 2007 as a direct response to the shortcomings of l-diversity. It ensures that the distribution of sensitive values within each group closely mirrors the distribution of sensitive values across the entire dataset.
Formally, an equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in that class and the distribution in the overall table is no more than a threshold t. This distance is typically measured using the Earth Mover's Distance (EMD), the minimum "work" required to transform one distribution into another.
As an example: suppose a dataset contains Age and ZIP Code as quasi-identifiers, and Income as a sensitive attribute. With t = 0.1, the distribution of income values within any group of individuals sharing the same age and ZIP code must differ from the income distribution across the whole dataset by a distance of at most 0.1. This prevents an attacker from inferring that a particular group earns disproportionately high or low income, even if the group contains diverse income values.
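The sketch below approximates such a check for a numeric sensitive attribute using SciPy's one-dimensional Wasserstein (Earth Mover's) distance. Rescaling the attribute to [0, 1] so the distance is comparable to a threshold like 0.1 is an assumption made here for illustration, not part of the formal definition.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def is_t_close(df: pd.DataFrame, quasi_identifiers: list,
               sensitive: str, t: float) -> bool:
    """True if every equivalence class's sensitive-value distribution lies within
    EMD t of the overall distribution (numeric attribute rescaled to [0, 1])."""
    values = df[sensitive].astype(float)
    scaled = (values - values.min()) / (values.max() - values.min())
    overall = scaled.to_numpy()
    for _, group in df.assign(_scaled=scaled).groupby(quasi_identifiers):
        if wasserstein_distance(group["_scaled"].to_numpy(), overall) > t:
            return False
    return True

# Hypothetical usage: is_t_close(df, ["age", "zip_code"], "income", t=0.1)
```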
T-closeness provides the strongest semantic protection of the three models. It defends against both skewness and similarity attacks, which l-diversity cannot. Higher t thresholds relax the constraint and preserve more utility, while lower thresholds tighten privacy at the cost of data granularity.
Importantly, t-closeness aligns most closely with the way modern data protection regulators think about re-identification risk. Under the GDPR's Recital 26, data can only be considered anonymous if re-identification is not "reasonably likely", a test that implicitly requires assessing what an attacker with population-level knowledge might infer. T-closeness directly limits the information such an observer can extract from the published dataset.
Why Automated Implementation Is Essential
Real-world datasets evolve continuously, are shared across departments and vendors, and are processed at a scale that makes manual anonymisation unworkable. Applying k-anonymity, l-diversity, or t-closeness by hand is neither scalable nor reliable.
Automation is necessary for several reasons:
Dynamic data environments require that anonymisation parameters be re-evaluated each time a dataset changes. A threshold that worked last month may no longer hold as new records are added.
Purpose-linked anonymisation means that different use cases, such as a public research release versus an internal audit, may warrant different k, l, or t values. Automated systems can enforce context-specific policies consistently (an illustrative policy map is sketched after this list).
Audit trails and re-identification risk scoring must be applied uniformly across all systems, including those operated by third-party vendors or data processors. Automated logging ensures that every data access is traceable.
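As a purely illustrative example of how such purpose-linked policies might be expressed, the mapping below ties each release context to the thresholds it must satisfy. The purposes and numbers are invented placeholders, not recommendations; real values should come from a documented risk assessment.

```python
# Hypothetical purpose-linked policy map: use case -> required privacy parameters.
ANONYMISATION_POLICIES = {
    "public_research_release": {"k": 10, "l": 4, "t": 0.15},
    "internal_audit":          {"k": 5,  "l": 2, "t": 0.30},
    "vendor_data_share":       {"k": 20, "l": 5, "t": 0.10},
}

def required_parameters(purpose: str) -> dict:
    """Look up the k, l and t thresholds a given use case must meet."""
    return ANONYMISATION_POLICIES[purpose]
```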
While the GDPR's Article 25 explicitly mandates Privacy by Design and by Default, the DPDPA's Section 8(4) and Rule 6 together require appropriate technical and organisational measures to be embedded into data processing systems.
Tools like ARX (an open-source anonymisation platform) and cloud-based data protection services offer automated implementation. However, organisations processing sensitive data at large scale, such as government agencies, healthcare providers and financial institutions, need enterprise-grade compliance layers that integrate these privacy models into their data pipelines automatically, with full audit capability.
Choosing the Right Model for Your Dataset
No single privacy model is universally superior. The right choice depends on the nature of the sensitive data, the threat model, the regulatory context, and the acceptable level of data utility loss. The following table summarises the key differences:
| Parameter | K-Anonymity | L-Diversity | T-Closeness |
| --- | --- | --- | --- |
| Core Guarantee | Indistinguishable within a group of at least k records | Sensitive attribute diversity within each group | Group distributions mirror the overall distribution |
| Protects Against | Re-identification by quasi-identifiers | Homogeneity & background knowledge attacks | Skewness & similarity attacks |
| Key Weakness | Homogeneity & background knowledge attacks | Skewness & similarity attacks | High utility loss; complex to implement |
| Complexity | Low | Medium | High |
| Utility Loss | Low to Medium | Medium | Medium to High |
| Best Suited For | Large public datasets, basic de-identification | Medical/HR data with multiple sensitive values | Datasets shared with sophisticated adversaries |
| Regulatory Fit (GDPR/DPDPA) | Baseline compliance aid | Stronger anonymisation standard | Closest to GDPR's re-identification risk test |
As a practical decision framework:
Use K-Anonymity as a baseline for large public datasets where the adversary is unsophisticated and the sensitive attributes are well distributed. It is the simplest model to implement and the easiest to explain to non-technical stakeholders.
Upgrade to L-Diversity when the dataset contains multiple sensitive attributes or when there is any risk of homogeneity within anonymised groups.
Apply T-Closeness when the dataset will be shared with parties who have significant background knowledge, or when regulatory expectations require demonstrating that group-level distributions do not reveal individual-level information.
It is also worth noting that these models are complementary, not mutually exclusive. A dataset can simultaneously satisfy k-anonymity, l-diversity, and t-closeness, with each model providing an additional layer of assurance.
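A layered release check might therefore look like the sketch below, which simply combines the hypothetical helper functions from the earlier examples.

```python
def satisfies_all_models(df, quasi_identifiers, sensitive, k, l, t) -> bool:
    """Release only if every layer of assurance holds (helpers defined above)."""
    return (
        is_k_anonymous(df, quasi_identifiers, k)
        and is_distinct_l_diverse(df, quasi_identifiers, sensitive, l)
        and is_t_close(df, quasi_identifiers, sensitive, t)
    )
```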
Aligning Data Sharing with Privacy by Design
Choosing the right anonymisation model is not merely a technical decision; it reflects an organisation's broader privacy posture and is central to establishing digital trust. Under both the GDPR and the DPDPA, privacy is a design requirement, and data protection authorities have consistently made clear that anonymisation only holds if it genuinely withstands re-identification attempts.
Hence, Privacy by Design means embedding these models into the data architecture itself: selecting the right model for the right use and maintaining that posture over time. Digital trust among data subjects, whether patients, citizens or employees, can only grow when organisations handle their data with genuine care and transparency.
From Anonymisation Models to Implementation: GoTrust’s Platform
Statistical privacy models such as K-Anonymity, L-Diversity, and T-Closeness provide strong theoretical guarantees, but their effectiveness depends on consistent and scalable implementation. GoTrust's data privacy and governance platform bridges the gap between anonymisation theory and day-to-day data operations by embedding privacy controls directly into data ecosystems.
Automated Data Discovery & Classification
GoTrust enables organisations to discover and classify personal and sensitive data across structured and unstructured systems, including databases, cloud storage, and SaaS environments. By identifying quasi-identifiers and sensitive attributes, organisations can systematically apply K-Anonymity, L-Diversity, and T-Closeness models with contextual accuracy.
Anonymisation Workflow Orchestration
The platform supports automated anonymisation workflows, including generalisation, suppression, and risk-based transformations aligned with statistical privacy models. This ensures that anonymisation is not a one-time activity but a continuous, policy-driven process adaptable to dynamic datasets and evolving use cases.
Re-Identification Risk Assessment & Monitoring
GoTrust integrates risk scoring mechanisms to evaluate re-identification exposure across datasets. By simulating adversarial scenarios and measuring distributional risks, organisations can validate whether applied models meet acceptable thresholds under frameworks such as GDPR Recital 26 and the Digital Personal Data Protection Act, 2023.
Privacy-by-Design & Audit-Ready Compliance
Through automated logging, audit trails, and policy enforcement, GoTrust enables organisations to demonstrate privacy-by-design and privacy-by-default obligations. This ensures that anonymisation decisions, parameter selections (k, l, t), and data transformations remain traceable, reviewable, and defensible in regulatory audits.
Third-Party & Data Sharing Governance
As datasets are increasingly shared across vendors and partners, GoTrust provides governance controls to enforce anonymisation standards consistently across third-party environments. This reduces downstream re-identification risks and strengthens accountability across the data lifecycle.
Conclusion
K-Anonymity, L-Diversity, and T-Closeness each represent a meaningful advance in the science of privacy-preserving data publication. K-anonymity established the foundational principle that individuals should be indistinguishable within a group; l-diversity extended this to require meaningful variation in sensitive attributes; t-closeness took the most sophisticated view by requiring that group distributions mirror the population.
None of these models is perfect, and none offers a blanket guarantee of anonymity. As data protection regulators under both the GDPR and the DPDPA have made clear, achieving true anonymisation is contextual, risk-based, and ongoing. K-anonymity is a baseline, not a ceiling. L-diversity addresses its most glaring weaknesses. T-closeness comes closest to the re-identification risk test that modern regulators apply.
Organisations urgently need to build a privacy culture and protect data proactively rather than retrospectively. The question is not merely which model to choose, but whether the choice is appropriate to the data, the threat environment, and the regulatory expectations that govern it. Building the right privacy model into data systems by design is not just a compliance obligation. It is the foundation of trustworthy data sharing.




