Fairness in Machine Learning

NIPS 2017 Tutorial — Part I

Solon Barocas and Moritz Hardt

Slides available at
"[H]iring could become faster and less expensive, and […] lead recruiters to more highly skilled people who are better matches for their companies. Another potential result: a more diverse workplace. The software relies on data to surface candidates from a wide variety of places and match their skills to the job requirements, free of human biases."

Miller (2015)
"But software is not free of human influence. Algorithms are written and maintained by people, and machine learning algorithms adjust what they do based on people’s behavior. As a result […] algorithms can reinforce human prejudices."

Miller (2015)

Bias as a technical matter

Selection, sampling, reporting bias

Bias of an estimator

Inductive bias

Of course, these raise ethical issues, too

Isn’t discrimination the very point of machine learning?

Unjustified basis for differentiation

Practical irrelevance

Moral irrelevance

Discrimination is not a general concept

It is domain specific

Concerned with important opportunities that affect people’s life chances

It is feature specific

Concerned with socially salient qualities that have served as the basis for unjustified and systematically adverse treatment in the past

Regulated domains

  • Credit (Equal Credit Opportunity Act)
  • Education (Civil Rights Act of 1964; Education Amendments of 1972)
  • Employment (Civil Rights Act of 1964)
  • Housing (Fair Housing Act)
  • ‘Public Accommodation’ (Civil Rights Act of 1964)

Extends to marketing and advertising; not limited to final decision

This list sets aside the complex web of laws that regulate the government

Legally recognized ‘protected classes’

  • Race (Civil Rights Act of 1964)
  • Color (Civil Rights Act of 1964)
  • Sex (Equal Pay Act of 1963; Civil Rights Act of 1964)
  • Religion (Civil Rights Act of 1964)
  • National origin (Civil Rights Act of 1964)
  • Citizenship (Immigration Reform and Control Act)
  • Age (Age Discrimination in Employment Act of 1967)
  • Pregnancy (Pregnancy Discrimination Act)
  • Familial status (Civil Rights Act of 1968)
  • Disability status (Rehabilitation Act of 1973; Americans with Disabilities Act of 1990)
  • Veteran status (Vietnam Era Veterans' Readjustment Assistance Act of 1974; Uniformed Services Employment and Reemployment Rights Act)
  • Genetic information (Genetic Information Nondiscrimination Act)

Discrimination Law: Two Doctrines

Disparate Treatment




Disparate Impact




Disparate Treatment

Formal: explicitly considering class membership

Even if it is relevant

Intentional: purposefully attempting to discriminate without direct reference to class membership

Pretext or ‘motivating factor’

Disparate Impact

1. Plaintiff must first establish that decision procedure has a disparate impact

‘Four-fifths rule’

2. Defendant must provide a justification for making decisions in this way

‘Business necessity’ and ‘job-related’

3. Finally, plaintiff has opportunity to show that defendant could achieve same goal using a different procedure that would result in a smaller disparity

‘Alternative practice’
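
As a rough illustration of step 1, the four-fifths rule of thumb compares each group's selection rate with the highest group's rate and flags ratios below 0.8. The sketch below assumes hypothetical applicant counts and invented group labels; it is a screening heuristic, not a legal test.

```python
def selection_rates(outcomes):
    """Map each group to its selection rate.

    `outcomes` maps a group label to a (selected, total) pair of counts.
    """
    return {g: selected / total for g, (selected, total) in outcomes.items()}


def four_fifths_flags(outcomes, threshold=0.8):
    """Flag groups whose selection rate is below `threshold` times the highest rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    return {g: rate / best < threshold for g, rate in rates.items()}


# Hypothetical counts (hired, applied); not taken from any real case.
example = {"group_a": (50, 100), "group_b": (30, 100)}
flags = four_fifths_flags(example)
```

Here `group_b` is selected at 0.3 / 0.5 = 0.6 of the top rate, so it is flagged.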

What does discrimination law aim to achieve?

Disparate Treatment

Procedural fairness

Equality of opportunity

Disparate Impact

Distributive justice

Minimized inequality of outcome

Non-discrimination, equality of opportunity, and equality of outcome

Narrow notions of equality of opportunity are concerned with ensuring that decision-making treats similar people similarly on the basis of relevant features, given their current degree of similarity

Broader notions of equality of opportunity are concerned with organizing society in such a way that people of equal talents and ambition can achieve equal outcomes over the course of their lives

Somewhere in between is a notion of equality of opportunity that forces decision-making to treat seemingly dissimilar people similarly, on the belief that their current dissimilarity is the result of past injustice

Tension between disparate treatment and disparate impact

Ricci v. DeStefano

Texas House Bill 588

The incidence and persistence of discrimination

Callback rate 50% higher for applicants with white names than equally qualified applicants with black names
Bertrand, Mullainathan (2004)

No change in the degree of discrimination experienced by black job applicants over the past 25 years
Quillian, Pager, Hexel, Midtbøen (2017)

The benefits of formalizing decision-making

Formal procedures can limit opportunities to exercise prejudicial discretion or fall victim to implicit bias

Automated underwriting increased approval rates for minority and low-income applicants by 30% while improving the overall accuracy of default predictions
Gates, Perry, Zorn (2002)

The limits of formalization

Research has established that formal procedures still leave room for employers to exercise discretion selectively
Wilson, Sakura-Lemessy, West (1999)

and that bias still affects formal assessments
McKay, McDaniel (2006)

Machine learning as pinnacle of formal decision-making?

Only what the data supports?

Withhold protected features?

Automate decision-making, thereby limiting discretion?

How machines learn to discriminate

Skewed sample
Tainted examples
Limited features
Sample size disparity

Barocas, Selbst (2016)

Skewed sample

Police records measure “some complex interaction between criminality, policing strategy, and community-policing relations”
Lum, Isaac (2016)

Skewed sample: feedback loop

Future observations of crime confirm predictions

Fewer opportunities to observe crime that contradicts predictions

Initial bias may compound over time
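
A deliberately simplified toy model of this loop, assuming two districts with identical true crime rates and a single patrol that always goes where the most crime has been recorded. All numbers and the allocation rule are assumptions made for illustration only.

```python
def simulate_patrol_feedback(initial_records, true_rate, rounds):
    """Toy feedback loop.

    Each round the patrol visits the district with the most recorded crime,
    and only crime in the visited district gets added to the records.
    """
    records = list(initial_records)
    history = [tuple(records)]
    for _ in range(rounds):
        target = max(range(len(records)), key=lambda i: records[i])
        records[target] += true_rate[target]
        history.append(tuple(records))
    return history


# Both districts have the same true rate; the records start slightly skewed.
history = simulate_patrol_feedback(initial_records=[11, 10], true_rate=[5, 5], rounds=4)
```

The recorded gap grows from 1 to 21 even though the underlying rates are equal, because the second district is never observed again.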

Tainted examples

Tainted examples: three variants

Learn to predict hiring decisions

Learn to predict who will succeed on the job (e.g., annual review score)

Learn to predict how employees will score on objective measure (e.g., sales)

Limited features

Features may be less informative or less reliably collected for certain parts of the population

A feature set that supports accurate predictions for the majority group may not for a minority group

Different models with the same reported accuracy can have a very different distribution of error across population
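
A minimal sketch of the last point, assuming a hypothetical population of eight majority and two minority individuals and two invented models that each make exactly two errors.

```python
def overall_accuracy(records):
    """records: list of (group, y_true, y_pred)."""
    return sum(y == pred for _, y, pred in records) / len(records)


def accuracy_by_group(records):
    """Accuracy computed separately within each group."""
    totals, correct = {}, {}
    for group, y, pred in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (y == pred)
    return {g: correct[g] / totals[g] for g in totals}


# Model 1 makes both of its errors in the majority group.
model_1 = [("maj", 1, 1)] * 6 + [("maj", 1, 0)] * 2 + [("min", 1, 1)] * 2
# Model 2 makes both of its errors in the minority group.
model_2 = [("maj", 1, 1)] * 8 + [("min", 1, 0)] * 2
```

Both models report 80% accuracy, yet model 2 is wrong on every minority individual.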

Sample size disparity

Hardt (2014)


In many cases, making accurate predictions will mean considering features that are correlated with class membership

With sufficiently rich data, class memberships will be unavoidably encoded across other features


No self-evident way to determine when a relevant attribute is too correlated with proscribed features

Not a meaningful question when dealing with a large set of attributes

Discrimination Law: Two Doctrines

Disparate Treatment



Disparate Impact




Three different problems

Discovering unobserved differences in performance
Skewed sample
Tainted examples

Coping with observed differences in performance
Limited features
Sample size disparity

Understanding the causes of disparities in predicted outcome

Fairness in Machine Learning

NIPS 2017 Tutorial — Part II

Solon Barocas and Moritz Hardt

Pop-up ad

Running example: Hiring ad for (fictitious?) AI startup

Formal setup

  • $X$ features of an individual (browsing history etc.)
  • $A$ sensitive attribute (here, gender)
  • $C=c(X,A)$ predictor (here, show ad or not)
  • $Y$ target variable (here, whether the individual is a software engineer, SWE)

Note: random variables in the same probability space

Notation: $\mathbb{P}_a\{E\}=\mathbb{P}\{E\mid A=a\}.$

Formal setup

Score function is any random variable $R=r(X,A)\in[0,1].$

Can be turned into (binary) predictor by thresholding

Example: Bayes optimal score given by $r(x, a) = \mathbb{E}[Y\mid X=x, A=a]$
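
A small sketch of both ideas, assuming discrete features so that $\mathbb{E}[Y\mid X=x, A=a]$ can be estimated by a cell average. The sample data, the feature labels, and the 0.5 threshold are placeholders.

```python
from collections import defaultdict


def empirical_score(samples):
    """Estimate r(x, a) = E[Y | X=x, A=a] by the mean of Y in each (x, a) cell."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for x, a, y in samples:
        sums[(x, a)] += y
        counts[(x, a)] += 1
    return {cell: sums[cell] / counts[cell] for cell in counts}


def predict(score, threshold=0.5):
    """Turn a score in [0, 1] into a binary predictor by thresholding."""
    return 1 if score > threshold else 0


# Hypothetical samples (x, a, y).
samples = [
    ("x0", "a", 0), ("x0", "a", 0), ("x0", "a", 1), ("x0", "a", 0),
    ("x1", "a", 1), ("x1", "a", 1), ("x1", "a", 1), ("x1", "a", 0),
]
scores = empirical_score(samples)
decisions = {cell: predict(r) for cell, r in scores.items()}
```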

Three fundamental criteria

Independence: $C$ independent of $A$

Separation: $C$ independent of $A$ conditional on $Y$

Sufficiency: $Y$ independent of $A$ conditional on $C$

Lots of other criteria are related to these

First criterion: Independence

Require $C$ and $A$ to be independent, denoted $C\bot A$

That is, for all groups $a,b$ and all values $c$:
$\mathbb{P}_a\{C = c\} = \mathbb{P}_b\{C = c\}$

Variants of independence

Sometimes called demographic parity, statistical parity

When $C$ is a binary $0/1$ variable, this means
$\mathbb{P}_a\{C = 1\} = \mathbb{P}_b\{C = 1\}$ for all groups $a,b.$

Approximate versions:

$$ \frac{\mathbb{P}_a\{ C = 1 \}} {\mathbb{P}_b\{ C = 1 \}} \ge 1-\epsilon $$
$$ \left|\mathbb{P}_a\{ C = 1 \}- \mathbb{P}_b\{ C = 1 \}\right|\le\epsilon $$
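
Both approximate versions can be estimated from a sample of decisions. The sketch below assumes binary decisions and two hypothetical groups labelled "a" and "b".

```python
def positive_rate(decisions, groups, group):
    """Empirical estimate of P_group{C = 1}."""
    selected = [c for c, g in zip(decisions, groups) if g == group]
    return sum(selected) / len(selected)


def independence_gaps(decisions, groups, a, b):
    """Ratio and absolute difference of the two groups' positive rates."""
    pa = positive_rate(decisions, groups, a)
    pb = positive_rate(decisions, groups, b)
    return {"ratio": pa / pb, "difference": abs(pa - pb)}


# Hypothetical decisions for four members of each group.
decisions = [1, 1, 0, 0, 1, 1, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
gaps = independence_gaps(decisions, groups, "a", "b")
```

With rates 0.5 and 0.75, the ratio is 2/3 and the difference is 0.25, so the ratio version holds only for $\epsilon \ge 1/3$.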

Achieving independence

Post-processing: Feldman, Friedler, Moeller, Scheidegger, Venkatasubramanian (2014)

Training time constraint: Calders, Kamiran, Pechenizkiy (2009)

Pre-processing: Via representation learning — Zemel, Yu, Swersky, Pitassi, Dwork (2013) and Louizos, Swersky, Li, Welling, Zemel (2016); Via feature adjustment — Lum-Johndrow (2016)

Many more...

Representation learning approach

$\max I(X ; Z)$
$\min I(A ; Z)$
"A Fair and Rich Z."
—Rich Zemel
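
For discrete variables, both objectives can be estimated with a plug-in mutual information. This is only a sketch of the quantity being optimized, not of any of the cited training procedures; the paired samples are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(U; V) in bits from a list of (u, v) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(u for u, _ in pairs)
    right = Counter(v for _, v in pairs)
    return sum(
        (c / n) * log2(c * n / (left[u] * right[v]))
        for (u, v), c in joint.items()
    )


# A representation that copies A reveals one full bit about it...
leaky = [(0, 0), (0, 0), (1, 1), (1, 1)]
# ...while one that is independent of A reveals nothing.
hidden = [(0, 0), (0, 1), (1, 0), (1, 1)]
```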

Shortcomings of independence

Ignores possible correlation between $Y$ and $A$.
In particular, rules out perfect predictor $C=Y.$

Permits laziness:
Accept the qualified in one group, random people in other

Allows trading false negatives for false positives.

Conflates desirable long-term goal with algorithmic constraint

Second criterion: Separation

Require $R$ and $A$ to be independent conditional on target variable $Y$,
denoted $R\bot A \mid Y$

That is, for all groups $a,b$ and all values $r$ and $y$:
$\mathbb{P}_a\{R = r\mid Y=y\} = \mathbb{P}_b\{R = r\mid Y=y\}$

Definition.   Random variable $R$ separated from $A$ if $R\bot A\mid Y.$

Proposed in Hardt, Price, Srebro (2016);
Zafar, Valera, Rodriguez, Gummadi (2016)
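
For a binary predictor and binary target, separation amounts to equal true positive and false positive rates across groups. A minimal check on hypothetical labels and predictions:

```python
def rates_by_group(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions."""
    counts = {}
    for y, c, g in zip(y_true, y_pred, groups):
        tp, pos, fp, neg = counts.get(g, (0, 0, 0, 0))
        if y == 1:
            pos, tp = pos + 1, tp + c
        else:
            neg, fp = neg + 1, fp + c
        counts[g] = (tp, pos, fp, neg)
    return {g: (tp / pos, fp / neg) for g, (tp, pos, fp, neg) in counts.items()}


# Hypothetical data: the predictor is perfect on group "b" but not on "a".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a"] * 4 + ["b"] * 4
rates = rates_by_group(y_true, y_pred, groups)
```

The two groups end up with different (TPR, FPR) pairs, so this predictor is not separated.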

Desirable properties of separation

Optimality compatibility
$R=Y$ is allowed

Penalizes laziness
Incentive to reduce errors uniformly in all groups

Recall, neither of these is achieved by independence.

Achieving separation

Method from Hardt, Price, Srebro (2016):
Post-processing correction of score function

Post-processing: Any thresholding of $R$ (possibly depending on $A$)
No retraining/changes to $R$

Given score $R$, plot (TPR, FPR) for all possible thresholds

Look at ROC curve for each group

Feasible region: Trade-offs realizable in all groups

Given cost for (FP, FN), calculate optimal point in feasible region
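
The steps above can be sketched for two groups as follows. This simplified version only searches deterministic group-specific thresholds whose (FPR, TPR) points coincide; the published method also randomizes between thresholds to reach interior points of the feasible region. The scores, labels, and unit costs are hypothetical.

```python
def roc_points(scores, labels):
    """Return {threshold: (FPR, TPR)} for the predictor 'score >= threshold'."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = {}
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points[t] = (fp / neg, tp / pos)
    return points


def equalized_thresholds(points_a, points_b, cost_fp=1.0, cost_fn=1.0, tol=1e-9):
    """Cheapest pair of thresholds giving both groups the same (FPR, TPR).

    Returns (cost, threshold_a, threshold_b), or None if no pair matches.
    """
    best = None
    for ta, (fpr_a, tpr_a) in points_a.items():
        for tb, (fpr_b, tpr_b) in points_b.items():
            if abs(fpr_a - fpr_b) > tol or abs(tpr_a - tpr_b) > tol:
                continue
            cost = cost_fp * fpr_a + cost_fn * (1 - tpr_a)
            if best is None or cost < best[0]:
                best = (cost, ta, tb)
    return best


# Hypothetical scores: group b is scored systematically lower than group a.
points_a = roc_points([0.9, 0.7, 0.6, 0.2], [1, 1, 0, 0])
points_b = roc_points([0.6, 0.4, 0.3, 0.1], [1, 1, 0, 0])
best = equalized_thresholds(points_a, points_b)
```

A single shared threshold of 0.5 would give the groups different error rates here, while thresholds 0.7 and 0.4 equalize them.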

Post-processing guarantees

Optimality preservation: If $R$ is close to Bayes optimal, then the output of post-processing is close to optimal among all separated scores.

This does not mean it's necessarily good!

Alternatives to post-processing:
(1) Collect more data.
(2) Achieve constraint at training time.

Via optimization at training time

Explored by Woodworth-Gunasekar-Ohannessian-Srebro (2017).

Fix function class ${\cal H}$ and loss function $\ell$, and solve \[ \min_{h\in{\cal H}}\mathbb{E}\,\ell(h(X, A), Y) \] \[ \text{ s.t. } h(X,A)\bot A\mid Y \]

Highly intractable.
Hence, consider moment relaxation of separation: \[ \sigma_{RA}\sigma_{Y}^2 = \sigma_{RY}\sigma_{YA} \] where $\sigma_{UV}=\mathbb{E}(U-\mathbb{E}U)(V-\mathbb{E}V)$ is the covariance.
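
The moment condition can be evaluated directly on a sample. The sketch below uses population-style covariances on hypothetical data; it checks only second moments, so a score can pass it without being separated.

```python
def covariance(u, v):
    """Covariance E[(U - EU)(V - EV)] computed over paired samples."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v)) / len(u)


def separation_moment_gap(r, a, y):
    """Left side minus right side of sigma_RA * sigma_Y^2 = sigma_RY * sigma_YA."""
    return covariance(r, a) * covariance(y, y) - covariance(r, y) * covariance(y, a)


# Hypothetical binary target and sensitive attribute.
y = [1, 1, 0, 0, 1, 0]
a = [1, 0, 0, 0, 1, 1]
gap_perfect = separation_moment_gap(y, a, y)  # score R = Y
gap_proxy = separation_moment_gap(a, a, y)    # score R = A
```

The perfect score $R=Y$ satisfies the condition, while a score that simply copies $A$ does not.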

Third criterion: Sufficiency

Definition.   Random variable $R$ is sufficient for $A$ if $Y\bot A\mid R.$

Why sufficiency?

For the purpose of predicting $Y$,
we don't need to see $A$ when we have $R.$

Note: Sufficiency satisfied by Bayes optimal score $r(x,a)=\mathbb{E}[Y\mid X=x,A=a].$

How to achieve sufficiency?

Sufficiency implied by calibration by group:
\[ \mathbb{P}\{ Y = 1 \mid R = r, A = a \} = r \]

Calibration by group can be achieved by
various standard calibration methods
(if necessary, applied for each group).

Calibration via Platt scaling

Given uncalibrated score $R$, fit a sigmoid function
$S = \frac{1}{1+\exp(\alpha R + \beta)}$ against target $Y$

For instance by minimizing log loss $-\mathbb{E}[Y\log S + (1-Y)\log(1-S)]$
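
A minimal sketch of this fit using plain gradient descent on the log loss, keeping the sign convention $S = 1/(1+\exp(\alpha R+\beta))$ from above. The learning rate, step count, and data are assumptions; for calibration by group, the same fit would be run separately on each group's data.

```python
from math import exp, log


def sigmoid_score(r, alpha, beta):
    """S = 1 / (1 + exp(alpha * r + beta))."""
    return 1.0 / (1.0 + exp(alpha * r + beta))


def log_loss(scores, labels, alpha, beta):
    """Mean negative log-likelihood of the labels under the sigmoid."""
    eps = 1e-12
    total = 0.0
    for r, y in zip(scores, labels):
        s = min(max(sigmoid_score(r, alpha, beta), eps), 1 - eps)
        total -= y * log(s) + (1 - y) * log(1 - s)
    return total / len(scores)


def fit_platt(scores, labels, lr=0.5, steps=2000):
    """Fit (alpha, beta) by gradient descent on the log loss."""
    alpha, beta = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_alpha = grad_beta = 0.0
        for r, y in zip(scores, labels):
            # d(loss)/d(alpha * r + beta) equals (y - S) for this parameterization.
            residual = y - sigmoid_score(r, alpha, beta)
            grad_alpha += residual * r / n
            grad_beta += residual / n
        alpha -= lr * grad_alpha
        beta -= lr * grad_beta
    return alpha, beta


# Hypothetical uncalibrated scores and outcomes (not perfectly separable).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
outcomes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
alpha, beta = fit_platt(raw_scores, outcomes)
```

With this sign convention, a negative fitted $\alpha$ means the calibrated score increases with $R$.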

Trade-offs are necessary

Any two of the three criteria we saw are
mutually exclusive except in degenerate cases.

Trade-offs: Independence vs Sufficiency

Proposition.   If $A\not\bot Y,$ then either independence holds or sufficiency but not both.
If $A\not\bot Y$ and $A\bot Y\mid R,$ then $A\not\bot R.$

Trade-offs: Independence vs Separation

Proposition.   If $A\not\bot Y$ and $R\not\bot Y,$ then either independence holds or separation but not both.
If $R\bot A$ and $R\bot A\mid Y,$ then either $A\bot Y$ or $R\bot Y.$

Trade-offs: Separation vs Sufficiency

Proposition.   Assume all events in the joint distribution of $(A,R,Y)$ have positive probability. If $A\not\bot Y,$ then either separation holds or sufficiency but not both.

Variants observed by Chouldechova (2016);
Kleinberg, Mullainathan, Raghavan (2016).

Standard fact (see Wasserman Theorem 17.2):
$A\bot R\mid Y$ and $A\bot Y\mid R$ imply $A\bot (R, Y)$, which in turn implies $A\bot Y$.
By contraposition, $A\not\bot Y$ implies either $A\not\bot R\mid Y$ or $A\not\bot Y\mid R$.
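
A small numeric illustration of the trade-off, assuming two hypothetical groups whose scores are calibrated by group but whose base rates differ. All counts are invented, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def is_calibrated(cells):
    """cells maps score -> (positives, negatives); check P{Y=1 | R=r} = r."""
    return all(Fraction(p, p + n) == score for score, (p, n) in cells.items())


def error_rates(cells, threshold):
    """Return (FPR, TPR) of the predictor 'score > threshold'."""
    pos = sum(p for p, _ in cells.values())
    neg = sum(n for _, n in cells.values())
    tp = sum(p for score, (p, _) in cells.items() if score > threshold)
    fp = sum(n for score, (_, n) in cells.items() if score > threshold)
    return Fraction(fp, neg), Fraction(tp, pos)


low, high = Fraction(1, 5), Fraction(4, 5)
# Group a: base rate 1/2. Group b: base rate 8/25.
group_a = {low: (10, 40), high: (40, 10)}
group_b = {low: (16, 64), high: (16, 4)}
rates_a = error_rates(group_a, Fraction(1, 2))
rates_b = error_rates(group_b, Fraction(1, 2))
```

Both groups satisfy calibration, yet the false positive rates are 1/5 and 1/17, so separation fails.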

Visualizing trade-offs


Poster session on Wed Dec 6th 6:30–10:30p @ Pacific Ballroom #74

The COMPAS debate

Essence of COMPAS debate

ProPublica's main charge:

Black defendants face higher false positive rate.

Northpointe's main defense:

Scores are calibrated by group.

Word of caution about COMPAS debate

Corbett-Davies, Pierson, Feller, Goel, Huq (2017):

Neither calibration nor equality of false positive rates
rule out blatantly unfair practices.

Calibration is insufficient

probabilities of reoffending
detain all above 0.5
average reoffending rate 0.4
calibrated new scores
all below 0.5
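
One way to reproduce this effect numerically, assuming a hypothetical group of ten people with an average reoffending rate of 0.4. The individual scores and outcomes are invented for illustration.

```python
from math import isclose


def calibration_table(scores, outcomes):
    """Map each score value to the observed outcome rate among people with that score."""
    totals, positives = {}, {}
    for s, y in zip(scores, outcomes):
        totals[s] = totals.get(s, 0) + 1
        positives[s] = positives.get(s, 0) + y
    return {s: positives[s] / totals[s] for s in totals}


def detained(scores, threshold=0.5):
    """Number of people whose score exceeds the detention threshold."""
    return sum(1 for s in scores if s > threshold)


# Hypothetical outcomes: 1 of 5 low-risk and 3 of 5 high-risk people reoffend.
outcomes = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
fine_scores = [0.2] * 5 + [0.6] * 5
# Replace every score by the group average: still calibrated, but coarser.
coarse_scores = [0.4] * 10
```

Both score assignments are calibrated, but the fine scores detain five people and the coarse scores detain none.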

Detention rates and FPR uninformative

arrest more low risk individuals

How about other criteria?

Can we address the shortcomings of
independence, separation, sufficiency
with other criteria?

There's a fundamental issue...



All criteria we've seen so far are observational.

Passive observation of the world

No what-if scenarios or interventions

This leads to inherent limitations

Observational criteria

Definition.  A criterion is observational if it's a property of the joint distribution of features $X,A$, classifier $C$, outcome $Y$.


Anything you can write down as a probability statement involving $X, A, C, Y.$

BTW, what we saw only used $A, C, Y.$

Limitations of observational criteria

Hardt, Price, Srebro (2016):

There are two scenarios with identical joint distributions,
but completely different interpretations for fairness.

In particular, no observational definition
can distinguish the two scenarios.

Scenario I

$X_1$: visited
$X_2$: visited
separated score

Scenario II

$X_1$: obtained CS degree
Grace Hopper
separated score

Proposition. [Hardt, Price, Srebro (2016)]   The two scenarios admit identical joint distributions.

No observational criterion can distinguish them.

What do we make of this?

Answer to substantive social questions not
always provided by observational data.

This is part of what motivates causal reasoning.

Causal graphs


Directed graphical model with extra structure

Structural equation: $V \leftarrow f_V(U, W, N_V)$

Describes how data is generated from independent noise variables $\{N_V\}$

Examining paths in causal graphs

Inspired by Pearl's analysis of Bickel's UC Berkeley sex bias study.

Gender bias in admissions explained by
influence of gender on department choice.

Formally, assuming plausible causal graph,
only path from $A$ (gender) to decision goes through department

And, we decide that this is okay.

Examining paths in causal graphs

In Scenario II, only path from $A$ to $R^*$ goes through CS:


In Scenario I, there is a path from $A$ to $R^*$ through pinterest:




Structural equation: $V \leftarrow f_V(U, W, N_V)$

Intervention $\mathrm{do}(W\!\!\leftarrow\!\! w)$: Replace $W$ by $w$ in all structural equations

New structural equation: $V \leftarrow f_V(U, {\color{red}w}, N_V)$

Allows setting variables against their natural inclination.

Some formal possibilities

Average causal effect of $A$ on score $R$
$\mathbb{E}[ R \mid \mathrm{do}(A=a) ] - \mathbb{E}[R \mid \mathrm{do}(A=b) ]$

Average causal effect in context $X=x$
$\mathbb{E}[ R \mid \mathrm{do}(A=a), X=x ] - \mathbb{E}[R \mid \mathrm{do}(A=b), X=x ]$
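
A toy structural model makes the first quantity concrete. The structural equations, noise distributions, and coefficients below are assumptions chosen so that expectations can be computed exactly by enumerating the noise.

```python
from fractions import Fraction
from itertools import product

# Independent, uniformly distributed noise variables (hypothetical).
NOISE = {"A": [0, 1], "X": [0, 1, 2, 3]}


def run_model(n_a, n_x, do_a=None):
    """Evaluate the structural equations; do_a replaces A in every equation."""
    a = n_a if do_a is None else do_a            # A <- N_A, or the intervened value
    x = a if n_x != 0 else 1 - a                 # X agrees with A with probability 3/4
    r = Fraction(1, 2) * x + Fraction(1, 4) * a  # R <- X/2 + A/4
    return {"A": a, "X": x, "R": r}


def expected_score(do_a=None):
    """E[R], or E[R | do(A = do_a)], by exact enumeration over the noise."""
    outcomes = [
        run_model(n_a, n_x, do_a)["R"]
        for n_a, n_x in product(NOISE["A"], NOISE["X"])
    ]
    return sum(outcomes) / len(outcomes)


def average_causal_effect(a, b):
    """E[R | do(A=a)] - E[R | do(A=b)]."""
    return expected_score(do_a=a) - expected_score(do_a=b)
```

In this toy model the effect of $A$ on $R$ is 1/2: a direct contribution of 1/4 plus 1/4 that flows through $X$.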

Feasibility of interventions

But can we actually intervene on sensitive attributes (gender, race)?

Practically, generally speaking, no!

Is it conceptually possible and meaningful? Perhaps sometimes.

Advantages of proxy interventions

Consider proxies instead of underlying sensitive attributes
Kilbertus, Rojas-Carulla, Parascandolo, Hardt, Janzing, Schölkopf (2017)
Closely related: Nabi, Shpitser (2017)

Interventions on proxies often more feasible:

  • Effect of parental leave on promotion decisions?
  • Effect of visiting pinterest.com on hiring ad?
  • Effect of name on resume screening application?

Another formal possibility: Counterfactuals

What would've happened had I been
of a different gender when applying to this job?

Leads to notion of counterfactual fairness
in Kusner, Loftus, Russell, Sliva (2017).
See talk at NIPS on Wednesday 4:50p, Hall C
Also, Russell, Kusner, Loftus, Sliva (2017).
Poster session Wed 6:30p, Pacific Ballroom #191

A hierarchy of possibilities

  • Inspect meaning of features: No causal inference necessary
  • Inspect paths in causal model: Qualitative causal understanding
  • Estimate average causal effects: Causal inference and assumptions
  • Estimate individual level counterfactuals: Strong quantitative causal understanding

Insights often depend strongly on model and assumptions!

Matchings for causal inference

Idea: match similar units in treatment and control group

Use matching for estimating causal effect

Variety of techniques, such as propensity scores

Closely related to individual fairness.
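
A minimal nearest-neighbor sketch of the matching idea, assuming a single numeric covariate and hypothetical units. Propensity-score matching would match on an estimated probability of treatment instead of the raw covariate.

```python
def matched_effect(treated, control):
    """Average outcome difference between each treated unit and its closest control.

    Units are (covariate, outcome) pairs; closeness is absolute covariate distance.
    """
    diffs = []
    for x_t, y_t in treated:
        _, y_c = min(control, key=lambda unit: abs(unit[0] - x_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)


def naive_effect(treated, control):
    """Difference of raw outcome means, ignoring covariates."""
    mean_t = sum(y for _, y in treated) / len(treated)
    mean_c = sum(y for _, y in control) / len(control)
    return mean_t - mean_c


# Hypothetical data: one control unit is very unlike any treated unit.
treated = [(1.0, 5.0), (2.0, 7.0)]
control = [(1.1, 4.0), (2.2, 6.0), (5.0, 20.0)]
```

Matching compares like with like and estimates an effect of 1.0, while the naive comparison is dominated by the dissimilar control unit.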

Individual fairness

Dwork, Hardt, Pitassi, Reingold, Zemel (2011)

Assume task specific dissimilarity measure $d(x,x')$

Require similar individuals map to similar distributions over outcomes
via map $M\colon{\cal X}\to\Delta({\cal O})$:

$D(M(x), M(x')) \le d(x, x')$
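
The condition can be checked pairwise once $D$ and $d$ are fixed. The sketch below takes $D$ to be total variation distance and $d$ to be the absolute difference of a single hypothetical feature; both are illustrative choices, as are the individuals and their outcome distributions.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def lipschitz_violations(mapping, metric):
    """Pairs (x, x') with D(M(x), M(x')) > d(x, x')."""
    return [
        (x, y)
        for x, y in combinations(sorted(mapping), 2)
        if total_variation(mapping[x], mapping[y]) > metric(x, y)
    ]


# Hypothetical individuals described by one numeric feature.
features = {"u": 0.0, "v": 0.5, "w": 0.625}
mapping = {
    "u": {"hire": 0.75, "reject": 0.25},
    "v": {"hire": 0.5, "reject": 0.5},
    "w": {"hire": 0.125, "reject": 0.875},
}
violations = lipschitz_violations(
    mapping, lambda x, y: abs(features[x] - features[y])
)
```

Individuals "v" and "w" are close under $d$ (0.125 apart) but receive outcome distributions 0.375 apart, so that pair violates the condition.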

Friedler, Scheidegger, Venkatasubramanian (2016)
Construct space
Observed space
IQ (e.g., Stanford-Binet scale)
Duckworth Grit Scale (aka NYTimes grit quiz)

Where do features come from?

Enter measurement

"the #1 neglected topic in statistics"
Andrew Gelman

We'll barely even scratch the surface

See Hand (2010) for more.

Forgotten controversies?

Measurement affects scale of data

Nominal, ordinal, interval, ratio scales

How does scale affect the interpretation of statistical analyses?

Stevens (1951, p. 26):
Most of the scales used widely and effectively by psychologists are ordinal scales. In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these scales, for these statistics imply a knowledge of something more than the relative rank order of data. On the other hand, for this “illegal” statisticizing there can be invoked a kind of pragmatic sanction: in numerous instances it leads to fruitful results.

Classical representational measurement

Explicit distinction between empirical relational system
and numerical relational system

Formal representation results (e.g., isomorphism exists)


  • Empirical relationship: Cup A "bigger" than cup B if you can pour cup B into A without overflowing.
  • Numerical system: Cups assigned to real numbers based on their volume. Relation "bigger" becomes "$>$".

Measurement in the social sciences

Often "pragmatic": Measurement procedure defines the concept

Latent variable models figure prominently
(e.g., item-response models, Rasch models)

Establishing validity of measurement is difficult, and often subjective

Construct Validity

Different criteria according to Messick:
  • Content: Do test items appear to be measuring the construct of interest?
  • Substantive: Is the construct supported by sound theoretical foundations?
  • Structural: Does the score reflect relationships in the construct domain?
  • External: Does the score successfully predict external target variables?
  • Generalizability: Does the score generalize across different populations, settings, tasks?
  • Consequential: What are the potential risks of using the score with regard to bias, fairness, distributive justice?


Observational criteria can help discover discrimination,
but are insufficient on their own.

No conclusive proof of (un-)fairness

Causal viewpoint can help articulate problems, organize assumptions

Social questions start with measurement

Human scrutiny and expertise irreplaceable


ML is domain-specific: We need to understand legal and social context

Besides inspecting models,
scrutinize data and how it was generated

Besides static one-shot problems,
study long-term effects, feedback loops, and interventions

Establish qualitative understanding of
when/why ML is the right tool for the application

Establish understanding of what constitutes negligence

Thank you.

Even a garbage fire brings illumination. — Paul Ford