Calculating the Jaccard Coefficient

Ross Bulat
Full Stack Engineer

Introduction

This portfolio activity calculates the Jaccard coefficient for three pairs of individuals using a small table of pathological test results. The three individuals are Jack, Mary and Jim, and the required comparisons are:

  • Jack and Mary
  • Jack and Jim
  • Jim and Mary

What is the Jaccard coefficient?

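The Jaccard coefficient compares two records described by asymmetric binary attributes, where a value of 1 (for example, a positive test) carries more information than a value of 0. In the form used in this activity, it is the ratio of mismatches to the sum of mismatches and matches where both records have a value of 1. Computed this way it measures dissimilarity: 0 means the two records agree on every attribute where either has a 1, and 1 means they share no 1-valued attribute.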

The formula is:

J(A, B) = (f₀₁ + f₁₀) / (f₀₁ + f₁₀ + f₁₁)

Where:

  • f₁₁ = number of attributes where both records have value 1
  • f₀₁ = number of attributes where record A has 0 and record B has 1
  • f₁₀ = number of attributes where record A has 1 and record B has 0
  • f₀₀ = number of attributes where both records have value 0; these matches are ignored and appear in neither the numerator nor the denominator
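As a minimal sketch, the formula can be written as a small Python function. The function name and the handling of an all-zero denominator are my own assumptions for illustration, not part of the activity.

```python
def jaccard_from_counts(f11: int, f01: int, f10: int) -> float:
    """Ratio of mismatches to mismatches plus 1-1 matches."""
    denominator = f01 + f10 + f11
    if denominator == 0:
        # Assumption: two records with no 1-valued attributes at all
        # are treated as having no mismatches.
        return 0.0
    return (f01 + f10) / denominator


# Hypothetical counts, chosen only to show the arithmetic: 2 / (2 + 3)
print(jaccard_from_counts(f11=3, f01=1, f10=1))  # 0.4
```

Because 0-0 matches never enter the function, adding more attributes on which both records are 0 leaves the result unchanged.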

For this activity, each person is treated as a set of attribute-value pairs, excluding the Name field. For example, Jack’s record is represented as:

  • Gender = M
  • Fever = Y
  • Cough = N
  • Test-1 = P
  • Test-2 = N
  • Test-3 = N
  • Test-4 = A

Binary Conversion

Asymmetric variables are converted to binary:

  • Y and P → 1
  • N and A → 0

Gender is not converted because it is a symmetric variable: male and female carry equal weight, so neither value can meaningfully be coded as 1. It takes no part in the calculation and is kept only as a check that the records are comparable.

After binary conversion, each person has 6 binary attributes (excluding Name and the unconverted Gender). For example, Jack's record becomes:

  • Gender = M (not converted)
  • Fever = Y → 1
  • Cough = N → 0
  • Test-1 = P → 1
  • Test-2 = N → 0
  • Test-3 = N → 0
  • Test-4 = A → 0
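The conversion above can be sketched in Python with a lookup table. The names `TO_BINARY`, `ATTRIBUTES` and `to_binary` are hypothetical, and the attribute order is simply the order used in this post.

```python
# Mapping taken from the text: Y and P become 1, N and A become 0.
TO_BINARY = {"Y": 1, "P": 1, "N": 0, "A": 0}

# The six asymmetric attributes; Name and Gender are deliberately left out.
ATTRIBUTES = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]


def to_binary(record):
    """Return the 0/1 vector for the six asymmetric attributes of one record."""
    return [TO_BINARY[record[name]] for name in ATTRIBUTES]


jack = {"Gender": "M", "Fever": "Y", "Cough": "N", "Test-1": "P",
        "Test-2": "N", "Test-3": "N", "Test-4": "A"}
print(to_binary(jack))  # [1, 0, 1, 0, 0, 0]
```

A value outside Y, P, N and A raises a `KeyError`, which seems preferable here to silently guessing a 0 or 1.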

Source Data

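| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|------|--------|-------|-------|--------|--------|--------|--------|
| Jack | M      | Y     | N     | P      | N      | N      | A      |
| Mary | F      | Y     | N     | P      | A      | P      | N      |
| Jim  | M      | Y     | P     | N      | N      | N      | A      |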

Calculation Method

For each pair of individuals, I converted asymmetric variables to binary (Y & P = 1; N & A = 0), excluded Gender from binary conversion, and then compared the 6 binary attributes.

For each pair, I calculated:

  1. f₁₁: Count of attributes where both individuals have value 1
  2. f₀₁ + f₁₀: Count of attributes where one has 1 and the other has 0
  3. The Jaccard coefficient using: J = (f₀₁ + f₁₀) / (f₀₁ + f₁₀ + f₁₁)
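The counting steps can be sketched as follows, assuming each record has already been converted to a list of 0s and 1s. `count_pairs` is a hypothetical helper name, and the two vectors are made up purely to show the counting.

```python
from collections import Counter


def count_pairs(a, b):
    """Return (f11, f01, f10) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    counts = Counter(zip(a, b))  # tallies (0, 0), (0, 1), (1, 0), (1, 1)
    return counts[(1, 1)], counts[(0, 1)], counts[(1, 0)]


# Made-up vectors for illustration only
f11, f01, f10 = count_pairs([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
mismatches = f01 + f10
print(f11, mismatches, mismatches / (mismatches + f11))  # 2 2 0.5
```

The (0, 0) tally is counted by `Counter` but never read, which mirrors the way 0-0 matches are ignored in the formula.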

Pair 1: Jack and Mary

Original Data

Jack: Gender=M, Fever=Y, Cough=N, Test-1=P, Test-2=N, Test-3=N, Test-4=A
Mary: Gender=F, Fever=Y, Cough=N, Test-1=P, Test-2=A, Test-3=P, Test-4=N

Binary Conversion

Jack (binary): Gender=M (not converted), 1, 0, 1, 0, 0, 0
Mary (binary): Gender=F (not converted), 1, 0, 1, 0, 1, 0

Attribute Comparison

Comparing the 6 binary attributes (skipping Gender):

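| Position | Attribute | Jack | Mary | Match Type     |
|----------|-----------|------|------|----------------|
| 1        | Fever     | 1    | 1    | f₁₁ (both 1)   |
| 2        | Cough     | 0    | 0    | Ignored        |
| 3        | Test-1    | 1    | 1    | f₁₁ (both 1)   |
| 4        | Test-2    | 0    | 0    | Ignored        |
| 5        | Test-3    | 0    | 1    | f₀₁ (mismatch) |
| 6        | Test-4    | 0    | 0    | Ignored        |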
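The Match Type column can be reproduced with a small lookup keyed on the pair of values. This is only a sketch; the names `LABELS` and `classify` are my own.

```python
ATTRIBUTES = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]

# (first person's value, second person's value) -> match type
LABELS = {(1, 1): "f11", (0, 1): "f01", (1, 0): "f10", (0, 0): "ignored"}

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]


def classify(a, b):
    """Label each attribute position by how the two records compare."""
    return [LABELS[pair] for pair in zip(a, b)]


for name, label in zip(ATTRIBUTES, classify(jack, mary)):
    print(f"{name}: {label}")
```

For Jack and Mary this labels Fever and Test-1 as f11, Test-3 as f01, and the remaining three attributes as ignored.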

Calculation

  • f₁₁ = 2 (Fever, Test-1)
  • f₀₁ + f₁₀ = 1 (Test-3)

Jaccard coefficient:

J(Jack, Mary) = 1 / (1 + 2) = 1/3 ≈ 0.333

Result

The Jaccard coefficient for Jack and Mary is:

0.33


Pair 2: Jack and Jim

Original Data

Jack: Gender=M, Fever=Y, Cough=N, Test-1=P, Test-2=N, Test-3=N, Test-4=A
Jim: Gender=M, Fever=Y, Cough=P, Test-1=N, Test-2=N, Test-3=N, Test-4=A

Binary Conversion

Jack (binary): Gender=M (not converted), 1, 0, 1, 0, 0, 0
Jim (binary): Gender=M (not converted), 1, 1, 0, 0, 0, 0

Attribute Comparison

Comparing the 6 binary attributes (skipping Gender):

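| Position | Attribute | Jack | Jim | Match Type     |
|----------|-----------|------|-----|----------------|
| 1        | Fever     | 1    | 1   | f₁₁ (both 1)   |
| 2        | Cough     | 0    | 1   | f₀₁ (mismatch) |
| 3        | Test-1    | 1    | 0   | f₁₀ (mismatch) |
| 4        | Test-2    | 0    | 0   | Ignored        |
| 5        | Test-3    | 0    | 0   | Ignored        |
| 6        | Test-4    | 0    | 0   | Ignored        |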

Calculation

  • f₁₁ = 1 (Fever)
  • f₀₁ + f₁₀ = 2 (Cough, Test-1)

Jaccard coefficient:

J(Jack, Jim) = 2 / (2 + 1) = 2/3 ≈ 0.667

Result

The Jaccard coefficient for Jack and Jim is:

0.67
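Since 2/3 has no exact decimal form, it may help to keep the exact fraction and round only when reporting. A small sketch using Python's standard `fractions` module, with the counts taken from the calculation above:

```python
from fractions import Fraction

f11 = 1          # Fever
mismatches = 2   # Cough (f01) and Test-1 (f10)

exact = Fraction(mismatches, mismatches + f11)
print(exact)                   # 2/3
print(f"{float(exact):.3f}")   # 0.667
print(f"{float(exact):.2f}")   # 0.67
```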


Pair 3: Jim and Mary

Original Data

Jim: Gender=M, Fever=Y, Cough=P, Test-1=N, Test-2=N, Test-3=N, Test-4=A
Mary: Gender=F, Fever=Y, Cough=N, Test-1=P, Test-2=A, Test-3=P, Test-4=N

Binary Conversion

Jim (binary): Gender=M (not converted), 1, 1, 0, 0, 0, 0
Mary (binary): Gender=F (not converted), 1, 0, 1, 0, 1, 0

Attribute Comparison

Comparing the 6 binary attributes (skipping Gender):

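| Position | Attribute | Jim | Mary | Match Type     |
|----------|-----------|-----|------|----------------|
| 1        | Fever     | 1   | 1    | f₁₁ (both 1)   |
| 2        | Cough     | 1   | 0    | f₁₀ (mismatch) |
| 3        | Test-1    | 0   | 1    | f₀₁ (mismatch) |
| 4        | Test-2    | 0   | 0    | Ignored        |
| 5        | Test-3    | 0   | 1    | f₀₁ (mismatch) |
| 6        | Test-4    | 0   | 0    | Ignored        |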

Calculation

  • f₁₁ = 1 (Fever)
  • f₀₁ + f₁₀ = 3 (Cough, Test-1, Test-3)

Jaccard coefficient:

J(Jim, Mary) = (3) / (3 + 1) = 3/4 = 0.75

Result

The Jaccard coefficient for Jim and Mary is:

0.75
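The same result can be cross-checked by treating each person as the set of attributes on which they have a 1. This set-based view is an alternative sketch, not the method used above, and the variable names are my own.

```python
# Attributes with value 1 after binary conversion
jim_positive = {"Fever", "Cough"}
mary_positive = {"Fever", "Test-1", "Test-3"}

both = jim_positive & mary_positive       # attributes counted in f11
either = jim_positive | mary_positive     # f11 + f01 + f10
only_one = jim_positive ^ mary_positive   # f01 + f10 (symmetric difference)

print(sorted(both))                  # ['Fever']
print(sorted(only_one))              # ['Cough', 'Test-1', 'Test-3']
print(len(only_one) / len(either))   # 0.75
```

Attributes where both people have 0 never enter either set, so they are ignored automatically.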


Summary of Results

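| Pair          | f₁₁ | f₀₁ + f₁₀ | Jaccard Coefficient |
|---------------|-----|-----------|---------------------|
| Jack and Mary | 2   | 1         | 0.33                |
| Jack and Jim  | 1   | 2         | 0.67                |
| Jim and Mary  | 1   | 3         | 0.75                |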
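For completeness, here is a sketch that recomputes the three coefficients directly from the source data. Each record is written as a six-letter string (Fever, Cough, Test-1 to Test-4); that compact encoding and the names `RECORDS`, `PAIRS` and `coefficient` are assumptions made for brevity.

```python
TO_BINARY = {"Y": 1, "P": 1, "N": 0, "A": 0}

# Fever, Cough, Test-1 .. Test-4 from the source table; Name and Gender omitted.
RECORDS = {
    "Jack": "YNPNNA",
    "Mary": "YNPAPN",
    "Jim":  "YPNNNA",
}
PAIRS = [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]


def coefficient(a, b):
    """Mismatches / (mismatches + 1-1 matches); assumes at least one 1 is present."""
    pairs = [(TO_BINARY[x], TO_BINARY[y]) for x, y in zip(a, b)]
    f11 = pairs.count((1, 1))
    mismatches = pairs.count((0, 1)) + pairs.count((1, 0))
    return mismatches / (mismatches + f11)


results = {(p, q): coefficient(RECORDS[p], RECORDS[q]) for p, q in PAIRS}
for (p, q), value in results.items():
    print(f"{p} and {q}: {value:.2f}")
# Jack and Mary: 0.33
# Jack and Jim: 0.67
# Jim and Mary: 0.75
```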

Interpretation

The highest similarity is between Jack and Mary, with a Jaccard coefficient of 0.33, the lowest of the three pairs. They share Fever and Test-1 and differ only on Test-3, the fewest mismatches among attributes where at least one individual has a value of 1.

The similarity between Jack and Jim is moderate, with a coefficient of 0.67. They differ on the Cough and Test-1 attributes while sharing the Fever characteristic.

The lowest similarity is between Jim and Mary, with a coefficient of 0.75, the highest of the three pairs. They share only Fever and differ on Cough, Test-1 and Test-3.

Note: Because the formula used here places mismatches in the numerator, lower coefficients indicate greater similarity (fewer mismatches), while higher coefficients indicate greater dissimilarity (more mismatches).
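For readers who prefer a score where higher means more similar, the complementary form puts the 1-1 matches in the numerator instead. The symbol S below is my own notation, not part of the activity; the values follow directly from the counts in the summary table.

```latex
\[
S(A, B) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} = 1 - J(A, B)
\]
\[
S(\text{Jack}, \text{Mary}) = \tfrac{2}{3} \approx 0.67, \qquad
S(\text{Jack}, \text{Jim}) = \tfrac{1}{3} \approx 0.33, \qquad
S(\text{Jim}, \text{Mary}) = \tfrac{1}{4} = 0.25
\]
```

The ranking of the pairs is the same in both forms; only the direction of the scale is reversed.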