Calculating the Jaccard Coefficient
Introduction
This portfolio activity calculates the Jaccard coefficient for three pairs of individuals using a small table of pathological test results. The three individuals are Jack, Mary and Jim, and the required comparisons are:
- Jack and Mary
- Jack and Jim
- Jim and Mary
What is the Jaccard coefficient?
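The Jaccard coefficient is a similarity measure for asymmetric binary attributes, defined as f₁₁ / (f₀₁ + f₁₀ + f₁₁). This activity reports its complementary dissimilarity form (the Jaccard distance, equal to 1 minus the similarity): the ratio of mismatches to the sum of mismatches and matches where both records have a value of 1. Under this form, a lower value means the two records are more alike.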
The formula is:
J(A, B) = (f₀₁ + f₁₀) / (f₀₁ + f₁₀ + f₁₁)
Where:
- f₁₁ = number of attributes where both records have value 1
- f₀₁ = number of attributes where record A has 0 and record B has 1
- f₁₀ = number of attributes where record A has 1 and record B has 0
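- f₀₀ (attributes where both records have value 0) does not appear in the formula, because a shared negative result is treated as uninformative for asymmetric attributes.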
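The formula above can be sketched in a few lines of Python. This is only an illustration, assuming each record is already a list of 0/1 values of equal length; the function name `jaccard_ratio` and the toy records are hypothetical and not part of the activity data.

```python
from fractions import Fraction

def jaccard_ratio(a, b):
    """Return (f01 + f10) / (f01 + f10 + f11) for two equal-length 0/1 lists."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denominator = f01 + f10 + f11
    if denominator == 0:
        # Assumption: if neither record has a 1, the ratio is treated as undefined.
        raise ValueError("ratio is undefined when neither record has a 1")
    return Fraction(f01 + f10, denominator)

# Toy records (illustrative only, not taken from the table)
print(jaccard_ratio([1, 0, 1, 0], [1, 1, 0, 0]))  # 2/3
```

Using `Fraction` keeps the result exact, so rounding to two decimals can be left to the final reporting step.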
For this activity, each person is treated as a set of attribute-value pairs, excluding the Name field. For example, Jack’s record is represented as:
- Gender = M
- Fever = Y
- Cough = N
- Test-1 = P
- Test-2 = N
- Test-3 = N
- Test-4 = A
Binary Conversion
Asymmetric variables are converted to binary:
- Y and P → 1
- N and A → 0
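Gender is not converted because it is a symmetric variable (male and female carry equal weight), whereas the formula above applies only to asymmetric binary attributes. Gender is therefore left out of the comparison and kept only as a descriptive field of each record.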
After binary conversion, each person has 6 binary attributes (excluding Name and the unconverted Gender). For example, Jack's record becomes:
- Gender = M (not converted)
- Fever = Y → 1
- Cough = N → 0
- Test-1 = P → 1
- Test-2 = N → 0
- Test-3 = N → 0
- Test-4 = A → 0
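The conversion rules above could be applied programmatically along these lines. This is a minimal sketch, assuming each record is held as a dictionary keyed by attribute name; the names `TO_BINARY` and `to_binary` are hypothetical.

```python
# Mapping taken from the conversion rules above: Y and P -> 1, N and A -> 0.
TO_BINARY = {"Y": 1, "P": 1, "N": 0, "A": 0}

def to_binary(record):
    """Convert the asymmetric attributes of a record to 0/1, leaving out Gender."""
    return {
        attribute: TO_BINARY[value]
        for attribute, value in record.items()
        if attribute != "Gender"  # symmetric attribute, not converted
    }

jack = {"Gender": "M", "Fever": "Y", "Cough": "N",
        "Test-1": "P", "Test-2": "N", "Test-3": "N", "Test-4": "A"}

print(to_binary(jack))
# {'Fever': 1, 'Cough': 0, 'Test-1': 1, 'Test-2': 0, 'Test-3': 0, 'Test-4': 0}
```

A value outside Y, P, N and A raises a `KeyError` in this sketch, which is a deliberate choice so that unexpected codes are not silently mapped to 0.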
Source Data
| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | A |
| Mary | F | Y | N | P | A | P | N |
| Jim | M | Y | P | N | N | N | A |
Calculation Method
For each pair of individuals, I converted the asymmetric variables to binary (Y and P = 1; N and A = 0), left Gender out of the conversion, and then compared the 6 binary attributes position by position.
For each pair, I calculated:
- f₁₁: Count of attributes where both individuals have value 1
- f₀₁ + f₁₀: Count of attributes where one has 1 and the other has 0
- The Jaccard coefficient using: J = (f₀₁ + f₁₀) / (f₀₁ + f₁₀ + f₁₁)
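The method above can be checked with a short script. This is a sketch rather than the procedure I followed by hand: the binary rows are my own transcription of the Source Data table under the conversion rules, and the names `BINARY`, `pair_counts` and `results` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Binary rows in the order Fever, Cough, Test-1, Test-2, Test-3, Test-4.
BINARY = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def pair_counts(a, b):
    """Return (f11, f01 + f10) for two equal-length 0/1 lists."""
    f11 = sum(x & y for x, y in zip(a, b))         # both values are 1
    mismatches = sum(x ^ y for x, y in zip(a, b))  # exactly one value is 1
    return f11, mismatches

results = {}
for first, second in combinations(BINARY, 2):
    f11, mismatches = pair_counts(BINARY[first], BINARY[second])
    results[(first, second)] = Fraction(mismatches, mismatches + f11)

for pair, value in results.items():
    print(pair, value, round(float(value), 2))
```

`combinations` visits each unordered pair once, so Jim and Mary appear under the key `("Mary", "Jim")`; the value is the same in either order because the mismatch count is symmetric.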
Pair 1: Jack and Mary
Original Data
Jack: Gender=M, Fever=Y, Cough=N, Test-1=P, Test-2=N, Test-3=N, Test-4=A
Mary: Gender=F, Fever=Y, Cough=N, Test-1=P, Test-2=A, Test-3=P, Test-4=N
Binary Conversion
Jack (binary): Gender=M (not converted), 1, 0, 1, 0, 0, 0
Mary (binary): Gender=F (not converted), 1, 0, 1, 0, 1, 0
Attribute Comparison
Comparing the 6 binary attributes (skipping Gender):
| Position | Attribute | Jack | Mary | Match Type |
|---|---|---|---|---|
| 1 | Fever | 1 | 1 | f₁₁ (both 1) |
| 2 | Cough | 0 | 0 | Ignored |
| 3 | Test-1 | 1 | 1 | f₁₁ (both 1) |
| 4 | Test-2 | 0 | 0 | Ignored |
| 5 | Test-3 | 0 | 1 | f₀₁ (mismatch) |
| 6 | Test-4 | 0 | 0 | Ignored |
Calculation
- f₁₁ = 2 (Fever, Test-1)
- f₀₁ + f₁₀ = 1 (Test-3)
Jaccard coefficient:
J(Jack, Mary) = (1) / (1 + 2) = 1/3 = 0.333
Result
The Jaccard coefficient for Jack and Mary is:
0.33
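The same result can be read in set terms, which is how the measure is often described. As a sketch, assuming each person is represented by the set of attributes whose binary value is 1 (the variable names are hypothetical):

```python
# Attributes with value 1 for each person, read from the comparison table above.
jack_ones = {"Fever", "Test-1"}
mary_ones = {"Fever", "Test-1", "Test-3"}

both = jack_ones & mary_ones      # f11: attributes that are 1 for both
only_one = jack_ones ^ mary_ones  # f01 + f10: attributes that are 1 for exactly one
either = jack_ones | mary_ones    # f01 + f10 + f11

ratio = len(only_one) / len(either)
print(sorted(both), sorted(only_one), round(ratio, 3))
# ['Fever', 'Test-1'] ['Test-3'] 0.333
```

Attributes that are 0 for both people never enter any of the three sets, which mirrors the "Ignored" rows in the table.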
Pair 2: Jack and Jim
Original Data
Jack: Gender=M, Fever=Y, Cough=N, Test-1=P, Test-2=N, Test-3=N, Test-4=A
Jim: Gender=M, Fever=Y, Cough=P, Test-1=N, Test-2=N, Test-3=N, Test-4=A
Binary Conversion
Jack (binary): Gender=M (not converted), 1, 0, 1, 0, 0, 0
Jim (binary): Gender=M (not converted), 1, 1, 0, 0, 0, 0
Attribute Comparison
Comparing the 6 binary attributes (skipping Gender):
| Position | Attribute | Jack | Jim | Match Type |
|---|---|---|---|---|
| 1 | Fever | 1 | 1 | f₁₁ (both 1) |
| 2 | Cough | 0 | 1 | f₀₁ (mismatch) |
| 3 | Test-1 | 1 | 0 | f₁₀ (mismatch) |
| 4 | Test-2 | 0 | 0 | Ignored |
| 5 | Test-3 | 0 | 0 | Ignored |
| 6 | Test-4 | 0 | 0 | Ignored |
Calculation
- f₁₁ = 1 (Fever)
- f₀₁ + f₁₀ = 2 (Cough, Test-1)
Jaccard coefficient:
J(Jack, Jim) = (2) / (2 + 1) = 2/3 = 0.667
Result
The Jaccard coefficient for Jack and Jim is:
0.67
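The Match Type column for this pair can also be generated mechanically. This is an illustrative sketch, assuming the two binary rows shown above; `LABELS` and `classify` are hypothetical names.

```python
ATTRIBUTES = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]
jack = [1, 0, 1, 0, 0, 0]
jim = [1, 1, 0, 0, 0, 0]

# Label for each (first person, second person) value combination.
LABELS = {(1, 1): "f11", (0, 1): "f01", (1, 0): "f10", (0, 0): "ignored"}

def classify(a, b):
    """Return the match-type label for each attribute position."""
    return [LABELS[(x, y)] for x, y in zip(a, b)]

labels = classify(jack, jim)
for name, label in zip(ATTRIBUTES, labels):
    print(f"{name}: {label}")

f11 = labels.count("f11")
mismatches = labels.count("f01") + labels.count("f10")
print(round(mismatches / (mismatches + f11), 3))  # 0.667
```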
Pair 3: Jim and Mary
Original Data
Jim: Gender=M, Fever=Y, Cough=P, Test-1=N, Test-2=N, Test-3=N, Test-4=A
Mary: Gender=F, Fever=Y, Cough=N, Test-1=P, Test-2=A, Test-3=P, Test-4=N
Binary Conversion
Jim (binary): Gender=M (not converted), 1, 1, 0, 0, 0, 0
Mary (binary): Gender=F (not converted), 1, 0, 1, 0, 1, 0
Attribute Comparison
Comparing the 6 binary attributes (skipping Gender):
| Position | Attribute | Jim | Mary | Match Type |
|---|---|---|---|---|
| 1 | Fever | 1 | 1 | f₁₁ (both 1) |
| 2 | Cough | 1 | 0 | f₁₀ (mismatch) |
| 3 | Test-1 | 0 | 1 | f₀₁ (mismatch) |
| 4 | Test-2 | 0 | 0 | Ignored |
| 5 | Test-3 | 0 | 1 | f₀₁ (mismatch) |
| 6 | Test-4 | 0 | 0 | Ignored |
Calculation
- f₁₁ = 1 (Fever)
- f₀₁ + f₁₀ = 3 (Cough, Test-1, Test-3)
Jaccard coefficient:
J(Jim, Mary) = (3) / (3 + 1) = 3/4 = 0.75
Result
The Jaccard coefficient for Jim and Mary is:
0.75
Summary of Results
| Pair | f₁₁ | f₀₁ + f₁₀ | Jaccard Coefficient |
|---|---|---|---|
| Jack and Mary | 2 | 1 | 0.33 |
| Jack and Jim | 1 | 2 | 0.67 |
| Jim and Mary | 1 | 3 | 0.75 |
Interpretation
Jack and Mary are the most similar pair: their value of 0.33 is the lowest of the three, reflecting a single mismatch (Test-3) against two shared positive results (Fever and Test-1).
Jack and Jim fall in between, with a value of 0.67. They share the Fever characteristic but differ on Cough and Test-1.
Jim and Mary are the least similar pair: their value of 0.75 is the highest of the three. They share only Fever and differ on Cough, Test-1, and Test-3.
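Note: The value calculated in this activity is the ratio of mismatches, so lower values indicate greater similarity (fewer mismatches) and higher values indicate greater dissimilarity (more mismatches). The equivalent similarity form, f₁₁ / (f₀₁ + f₁₀ + f₁₁), is 1 minus this value: 0.67 for Jack and Mary, 0.33 for Jack and Jim, and 0.25 for Jim and Mary.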
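The direction of the scale can be checked numerically. As a sketch, assuming the three values from the Summary of Results table written as exact fractions (the dictionary names are hypothetical), subtracting each value from 1 gives f₁₁ / (f₀₁ + f₁₀ + f₁₁), the share of positive matches:

```python
from fractions import Fraction

# Values from the Summary of Results table, kept as exact fractions.
mismatch_ratio = {
    ("Jack", "Mary"): Fraction(1, 3),
    ("Jack", "Jim"): Fraction(2, 3),
    ("Jim", "Mary"): Fraction(3, 4),
}

# Complement of each ratio: f11 / (f01 + f10 + f11).
similarity = {pair: 1 - value for pair, value in mismatch_ratio.items()}

# Rank pairs from most to least similar (smallest mismatch ratio first).
ranking = sorted(mismatch_ratio, key=mismatch_ratio.get)
for pair in ranking:
    print(pair, mismatch_ratio[pair], similarity[pair])
```

Both views give the same ordering of pairs; only the direction of the scale changes.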
