5 Conclusions
5.1 Summary of Conclusions
Before we proceed to more detailed analyses, we will provide our summarized conclusions and impactful findings.
Summarizing the answer to a key research question: We find that two-sample proportionality testing is not a good fit for analyzing the performance of a machine learning model in our use case. Drawing conclusions on model performance or bias using this method might be akin to judging the performance of a vehicle based solely upon its fuel economy (without taking into account other factors like weight, torque, horsepower, and so forth).
To have a strong conclusion, we’d expect to find a strong connection between F1/Accuracy score and the results of our proportionality testing. Generally, we do not see such a connection.
Some examples include:
Model | Test Group | p-value | F1 score | Conclusion |
---|---|---|---|---|
FairFace | 60-69 | ~1 | 0.354 | High p-value != High F1 score |
FairFace | Female, Given 'Other' | 2.58e-13 | 0.922 | Low p-value != Low F1 score |
DeepFace | 20-29, Given 'Black' | 0.10 | 0.588 | High p-value != High F1 score |
DeepFace | 40-69 | <=1.23e-5 | >0.928 | Low p-value != Low F1 score |
However, the cases in which the proportion testing produces a significant result could indicate that the training data for the facial recognition models and the data we provided to them from UTKFace have little to no overlap in terms of the features and qualities of the images. This could be a topic for further research, for example:
Are the differences a result of feature differences (lighting in the image, centering of the subject in the image) between a model’s training data and the models’ classification predictions on novel images?
Could the source population differences be a result of over or under representation of those categories in the training data for each model?
Simply put, there are differences between the data and considerations used to train each model and the UTKFace images we evaluated on each model. Absent further research, and with Accuracy and F1 scores accepted as best practice, the results of our two-sample proportion hypothesis tests can only truly tell us that there is a difference in the source populations for each of our samples.
On the examination of F1 scores, we have the following top findings per model:
Model | Findings |
---|---|
FairFace | Preference in classifying the very young and old correctly. Excellence in classifying gender given almost any other test category. Lack of excellence in racial classifications; preference in classification of Asian, White, and Black over Indian and Other. |
DeepFace | Poor performance in approx. 75% of tested categories. Preference for Male classification over Female. Strongest preference is to correctly classify White subjects. Substantially poor performance in classifying Indian and Other subjects. Failure to detect the very young and old (0-9, 70-130) and generally poor performance in every age category, but more correct predictions of age given the subject is White. |
These preferential biases can result in discrimination if used for decision-making processes:
FairFace’s poorer performance for the Indian and Other categories is concerning, especially considering that “Other” includes groups such as Middle Eastern, Latino Hispanic, and more. With low F1 scores, this could be impactful for multiple racial groups. If used to make decisions, it could generate disparate impact against Indian and Other subjects.
DeepFace’s pattern of predicting age more correctly when the subject is White could result in combined racial and age discrimination against people of color.
These models are freely available to the public. Were a developer to incorporate them into a business product to enable decision-making by stakeholders and/or customers, it would open the door to discrimination on the basis of protected classes. Accountability, well-documented findings, and disclaimers in open-source models are therefore important.
DeepFace does note some of its shortcomings in its paper, and providing these disclaimers is necessary so that users know both the capabilities and limitations of open-source systems before deciding where and how to use them.
FairFace’s improvements show progress over the last few years in open-source ML models accounting for race/age/gender gaps in prediction. That being said, there’s still further progress to be made. While it had great performance in the young and old, and for both genders, the disparity in performance for Indian and Other races is concerning.
Collectively, we should uphold standards of excellent performance, not merely okay or good performance, and those standards should be upheld for all protected classes.
5.2 Evaluation of Test Results
To evaluate our tests, we will first examine our hypothesis tests, and then move on to evaluate F1 and Accuracy scores. We theorize that our hypothesis testing, specifically in cases in which we reject the null, may tell us where bias may exist in our data. Separately, F1 and Accuracy scores may tell us specific instances where bias exists in favor of, or against, specific protected classes. Throughout this section, we will use the language “potential bias” for any scenario in which we reject the null hypothesis.
5.3 Hypothesis Testing Results
The design of our hypothesis testing provides us with cases in which data from the source population differ from those of the predicted population of each model. This may show where biases exist. When a hypothesis test produces a p-value less than 0.003 with a test power greater than or equal to 0.8, the result could be indicative of bias. Conversely, a p-value greater than or equal to 0.003 does not provide sufficient evidence to indicate bias in the given test case. The p-value alone, however, cannot tell us whether the indicated bias is in favor of, or against, the protected class group(s) in question, because the hypothesis tests only assess whether the source and predicted samples are consistent with having been drawn from the same population.
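As a minimal sketch of how one of these tests can be carried out, assuming the statsmodels library and purely illustrative counts (none of the numbers below are our actual UTKFace or model figures):

```python
# Minimal sketch of one two-sample proportion z-test with a post-hoc power check.
# The counts below are illustrative placeholders, not our actual UTKFace or model data.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical counts: how often a category (e.g. age 20-29) appears in the
# source (UTKFace) sample vs. in the model's predicted sample.
source_hits, source_n = 1200, 9000   # category count / sample size in the source data
pred_hits, pred_n = 1050, 9000       # category count / sample size in the predictions

# Two-sided test of H0: the two proportions are equal.
stat, p_value = proportions_ztest(count=[source_hits, pred_hits],
                                  nobs=[source_n, pred_n],
                                  alternative="two-sided")

# Post-hoc power of the test at our chosen significance level (alpha = 0.003).
effect = proportion_effectsize(source_hits / source_n, pred_hits / pred_n)
power = NormalIndPower().solve_power(effect_size=effect, nobs1=source_n,
                                     ratio=pred_n / source_n, alpha=0.003)

# Flag "potential bias" only when both criteria described above are met.
potential_bias = (p_value < 0.003) and (power >= 0.8)
print(f"p = {p_value:.3g}, power = {power:.2f}, potential bias: {potential_bias}")
```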
Model | Test Category | Potentially Affected Categories | Potentially Affected Sub-Categories |
---|---|---|---|
FairFace | Age | 0-29, 40-49, and 70-130 | 0-19 (given Female), 40-49 (given Male); varying potential biases against sub-groups such as 0-2 (given Asian, Indian, Other, or White), 3-9 (Given Asian), 10-19 (Other,White), 20-29 (Asian, Black), 30-39 (Asian), 40-49 (White, Other), 50-59 (Other), 70-130 (Other, White, Black) |
FairFace | Gender | No evidence supports a conclusion of potential bias for gender alone | any gender (given 0-2, 20-39, or 70-130); any gender (given White or Other) |
FairFace | Race | Black, Indian, White, Other | Black, Indian, Other (given any gender), White (given male); Asian (given 30-39) |
DeepFace | Age | All age groups | All age groups (given gender); |
DeepFace | Gender | Both genders | All genders (given Asian, Black, Indian, Other - [White didn't meet power threshold]); All genders (given age 10-69) |
DeepFace | Race | All races | All races (except Black, given Male); All races (given 10-49), Black (given 50-69), and Indian (given 60-69) |
5.4 Examination of Potential Biases using F1 Scores
When examining p-values for potential areas of bias, our hypothesis testing results did not align well with our F1 score calculations. For example, a rejection of the null hypothesis did not directly translate to a low F1 score, and the inverse was also true. We therefore proceeded to examine F1 scores separately from the p-value and power results of our hypothesis tests.
General trends for both models: many categories and sub-categories of protected classes fail to meet our selected definition of excellence (an F1 score of 0.9 or more). FairFace had more results meeting our definition of excellence than DeepFace. Both models demonstrate preferences in classification for specific age groups, races, and genders, and both appear to display biases against the Indian and Other racial categories. Examining a particular class of subjects, given additional controlling variables, reveals nested biases for and against various classes.
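To make the scoring concrete, the per-category F1 calculation behind the table below can be sketched roughly as follows, assuming scikit-learn and pandas; the column names and the tiny example frame are hypothetical stand-ins for the UTKFace labels and each model's predictions:

```python
# Sketch: per-category F1 scores for a protected class, with an optional
# controlling variable (e.g. race given gender). Column names and data are
# hypothetical stand-ins for the UTKFace labels and model predictions.
import pandas as pd
from sklearn.metrics import f1_score

EXCELLENCE = 0.9  # our working definition of an "excellent" F1 score

df = pd.DataFrame({
    "gender_true": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "race_true":   ["White", "Asian", "Other", "Indian", "White", "Black"],
    "race_pred":   ["White", "Asian", "Asian", "Other", "White", "Black"],
})

def per_class_f1(frame, true_col, pred_col):
    """One-vs-rest F1 for each label appearing in the true column."""
    labels = sorted(frame[true_col].unique())
    scores = f1_score(frame[true_col], frame[pred_col],
                      labels=labels, average=None, zero_division=0)
    return dict(zip(labels, scores))

# Race overall, then race given each gender (the "nested" view in the table).
print("Race:", per_class_f1(df, "race_true", "race_pred"))
for gender, sub in df.groupby("gender_true"):
    scores = per_class_f1(sub, "race_true", "race_pred")
    below = {race: round(score, 2) for race, score in scores.items() if score < EXCELLENCE}
    print(f"Race given {gender}: below excellence -> {below}")
```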
Model | Test Category | Impacted Categories | Impacted Sub-Categories |
---|---|---|---|
FairFace | Age | No groups meet the 0.9 threshold. Preference in classifying the young and old correctly (0-9, 20-29, and 70-130); all other categories fall below 0.7, between 0.1 and 0.4 below these top groups | Given gender, the top performing groups remain, with a preference for Female classification over Male within those groups. Given race, the very young (0-2) meet excellence given Other, Asian, or White. 10-69 fall far behind given any racial group |
FairFace | Gender | Both genders meet the standard for excellence. | Given race, all genders meet excellence, except Male given Asian at 0.89. Given age, all genders meet excellence, except Males given 0-9 and Females given 0-2 |
FairFace | Race | None reach excellence. Preference for Asian, Black, and White (Other and Indian fall 0.2-0.3 below these groups) | Given gender, the model retains its preference for classifying Asian, White, and Black. Given age, excellence is reached for White (given 3-9 or 60-69) |
DeepFace | Age | All ages perform poorly (max F1 0.51). 0-9 and 70-130 (0 detections) and 10-19, 50-69 (very few detections at F1 < 0.1) | All groups (given gender); male performance slightly surpasses female performance in every age category. Additional age groups fail detection (0 count) for 60-69, given White, Asian, or Indian. |
DeepFace | Gender | Near equal performance on both genders, male preferred over female. | No gender classifications meet excellence, given race. Preference for Male over Female for near all races (excluding Other). Preference for stronger male classification, given 40-69, and female, given 10-39. |
DeepFace | Race | None meet excellence. Bias against Indian and Other with F1 <= 0.4 | Poor for all races (given gender), with slightly lower performance for race classification given a female subject. |
5.4.1 Age
When examining the results of the F1 scores for age, no categories for DeepFace met our specification for excellence. This identifies potential points of improvement in age categorization on the part of DeepFace. As DeepFace is unable to detect faces between the ages of 0-9 and 70-130, there is a bias against very young and very old faces. Additionally, the group with the highest F1 performance is 20-29, implying a favorable bias towards subjects in early adulthood.
FairFace’s overall age calculations, absent other conditional variables, failed to produce any category that met our F1 threshold, implying lack of excellence in correct predictions for any one age group. However, the categories that did perform the best had a preferential bias towards the very young and very old faces, almost in opposition to DeepFace; FairFace displayed a preferential bias towards the ages of 0-9 and 70-130. When examining specific sub-categories, FairFace presented notable favorable bias to identify male faces between the ages of 0-2 in the White, Asian, and Other categories, as only those categories passed the F1 threshold.
5.4.2 Race
Compared to the age results, race performs considerably worse for both DeepFace and FairFace, as no racial category on its own reaches our F1 threshold. Both models show preference for certain races. In order of preference, DeepFace shows a preferential bias for classifying White, Black, and Asian faces, and FairFace shows a similar bias for classifying Asian, Black, and White faces. Indian and Other faces perform the worst overall for both models, with F1 scores lower than the preferred categories by at least 0.2 for FairFace and 0.3 for DeepFace. As such, these preferences are substantial.
In terms of race with additional control variables, DeepFace demonstrates exceedingly poor performance: no race-given-age category surpassed our F1 threshold. Overall, White faces score the highest, provided the identified faces are not 0-9. For FairFace, the only noted bias was a preference for Asian faces younger than 20, and White faces in the ranges of 0-9 and 60-130. For gender-specific biases, there are also no categories that meet or surpass our F1 threshold, but it should be emphasized that DeepFace identified male faces better than female faces for all races. FairFace had a similar performance, except for Indian faces, where male faces scored above female ones.
5.4.3 Gender
Gender shows a similar pattern as race for overall evaluation. DeepFace fails to have any category meet or exceed our F1 threshold, but male faces do show a slightly higher score than female ones. FairFace had both male and female faces score above 0.9, showing a notably positive performance, with little to no difference between males and females.
DeepFace did show preference for certain genders given age, with male faces in the 30-69 range scoring above 0.9, while only females aged 20-29 reached that level. This implies a positive bias towards identifying older male faces, as well as a bias towards younger adult women. FairFace was more balanced, with scores above our threshold for most age groups except females aged 0-2 and males aged 0-9. This showcases a negative bias against very young people in general, and particularly male children. For gender given race, DeepFace had no F1 scores meeting our threshold, but did show a positive bias towards White faces, and negative biases towards Asian faces of both genders and Black female faces. FairFace was far better in all categories, with F1 scores over 0.9 for every category except Asian male faces, and therefore shows a notable negative bias against identifying Asian male faces.
5.5 Areas for Further Research
Do the differences in source populations between a source dataset (i.e., UTKFace) and a facial recognition model indicate any of the following:
Feature differences between a model’s training data and the model’s classification predictions on novel images?
A difference in the specific features trained in each model?
A lack of overlapping features or qualities between the source and predicted datasets?
Do source population similarities between the datasets indicate any of the following:
Similar features between the model training dataset and the source dataset?
Presence of the same images between the model training dataset and source dataset?
5.6 Mathematical Support for Conclusions on Hypothesis Testing
F1 and Accuracy scores are generally accepted as best practice in evaluating the efficacy of machine learning models. From our tests, we saw contradictions between the two-sample proportion tests and the F1/Accuracy scores for each model. This is directly evident from Figure 4.3 and Figure 4.4, which show a clear lack of correlation of any type between the variables across all 432 of our hypothesis tests.
We can examine this further. An Accuracy or F1 score of 0.9 is a reasonable threshold for an “excellent” performing model. We could treat this threshold as analogous to the outcomes of a hypothesis test. If a model is performing well, we would expect there to be insufficient evidence to reject the null hypothesis (i.e., equal proportions between the source and model could not be statistically rejected). If a model is not performing well, we would expect there to be enough evidence to reject the null hypothesis in favor of the alternative hypothesis (i.e., enough evidence that the proportions between the source and model were not equal).
From that perspective, if we assume that the sample outputs’ F1 scores should correspond to rejecting the null hypothesis when below a certain threshold, and to failing to reject when at or above 0.9, we can build a confusion matrix comparing the “prediction” (reject or fail to reject the null, based on the two-sample proportion tests) against the “correct” result derived from the sample output F1 scores. We use this same threshold because it is the one we set for each model when evaluating protected classes.
Pursuing such an evaluation is an appropriate approach because the method we have leveraged for examining bias using proportionality testing is itself a model, just as classification of inputs and outputs using confusion matrices is a model. A standard method of evaluating model performance is via confusion matrices.
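A minimal sketch of that scoring, assuming scikit-learn and using placeholder arrays in place of our 432 test results, might look like the following:

```python
# Sketch: score the two-sample proportion test as if it were a binary classifier.
# "Truth" = whether the sample output's F1 score falls below the 0.9 threshold
#           (i.e. the null *should* be rejected).
# "Pred"  = whether the proportion test actually rejected the null at alpha = 0.003.
# The arrays below are hypothetical placeholders, not our real 432 test results.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

ALPHA, THRESHOLD = 0.003, 0.9

p_values = np.array([1.0, 2.58e-13, 0.10, 1.23e-5, 0.004, 0.0001])   # hypothetical
f1_scores = np.array([0.354, 0.922, 0.588, 0.93, 0.41, 0.95])        # hypothetical

should_reject = f1_scores < THRESHOLD   # "correct" decision implied by the F1 threshold
did_reject = p_values < ALPHA           # decision the proportion test actually made

print(confusion_matrix(should_reject, did_reject))
print("Accuracy:", accuracy_score(should_reject, did_reject))
print("F1 (treating 'reject null' as the positive class):",
      f1_score(should_reject, did_reject))
```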
Such matrices produce the following results when evaluating our sample outputs:
Model | Test Categorization | Accuracy | F1 | Threshold |
---|---|---|---|---|
DeepFace | Class: Reject null | 0.4539493 | 0.5400000 | 0.9 |
DeepFace | Class: Fail to reject null | 0.6257857 | 0.4255319 | 0.9 |
DeepFace | Class: Unknown | 0.3755174 | 0.0227273 | 0.9 |
FairFace | Class: Reject null | 0.5386997 | 0.4898990 | 0.9 |
FairFace | Class: Fail to reject null | 0.5223073 | 0.5174825 | 0.9 |
FairFace | Class: Unknown | 0.4559165 | NaN | 0.9 |
We also examined the Pearson correlation between the p-values and F1 scores for each model:
Model | p-values | Pearson Correlation Coefficient | Confidence Level |
---|---|---|---|
FairFace | 0.0011798 | 0.1346131 | 0.95 |
DeepFace | 0.0264197 | 0.1557421 | 0.95 |
Assuming that a correct decision to reject or fail to reject the null should be based upon F1 and Accuracy scores at multiple thresholds (0.9, 0.8, or 0.7), we see substantially low accuracy and F1 scores for two-sample proportionality tests as a model for predicting machine learning model performance. Examining any type of Pearson correlation between the p-values and F1 scores, we see similar results. This highlights the contradictions we witnessed in our results for two-sample proportion tests vs. leveraging Accuracy and F1 scores. Given these results, we find that two-sample proportionality testing is likely not a strong indicator for identifying issues and errors in machine learning models.
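For completeness, the correlation check referenced above could be reproduced along the following lines (a sketch assuming SciPy, with hypothetical arrays standing in for one model’s p-values and F1 scores):

```python
# Sketch: Pearson correlation between one model's hypothesis-test p-values and
# its F1 scores. The arrays are hypothetical placeholders, not our actual results.
import numpy as np
from scipy.stats import pearsonr

p_values = np.array([1.0, 2.58e-13, 0.10, 1.23e-5, 0.004, 0.0001])  # hypothetical
f1_scores = np.array([0.354, 0.922, 0.588, 0.93, 0.41, 0.95])       # hypothetical

r, p = pearsonr(p_values, f1_scores)
print(f"Pearson r = {r:.3f} (correlation test p-value = {p:.4g})")
```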