Annotation of Soft Biometric Attributes:

The Elephant in the Room

Soft biometrics are a type of biometric information that can be used to identify or describe individuals based on physical or behavioral characteristics that are more subjective and less discriminative than traditional biometric identifiers such as faces, fingerprints, or iris scans. They include physical characteristics such as height, weight, hair color, and facial hair, as well as behavioral traits such as gait, typing rhythm, or voice. In face-based biometric applications, the focus is frequently on demographic attributes such as age, gender, and ethnicity.

While these demographic attributes are widely used in research fields such as AI fairness and privacy, the definitions of these attributes, and the issues that may arise from them, deserve more discussion.

In this blog post, I briefly discuss how demographic attributes are labeled and why I think it is important to reconsider how they are defined.

How is data usually labeled?

In some cases, collecting and labeling data is straightforward: subjects fill out survey forms, and the labels or coding of each variable are defined beforehand, during the survey feasibility phase. Data subjects who consent to provide their data then usually create their own data entries by picking, among the labels or modalities of each variable, the correct or most accurate one. This label is known as the ground truth.

In other cases, however, including those where data is collected without the cooperation or consent of the data subjects, the data entries are usually created by a third party rather than by the data subject. This is often the case for data collected from the internet.

Data, including biometric data such as facial images, is labeled in three main ways: manual labeling, automatic labeling, and semi-supervised labeling.

  1. Manual labeling: This involves having human annotators review and label data according to specific criteria, for tasks such as sentiment analysis, topic classification, or named entity recognition. Manual labeling can be time-consuming and expensive, but it can result in highly accurate labels.
  2. Automatic labeling: This involves using machine learning algorithms to automatically label data based on patterns or features in the data. Automatic labeling can be faster and more efficient than manual labeling, but it may be less accurate or require additional human review to ensure the labels are correct.
  3. Semi-supervised labeling: This involves combining both manual and automatic labeling to improve the accuracy of labeled data. For example, a small set of data may be manually labeled, and then a machine learning algorithm can be trained to automatically label the remaining data based on the patterns observed in the manually labeled data.
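
To make the third approach more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the two-dimensional feature vectors, the placeholder labels "a" and "b", the nearest-centroid rule, and the `margin` threshold are all hypothetical and do not describe how any dataset mentioned in this post was labeled. A small manually labeled set is used to fit a very simple model, which then labels the remaining items and sends ambiguous ones back to human annotators.

```python
from statistics import mean

def train_centroids(labeled):
    """Compute one centroid (mean feature vector) per manually assigned label."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(dim) for dim in zip(*rows))
        for label, rows in by_label.items()
    }

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def auto_label(centroids, features, margin=0.25):
    """Return the label of the nearest centroid, or None when the two
    closest centroids are almost equally far away (hypothetical rule:
    such items are sent back for human review)."""
    ranked = sorted(centroids, key=lambda lbl: squared_distance(features, centroids[lbl]))
    best = ranked[0]
    if len(ranked) > 1:
        d_best = squared_distance(features, centroids[best])
        d_next = squared_distance(features, centroids[ranked[1]])
        if d_next - d_best < margin:
            return None
    return best

# Small manually labeled set (toy values, purely illustrative).
labeled = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((4.0, 4.0), "b"), ((4.0, 5.0), "b")]
centroids = train_centroids(labeled)

# Remaining items are labeled automatically; ambiguous ones are queued for review.
unlabeled = [(0.5, 0.5), (3.5, 4.5), (2.0, 2.5)]
predictions = {item: auto_label(centroids, item) for item in unlabeled}
needs_review = [item for item, label in predictions.items() if label is None]
```

In this toy run, the first two items receive an automatic label, while the third lies exactly between the two centroids and ends up in `needs_review`. Real pipelines use far richer models, but the division of work between humans and the algorithm is the same.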


In practice, when the collected data is massive, the last two methods are very often used to label information that is not already present in the data. This is the case, for instance, for facial image datasets with a large number of attributes, such as the LFWA dataset [1].

Image source: Flea Market - Analog Toys - Lego Star Wars World. Made with Canon 5d Mark III and loved analog lens, Leica Summilux-R 1.4 50mm (Year: 1983) (Free to use under the Unsplash License)

How to monitor the quality of the labels?

After human annotators have labeled the data, the quality of the annotations is usually assessed with a few metrics that measure the agreement rate.

This agreement rate refers to the inter-annotator agreement in the context of data labeling or annotation. It describes the level of agreement or consistency between multiple annotators who are tasked with labeling the same set of data. It can be measured using various metrics, such as Cohen's kappa (for two annotators) or Fleiss' kappa (for more than two), which take into account the level of agreement that would be expected by chance. A high agreement rate indicates that the annotators are consistent in their labeling decisions, while a low agreement rate indicates more variation or disagreement between them. Before publishing labeled data, it is important to verify that the agreement rate is sufficiently high, because this indicates that the labeled data is reliable and consistent, which can improve the performance of machine learning algorithms trained on it.
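
As an illustration, Cohen's kappa for two annotators can be computed in a few lines of Python. This is a minimal sketch: the function name `cohen_kappa` and the toy age-group labels are my own, hypothetical choices and are not taken from any dataset discussed here. The metric compares the observed agreement p_o with the agreement p_e expected by chance, as kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from how often each annotator uses each label.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Toy example: two annotators assign a perceived age group to ten images.
annotator_1 = ["adult"] * 5 + ["senior"] * 5
annotator_2 = ["adult"] * 4 + ["senior"] * 5 + ["adult"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 images (p_o = 0.8), but since each uses both labels half of the time, chance alone would give p_e = 0.5, so kappa is about 0.6. This shows why raw percentage agreement can look more reassuring than it should.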

However, achieving a high agreement rate can be challenging, especially when dealing with complex or subjective data, such as natural language text or images. To improve the agreement rate, it's important to provide clear labeling instructions and guidelines, train annotators on the task and criteria, and have a system in place for reconciling any differences or inconsistencies between annotators. Additionally, using multiple annotators and measuring the agreement rate can help identify and address any issues or ambiguities in the labeling task.
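
One simple way to reconcile differences between annotators is a majority vote, with unresolved items escalated to an adjudicator. The sketch below assumes exactly that; the `reconcile` function, the `min_share` threshold, and the image identifiers are hypothetical names used only for illustration.

```python
from collections import Counter

def reconcile(votes, min_share=0.5):
    """Return the majority label if its share of the votes exceeds
    min_share; otherwise return None so the item can be adjudicated."""
    if not votes:
        raise ValueError("At least one vote is required.")
    label, top = Counter(votes).most_common(1)[0]
    if top / len(votes) > min_share:
        return label
    return None

# Toy annotations: each image was labeled by several annotators.
annotations = {
    "img_001": ["smiling", "smiling", "neutral"],
    "img_002": ["smiling", "neutral"],
}
final_labels = {image: reconcile(votes) for image, votes in annotations.items()}
to_adjudicate = [image for image, label in final_labels.items() if label is None]
```

In this example, "img_001" keeps the majority label, while "img_002" has no majority and is placed in `to_adjudicate`. Keeping track of how many items end up there is itself a useful signal that the labeling guidelines are ambiguous.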

In practice, what are the inconsistencies present in labeling facial images?

In natural language processing (NLP) research, the agreement rate is an important factor in assessing the quality of labeled data. In contrast, there is often a lack of transparency about how soft biometric attributes are labeled in facial image datasets, even though perception plays an important part in how data is labeled in both fields.

For example, the CelebA dataset, which is used extensively in the face biometrics community and contains labels for 40 soft biometric attributes [1], is claimed to have been labeled by a professional company, but no information is provided about the annotators or their agreement rate. Adience, another facial image dataset, is also claimed to be manually labeled for age and sex categories, but its authors likewise do not provide the number of annotators or a measure of their agreement rate.

Additionally, there is a need to rethink how we approach the labeling of demographic attributes, as the current categories can be limited and not reflective of the diversity of human identities, especially when they are being assigned without the consent or input of the data subjects.

Taking race or ethnicity as an example, different datasets use different numbers of categories. For instance, in CelebA, ethnicity is coded as a 3-modality variable with the categories “Asian”, “Black”, and “White”, while in UTKFace [2] it is coded as a 5-modality variable with the categories “White”, “Black”, “Asian”, “Indian”, and “Others” (such as Hispanic, Latino, and Middle Eastern).

In addition, neither of the articles documenting these datasets states the standard definitions of each category that the annotators relied on, so categories with the same name may not carry the same description in both datasets. This lack of coherence exacerbates the existing problem of classifying people by ethnicity, without their knowledge or consent, based solely on perception. The same holds for highly subjective attributes such as “attractiveness” in CelebA.

Furthermore, all references to gender as a biometric attribute that I encountered in facial image datasets treat it as a binary variable with the labels “male” and “female”. This may raise issues because gender, unlike sex, is not defined as binary. Sex refers to the biological characteristics, such as reproductive organs and chromosomes, that define an individual as male or female. Gender refers to the social, psychological, cultural, and behavioral aspects of being a man, a woman, or someone with another gender identity, and it can vary across cultures and societies. While sex is typically binary (male or female), gender is a more complex and fluid concept that can encompass a range of identities beyond the binary categories.

Meanwhile, the ISO/IEC 2382-37:2022 standard for biometrics terminology [3] contains correct definitions of sex and gender, but proposes the same modalities for both, “male” and “female”, which are sex-based terms. Since annotators label facial images based on perception, it would be more accurate to specify that what is actually labeled is the perceived gender expression of the data subjects rather than their gender identity. A person could be “male” in terms of sex while identifying as a woman, and may or may not present in a way that is perceived as typically feminine. Gender expression can also be gender-neutral, which is usually ambiguous with respect to either binary category; a third, explicitly defined category should probably be used more often in practice to capture such gender expression during labeling.

The shortcomings described above make it difficult to assess the reliability of the labeled data, which can in turn negatively impact the performance of machine learning algorithms.

To summarize, labeling demographic data is a complex task that requires careful consideration and attention to detail. The way in which data is labeled can have a significant impact on the reliability and accuracy of machine learning algorithms. It is important to monitor the quality of labels, particularly in datasets related to soft biometrics such as facial images, and to take into account the subjectivity and diversity of human identities when defining demographic attributes. Ultimately, a more thoughtful and inclusive approach to labeling will lead to better outcomes for both research and society as a whole.

[1] Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3730-3738).

[2] Zhang, Z., Zhang, S.-H., & Yang, J. (2017). UTKFace dataset.

[3]: ISO/IEC 2382-37:2022, Information technology — Vocabulary — Part 37: Biometrics.

This blog post was written by Zohra Rezgui. She has been a PhD candidate at the University of Twente since 2020. Her research within PriMa aims to mitigate biometric profiling in facial images and templates.