Most existing facial datasets for machine learning have an imbalanced distribution of classes. For example, 70% of a dataset's pictures could belong to the class Male, or 80% to the class White. This imbalance propagates implicit biases into facial detection tools, to the disadvantage of the least represented classes.
This project's goal is to provide software that generates facial datasets tailored to your needs. Consider the following scenario: AUB wants to move to an ID-less campus, where students don't have to present their identification cards at the gate but are instead identified by cameras as they enter. Unfortunately, the existing datasets are not sufficient to train such a model, given that most of them consist of White females and males and depict faces from only one angle.
Here, AUB gets the assistance of our startup: we generate these datasets from existing video feeds or from a small sample of faces.
The process:
All faces are extracted from the video feeds, making sure to cover all ethnicities, different clothing styles, and different perspectives.
Because every face picture is extracted, the same face will appear in multiple pictures, taken from different perspectives and with different clothing.
This problem will be solved using DBSCAN clustering, which groups all the pictures of the same face into one cluster.
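The clustering step can be illustrated with a minimal pure-Python DBSCAN. This is only a sketch under stated assumptions: each face picture is assumed to have already been turned into a numeric embedding vector by some face-encoding model (not specified here), the 2-D vectors below are toy values, and `eps` and `min_samples` are hypothetical settings that would need tuning on real data. In practice a library implementation such as scikit-learn's `DBSCAN` would likely be used instead.

```python
import math

NOISE = -1  # label for pictures that do not fit into any cluster


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Toy embeddings (assumed values): three pictures each of two people,
# plus one picture that matches nobody.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # person A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # person B
    (10.0, 0.0),                          # unmatched picture
]
labels = dbscan(embeddings, eps=0.5, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

One reason DBSCAN fits this step is that it does not need the number of distinct people in advance, and pictures that match no one are labelled as noise rather than being forced into a cluster.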
All the extracted and clustered faces will be passed to several models that categorize them by age, gender, and ethnicity, yielding distribution statistics over our desired classes (for example, our current dataset has 20% males, 30% Black females, and so on).
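As a rough illustration of the distribution statistics, the sketch below assumes that the classification models have already produced one record per clustered face. The record layout, the field names `gender` and `ethnicity`, and the sample values are all hypothetical.

```python
from collections import Counter


def class_distribution(class_labels):
    """Map each class to its share of the dataset (a fraction between 0 and 1)."""
    counts = Counter(class_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical classifier output: one record per distinct face.
records = (
    [{"gender": "male", "ethnicity": "white"}] * 5
    + [{"gender": "male", "ethnicity": "black"}] * 2
    + [{"gender": "female", "ethnicity": "white"}] * 2
    + [{"gender": "female", "ethnicity": "black"}] * 1
)

# Distribution over a single attribute.
by_gender = class_distribution(r["gender"] for r in records)

# Distribution over combined classes such as ("female", "black").
by_joint = class_distribution((r["gender"], r["ethnicity"]) for r in records)

for cls, share in by_gender.items():
    print(f"{cls}: {share:.0%}")
for cls, share in by_joint.items():
    print(f"{cls}: {share:.0%}")
```

Counting over combined classes matters because a dataset can look balanced on each attribute separately while still under-representing a specific combination.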
Data shortages are to be expected; for example, AUB may not be able to provide us with sufficient video feeds. However, small samples will be sufficient for us to generate fake pictures of faces using GAN (generative adversarial network) models. These models will be built and tuned to generate fake pictures that augment the dataset or make up most of it.
All our datasets will guarantee an equal distribution among classes, with the aim of preventing implicit biases in our data, as our main goal is to build unbiased AI.
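One possible way to reach an equal distribution is to raise every class to the size of the largest one by generating synthetic pictures. The sketch below only does that bookkeeping; the balancing strategy, the function name `synthetic_needed`, and the counts are assumptions for illustration, and the picture generation itself (the GAN step) is not shown.

```python
def synthetic_needed(counts):
    """For each class, how many generated pictures would bring it up to
    the size of the largest class (an assumed balancing strategy)."""
    target = max(counts.values(), default=0)
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical per-class picture counts after classification.
counts = {"male": 70, "female": 30}
needed = synthetic_needed(counts)
print(needed)  # {'male': 0, 'female': 40}

# After adding the generated pictures, every class has the same size.
balanced = {cls: n + needed[cls] for cls, n in counts.items()}
print(balanced)  # {'male': 70, 'female': 70}
```

An alternative would be to downsample the larger classes instead, which avoids synthetic data but shrinks the dataset.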
Moreover, our startup will also provide support and consulting services to clients while they build their models. In this scenario, our scientists and engineers will help AUB throughout the development of its model.
In the future, we plan to expand into other kinds of data and applications (textual data, speech data, medical data, etc.).