Training Algorithms: Why AI Shouldn’t Be Left Alone

Sixty-five percent of companies in Spain risk becoming irrelevant if they do not adopt big data strategies, a sector growing at 30% per year. Machine Learning allows models to learn automatically, but it requires human supervision to avoid harmful outcomes, as Microsoft's Tay bot showed. The quality and quantity of the data, together with accurate and efficient labeling, are key to training a successful Machine Learning model.

Building a model for data analysis based on AI requires supervised training by humans to avoid biases and unintended effects.

According to the Cotec Foundation, 65% of companies risk becoming irrelevant or non-competitive if they do not adopt big data strategies, a sector that grows by 30% annually in Spain.

When tackling a project centered around data, one often overlooked stage is model construction. A model encodes systematic procedures and rules around the data in order to solve a problem; once it is built, it becomes possible to outline scenarios from whatever information is available.

In this regard, Machine Learning is a significant revolution because the use of computer algorithms allows models to learn automatically through experience. In fact, the quality and quantity of this learning are as crucial to the success of the data project as the algorithms themselves. However, this learning should not occur in complete “solitude,” and this is where I want to pause.

Machine Learning algorithms learn from data, comparing relationships, developing understanding, making decisions, and evaluating outcomes based on the data they receive, but this training necessarily requires human oversight. When I say that the model doesn’t learn alone, I am reminded of the textbook case of Microsoft’s Tay, the experimental bot designed to study interactions between computers and humans on social networks. The experiment went wrong: within a few hours, through real-time interaction with certain users, the bot turned xenophobic and racist and had to be taken down. Its training had not received sufficient human monitoring.

Coming back to the issue of training: the better the quality and quantity of the data available for learning, the better the model will perform. But even a large amount of well-structured data does not guarantee proper training. Autonomous vehicles, for example, need not just images of a street but images in which each car, pedestrian, street sign, and so on has been labeled. Sentiment analysis projects require labels that help an algorithm understand when someone is being ironic or sarcastic. Chatbots need to handle syntactic analysis, tone, and other aspects.
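To make the autonomous-vehicle example concrete, here is a minimal sketch of what a "labeled image" might look like. The file name, field names, and box coordinates are all hypothetical, not from any particular dataset: the point is that a human annotator attaches a class label and a bounding box to each object, rather than handing the model raw pixels alone.

```python
# Hypothetical annotation for one street-scene frame: each object the
# model must recognize carries a class label and a bounding box
# [x_min, y_min, x_max, y_max] supplied by a human labeler.
annotation = {
    "image": "street_0001.jpg",
    "objects": [
        {"label": "car",        "bbox": [34, 120, 210, 260]},
        {"label": "pedestrian", "bbox": [400, 90, 455, 300]},
        {"label": "stop_sign",  "bbox": [600, 40, 660, 100]},
    ],
}

def count_labels(ann):
    """Tally how many objects of each class a labeled frame contains."""
    counts = {}
    for obj in ann["objects"]:
        counts[obj["label"]] = counts.get(obj["label"], 0) + 1
    return counts

print(count_labels(annotation))
# → {'car': 1, 'pedestrian': 1, 'stop_sign': 1}
```

Real annotation formats (COCO, Pascal VOC, and others) are richer than this, but they follow the same idea: the labels, not the images alone, are what the supervised model learns from.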

Of course, more complex use cases generally require more data and training than simpler ones: the broader the range of cases the model must cover, the more examples it will need. A tool that only needs to identify foods, for example, will generally require less data than one that tries to identify any kind of object.

The next question is how to prepare the data to ensure successful model training. The best way is as simple as keeping humans in the loop who can label as much data as possible, accurately and efficiently. This supports the learning process, corrects possible deviations, and avoids the side effects of “solitary” learning.
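One common way to put humans in the loop without labeling everything by hand is to route only the model's uncertain predictions to a human reviewer. The sketch below is a toy illustration of that pattern: `model_predict`, `human_label`, and the threshold are all hypothetical stand-ins, not a real classifier or annotation tool.

```python
# Human-in-the-loop routing sketch (all names hypothetical):
# confident predictions are accepted automatically; uncertain ones
# are sent to a human annotator for review.
CONFIDENCE_THRESHOLD = 0.8

def model_predict(text):
    """Stand-in for a real classifier: returns (label, confidence).
    Toy rule: short texts are treated as 'easy', long ones as 'hard'."""
    confidence = 0.95 if len(text) < 20 else 0.5
    return "positive", confidence

def human_label(text):
    """Stand-in for a human annotator reviewing one example."""
    return "needs_review"

def route(texts):
    """Split examples into auto-accepted and human-review queues."""
    accepted, to_review = [], []
    for text in texts:
        label, confidence = model_predict(text)
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted.append((text, label))
        else:
            to_review.append((text, human_label(text)))
    return accepted, to_review
```

The design choice is the point: human effort concentrates on exactly the examples the model is unsure about, which is where solitary learning would otherwise drift.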

The data labeling process is often time-consuming. Depending on the scale of the project, proper labeling and supervision of the learning process may demand significant resources, but it is the most dependable way to create a reliable and effective Machine Learning model.

By Julio César Blanco – August 12, 2022
