The Promise and Perils of Synthetic Data
There is a huge amount of work going on at the moment in training AI systems to give more accurate and ‘human’ responses to prompts.
Thousands of technical writers worldwide – myself included – are creating prompts for AI engines to respond to, and then tweaking those prompts and responses until the AI starts to fail. At that point, the engine is told why it has failed and, from that, it learns what is right and what is wrong. The end goal is the development of AI systems that are not only knowledgeable but can also tell a true statement or a genuine fact from something they have made up that merely fits the prompt. At that point, we could construe AI to be truly intelligent.
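As a rough illustration of how that feedback loop might be wired up – a minimal sketch only, in which `model.generate`, `model.fine_tune` and the reviewer function are hypothetical stand-ins rather than any real API – the process looks something like this:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    prompt: str
    response: str
    is_acceptable: bool
    explanation: str            # the reviewer's note on why the response failed

def review_round(model, prompts, human_review):
    """Collect responses, let a human judge them, and learn from the failures."""
    corrections = []
    for prompt in prompts:
        response = model.generate(prompt)            # hypothetical generation call
        feedback = human_review(prompt, response)    # a person assesses the output
        if not feedback.is_acceptable:
            corrections.append(feedback)             # the failure becomes training signal
    if corrections:
        model.fine_tune(corrections)                 # hypothetical update step
    return corrections
```

The point is simply that a human sits in the loop: the model's mistakes, together with an explanation of why they are mistakes, are what it learns from next.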
How to Train AI
This AI training is often seen as being a safeguard since the responses are being overseen by a human, but there is a growing section of the technological community who believe that this work would be enhanced by allowing AI to train itself. They say that it is now sufficiently sentient to be able to manage itself and be trusted not to add fabricated or incorrect data to the mix.
With AI in the driving seat, gone are the bad old days of “rubbish-in/rubbish-out” computing, and we already know that AI systems are smart enough to manage themselves in some areas. Silicon chips made by the world’s major chip manufacturers are now typically designed by bespoke AI systems, with architectures so complex and off-piste that engineers don’t really understand how they work. Plainly AI is capable of doing its own thing, but designing computer chips is markedly different from curating general or technical knowledge. Should we trust AI to train itself to be better?
While AI engines might be impressive in their responses – just type a question into Google and get the AI overview – they are still just machines that handle statistical data; they simply do it in a very fast and user-friendly way. They are trained to learn and understand patterns in data and can sort their inputs into categories such as people, ideas, and places. They break a question down into its pertinent parts and then form responses to each of the major points requested. A key element of the data they learn from is the annotation attached to it.
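As a deliberately tiny illustration of that sorting step – real engines learn these associations statistically from data, whereas the hand-written lookup table below is purely hypothetical – consider:

```python
# Toy illustration of sorting the parts of a question into categories.
# The categories and vocabulary are illustrative only.

CATEGORIES = {
    "people": {"newton", "einstein"},
    "places": {"london", "paris"},
    "ideas":  {"gravity", "immigration", "politics"},
}

def tag_question(question: str) -> dict:
    """Split a question into words and bucket the ones we recognise."""
    tags = {category: [] for category in CATEGORIES}
    for word in question.lower().replace("?", "").split():
        for category, vocabulary in CATEGORIES.items():
            if word in vocabulary:
                tags[category].append(word)
    return tags

print(tag_question("Did Newton think about gravity in London?"))
# {'people': ['newton'], 'places': ['london'], 'ideas': ['gravity']}
```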
Annotations Are Key to Machine Learning
Annotations are text labels that explain what the data these systems consume actually means and how it fits within the context of the data. They act as markers, “teaching” a model to differentiate between locations, objects, and concepts. The annotations help the system identify and decide what something is by association with standard things and equipment that would normally be found in the environment.
For example, take a photo-classifying model that is presented with a large number of photographs of kitchens, each tagged with the label “kitchen”. During the course of its training, the model will start to form connections between the word “kitchen” and the typical characteristics of kitchens, such as an enclosed space that houses a refrigerator, a stove, and worktop counters.
Following the completion of training, the model should be able to recognise a kitchen that was not included in the original examples when presented with a photograph of one. Kitchens have a high degree of flexibility in their structure and contents, but – much as humans do – a well-trained model can look at a particular area within a house and decide pretty quickly whether it is a kitchen or some other room.
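A heavily simplified sketch of that idea is given below. It assumes the photographs have already been reduced to lists of annotated objects, which is an assumption made purely for illustration; real image models work directly on pixels and learn their own features.

```python
# Learn which annotated objects go with which room label, then classify
# an unseen example. The object lists stand in for real image features.

from collections import Counter, defaultdict

def train(labelled_examples):
    """Count how often each annotated object appears under each room label."""
    counts = defaultdict(Counter)
    for objects, label in labelled_examples:
        counts[label].update(objects)
    return counts

def classify(counts, objects):
    """Pick the label whose training examples best overlap the new objects."""
    def score(label):
        return sum(counts[label][obj] for obj in objects)
    return max(counts, key=score)

training_data = [
    (["fridge", "stove", "worktop"], "kitchen"),
    (["stove", "sink", "worktop"], "kitchen"),
    (["bed", "wardrobe", "lamp"], "bedroom"),
]

model = train(training_data)
print(classify(model, ["fridge", "sink", "worktop"]))  # -> kitchen
```

Even in this toy form, the labels do all the work: the model only “knows” what a kitchen is because humans attached the word to the right examples.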
As you can see, annotations are hugely important, since they form the backbone of the information that the AI assistant needs to make an informed decision. It also follows that less robust annotations can add to confusion in the AI system and lead to poor responses. It turns out that AI is intelligent, but only if it has been trained to be so.
Standard Responses? No!
The importance of both annotations and training regimes was recently highlighted by a report which showed that different AI systems will give markedly different responses to the same questions, and increasingly so when the question is open to interpretation. Ask any AI to describe the impact of Newton’s Second Law and you will get pretty much standard answers. However, asking about something which does not have a definitive answer – such as immigration, LGBTQ+ rights, or politics – can lead to an array of sometimes conflicting answers.
Another report found that a trained chatbot tended to deflect and refuse to answer questions that it deemed too controversial to answer with any real authority. While that might seem to be an essential attribute of an intelligent system, it can be pretty frustrating for anyone looking to get answers.
Speed Is an Issue
One of the biggest issues in AI training is the speed at which humans can do it, and the inefficiencies associated with human input. There are a growing number of AI systems coming online, and they all need a robust training regime. Added to that, the information pool is expanding at a growing rate, making keeping up with it all very difficult. Information means far more than a series of facts; it is also recipes, celebrity news, images, daily news, music, and just about any other snippet of data that can be absorbed, stored, and acted upon.
The sheer mass of information means that it is difficult for human curators to keep up and that is becoming a problem. AI is only any good if the information that it has is up to date, and making sure that happens is becoming more difficult.
This impasse has led to the realisation that the only credible way to train AI is to use AI itself, but that has led to disquiet in some sections of industry.
AI systems have the physical speed to ensure that the systems they are training are kept fed with up-to-date information and are able to properly annotate that data so that it makes sense and keeps to guidelines. The main problem is whether the information is correct. If we allow AI to train AI, will we be, once again, arriving at a situation where we have “rubbish-in/rubbish-out” computing?
The main concern is not that AI may become vindictive and, given free rein, start to manufacture false data, but rather that it may arrive at a similar place by erroneously extrapolating what it knows to reach false conclusions. Once that happens, the situation becomes self-feeding and the volume of untrue or distorted information grows greater and greater.
Of course, we can employ people to check the quality of the information that AI is using to train juvenile systems, but that takes even more time to rigorously carry out than to input the data in the first place. And once incorrect information has been uncovered, eradicating it from every AI system using it would be a monster task.
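One way such checking is sometimes framed – sketched below with a hypothetical `teacher.annotate` interface and illustrative thresholds – is to accept machine-generated labels only when the annotating model is confident, and to route anything uncertain, plus a random sample of the rest, to human reviewers:

```python
# Sketch of machine-assisted annotation with a human safety net.
# `teacher.annotate` and the thresholds are hypothetical stand-ins.

import random

CONFIDENCE_THRESHOLD = 0.9   # discard labels the teacher model is unsure about
SPOT_CHECK_RATE = 0.05       # fraction of confident labels still sent to a human

def build_training_set(teacher, raw_items, human_review):
    """Let an AI annotate data, but keep humans in the loop for quality."""
    accepted, needs_human = [], []
    for item in raw_items:
        label, confidence = teacher.annotate(item)   # hypothetical annotation call
        if confidence < CONFIDENCE_THRESHOLD:
            needs_human.append(item)                 # too uncertain for automatic use
        elif random.random() < SPOT_CHECK_RATE and not human_review(item, label):
            needs_human.append(item)                 # failed a random spot-check
        else:
            accepted.append((item, label))
    return accepted, needs_human
```

The human workload shrinks to spot-checks and edge cases, but it never disappears entirely, which is rather the point.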
AI is learning fast, and allowing completed systems to train new ones is an obvious way forward, but one that we need to approach with some caution to ensure that we don’t create a Frankenstein’s monster that propagates information that cannot be trusted. Time will tell on this one.
Unity Developers are the UK’s premier Game and App development team. Come and chat to us if you have a project that you would like help developing.