Machine learning - obstacles and limitations

If you believe the hype, machine learning is poised to disrupt and streamline many industries that are underpinned by data. It’s being utilized in everything from driverless cars to product recommendations and healthcare, and such solutions are massively improving the function and utility of companies that have invested time and money in exploring the possibilities. However, it’s important to remember that it is not a silver bullet. If a company is looking to implement such solutions, it must first be aware of various obstacles and limitations of machine learning, and the ways to overcome these.

Managing expectations - Don’t expect miracles

Because of the rapid progress of machine learning over the last few years, expectations of what it can achieve are often not in line with reality. It’s still a relatively immature technology that doesn’t just work ‘out of the box’. Every model is unique, and must be trained, which involves a lot of experimentation.

In addition, the process requires substantial inputs in the form of computational resources, data, and people. Even though algorithms are making the assessments, predictions, and recommendations, a human touch is essential for the initial setup, and to oversee and act on any outputs. Finally, there is no guarantee that the model will learn quickly or deliver precise predictions for complicated queries.

It’s therefore very important to have a solid reason and a well-developed strategy for implementing machine learning. To decide how much value it will add to a product, goals must be well defined, and careful consideration given to what it will take to achieve them.

For example, before any code is written, you’ll want to decide whether the problem at hand is one of classification, clustering, regression or ranking. Data collection mechanisms need to be put in place, with appropriate formatting. Data might have to be reduced or weighted using sampling or aggregation. It’s probably going to be necessary to decompose the data and rescale it.
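
As a rough illustration of the aggregation and rescaling steps mentioned above, here is a minimal Python sketch. The function names, the `purchases` records and their fields are hypothetical, invented for this example, and min-max scaling is only one of several common ways to rescale data.

```python
from statistics import mean

def aggregate_by_key(rows, key, field):
    """Reduce rows to one mean value of `field` per distinct `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return {k: mean(vs) for k, vs in groups.items()}

def min_max_rescale(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw records: one row per purchase.
purchases = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 50.0},
    {"customer": "c", "amount": 100.0},
]

# Aggregation: four purchase rows become one average per customer.
per_customer = aggregate_by_key(purchases, "customer", "amount")

# Rescaling: the averages are mapped onto a common 0-to-1 scale.
scaled = min_max_rescale(list(per_customer.values()))
```

Aggregating first shrinks the dataset, and rescaling afterwards puts features measured in different units on a comparable footing, which many algorithms rely on.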

On top of all this, once you have some output, there must be a structure and culture that is comfortable with data-based decision making. Having the data is one thing, but understanding and acting on it is quite another. This is why having a good team in place is vital.

Finding Talent - A problem of supply and demand

There is a supply and demand problem in the world of big data and AI. Data scientists and specialist programmers are required to build and understand machine learning systems, but there are only so many to go round! The willingness of tech giants like Google, Amazon or Tencent to pay astronomical salaries has meant that the cost of capturing talent has exploded. Six-figure salaries are the norm.

While simple techniques can be picked up relatively easily (even from various free online courses), complex deep learning techniques require a high degree of specialization and years of cross-disciplinary training. It’s likely that a team will need to be assembled with experience in computer science and mathematics, along with relevant domain expertise.

These human capital costs add up to make recruiting in-house a major outlay, but luckily, there are also other solutions, usually more affordable.

Computational requirements - Feeding the beast

Even the best data scientists and programmers will only be as good as the infrastructure that they are working with. This is why it’s necessary to get the right set-up for machine learning to work properly.

Large-scale data processing requires a lot of computational power, which demands a fast GPU or distributed computing. The more power at your disposal, the faster it will be to train ML algorithms, and therefore to iterate on feedback and learn from mistakes. Multiple GPUs can be useful for parallelization techniques on small neural networks, or for running multiple algorithms or experiments separately on each GPU. Speed is important in machine learning because, generally speaking, the shorter the interval between running an experiment and gathering feedback, the more quickly the results can be integrated into a coherent picture of what works.

However, even with the latest powerful GPUs, there will be times when training a model could take days or weeks. When this happens, solid plans and structured timelines for projects will keep things moving forward, even in production environments.

It’s also important to understand how often the models will need to be retrained or updated. For example, if you’re receiving new data every day, but it takes a week to retrain the model, this quickly becomes problematic as the real-time accuracy of the model is called into question. Conversations between the engineering and business teams should establish consensus on how best to use the results and work them into a product.
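
The daily-data, weekly-retraining example above can be turned into simple arithmetic. The sketch below is a back-of-the-envelope estimate only; it assumes the training snapshot is taken when a run starts and that runs happen back to back, and both function names are hypothetical.

```python
def staleness_days(retrain_days):
    """Age of the newest training data behind the live model, in days.

    Returns (best_case, worst_case), assuming the data snapshot is taken
    when a training run starts and runs follow each other without a gap.
    """
    best = retrain_days        # the moment a freshly trained model goes live
    worst = 2 * retrain_days   # just before the next model replaces it
    return best, worst

def unseen_batches(retrain_days, data_interval_days):
    """Number of new data batches that arrive during one training run."""
    return retrain_days // data_interval_days

# The example from the text: new data every day, a week to retrain.
best, worst = staleness_days(retrain_days=7)
missed = unseen_batches(retrain_days=7, data_interval_days=1)
```

Under these assumptions the live model is always between one and two weeks behind the incoming data, which is the kind of figure the engineering and business teams need to agree is acceptable.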

In addition to finding the appropriate hardware acceleration for an ML project, storage solutions that meet the data requirements need to be carefully considered, taking into account data structure, digital footprint, elasticity and security.

This will likely depend on the type of data that is subjected to ML techniques. There are increasingly complex data formats like audio, video, social media, and smart device metadata that could all be very interesting to analyse programmatically, but require different treatments. If you start thinking about connectivity, and how datasets can feed into each other, then you will need some serious computational power and storage capability!

For example, consider that billions of people leave a data trail on social media that is ripe for analysis, or that the number of connected devices is predicted to reach 75 billion by 2025. By using appropriate ML techniques to take advantage of all this data, you could gain some fascinating and actionable insights.

Data quality is vital - Put rubbish in, you’ll get rubbish out

A machine learning algorithm will only be as good as the data it is trained on. In practice, this means machine learning only makes sense if you have a lot of data to work with, as sparse, low-quality sets will not reveal much or could easily lead to misinterpretations. Cleaning the data and filtering out noise or bias is time-consuming, but nonetheless essential to ensure accurate results. Data must also be transformed into a logical format for the algorithms to consume, and for data scientists to query, summarize and visualize.
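
To make the cleaning step more concrete, here is a minimal Python sketch that drops incomplete, implausible and duplicate rows. The `readings` data, the field names and the plausible temperature range are all hypothetical, and the sketch assumes every range-checked field is also listed as required. Real pipelines usually need rules specific to the domain.

```python
def clean_rows(rows, required_fields, valid_range):
    """Drop incomplete, out-of-range and duplicate rows.

    `valid_range` maps a field name to an inclusive (low, high) pair.
    Every field named in `valid_range` should also be in `required_fields`.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # 1. Skip rows with a missing required value.
        if any(row.get(f) is None for f in required_fields):
            continue
        # 2. Skip rows whose values fall outside the plausible range.
        if any(not (lo <= row[f] <= hi) for f, (lo, hi) in valid_range.items()):
            continue
        # 3. Skip exact duplicates of a row that was already kept.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned

# Hypothetical sensor readings with typical quality problems.
readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s1", "temp_c": 21.5},   # duplicate
    {"sensor": "s2", "temp_c": None},   # missing value
    {"sensor": "s3", "temp_c": 950.0},  # implausible reading
    {"sensor": "s4", "temp_c": 19.0},
]

cleaned = clean_rows(readings, ["sensor", "temp_c"], {"temp_c": (-50.0, 60.0)})
```

In this toy case only two of the five rows survive, which hints at why sparse datasets can shrink to almost nothing once they are cleaned.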

If a supervised learning technique is being used, then it’s also necessary to have correctly labelled data. In order for the output to be accurate, the input needs to be robust. While there is an ocean of big data as a result of various collection techniques becoming ubiquitous over the last few years, it’s not necessarily adequately labelled, which can prove to be a significant obstacle.

In quantitative sets, answers can sometimes be calculated or intuited from the data itself, but generally labelled data doesn’t just occur naturally. For example, with image data, a collection of pixels that combines into a picture of a car is easily recognized by the human eye, but not by an untrained algorithm. A human will have to label the data first. Services like Remotasks are now popping up, which outsource this labour to low-income countries.

Unsupervised learning comes with certain challenges too. The techniques tend to be more complicated, as the algorithm cannot rely on answers given in the training set, but rather has to come up with its own solutions, which requires an awful lot of data and a degree of trial and error. A related approach is reinforcement learning, where the model is optimized by taking actions over multiple steps to maximize the reward (or minimize the punishment) for a particular situation.
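
As a small illustration of an algorithm finding structure without any labels, the sketch below runs k-means clustering on one-dimensional data. K-means is chosen here only as a familiar example of unsupervised learning, and the `sessions` data and starting centroids are hypothetical. The number of clusters and the starting points have to be picked by the user, which is one place where the trial and error comes in.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabelled data: session lengths in minutes.
sessions = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
centroids, clusters = k_means_1d(sessions, centroids=[0.0, 10.0])
```

Nothing in the input says which sessions are "short" or "long"; the two groups emerge from the data alone, and a person still has to decide what they mean.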

Most deep learning techniques still can’t be effectively applied to generalized problems, as they struggle with things they haven’t encountered in training and can’t transfer their solutions from one set of specific circumstances to something else. It’s therefore necessary to continually retrain the models, which often requires new datasets and resources, even when the use cases are quite similar. For example, an AI trained by machine learning on the game StarCraft can comprehensively beat the human champion, but without extensive retraining can’t play as all the alien races in the game, nor on all the maps, nor on older versions.

However, work being done using synthetic data and transfer learning techniques looks promising, and could help overcome this limitation. Various projects have demonstrated that it’s possible to repurpose models, using knowledge gained from one task to enhance a new one, without building from scratch. This allows teams to explore and experiment with lower barriers to entry.

The Black Box - Explaining the machine

One of the main concerns around algorithms’ reliability is that it’s very difficult to understand exactly how they work. Advanced neural networks that employ unsupervised learning techniques are a black box, because while the inputs, weighting criteria and outputs might be clear, the reasons why a model makes a certain decision are not.

It’s exceedingly difficult for humans to understand the hierarchical layers of data that constitute a complex model. Researchers are leveraging experimental psychology to get a handle on why algorithms perceive the world in a certain way, and how they differ from humans. Approaches to the problem are similar to how scientists try to understand animals’ senses or a child’s developing brain.

There is a natural hesitation to allow processes that we don’t fully understand to take control of software and make important decisions, like driving cars, recommending medical treatments or executing financial trades. Indeed, in certain industries like banking or insurance, regulations will limit or prohibit certain machine learning algorithms, while things like GDPR also complicate matters with requirements like a ‘right to explanation’.

It’s important to think about how ‘explainable’ the models you build are, and whether it is necessary to build in rationales for predictions.
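
For simple models, a rationale can be built in directly. The sketch below assumes a plain linear scoring model, where each feature’s contribution is just its weight multiplied by its value; the weights, feature names and applicant data are hypothetical and not taken from any real system. Deep neural networks do not decompose this neatly, which is exactly the black box problem described above.

```python
def explain_linear_prediction(weights, bias, features):
    """Return a linear model's score plus each feature's contribution to it."""
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = bias + sum(contributions.values())
    # Rank features by the size of their effect, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked

# Hypothetical credit-scoring weights, for illustration only.
weights = {"income_k": 0.5, "missed_payments": -20.0, "years_at_address": 2.0}
applicant = {"income_k": 40.0, "missed_payments": 2.0, "years_at_address": 3.0}

score, reasons = explain_linear_prediction(weights, bias=10.0, features=applicant)

for name, contribution in reasons:
    print(f"{name}: {contribution:+.1f}")
```

The ranked list gives a reviewer something concrete to point to, such as the missed payments outweighing everything else, rather than a bare score.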

Machine Learning is challenging, but worth it!

The problems outlined above are significant, but can be overcome with the correct planning and implementation. The continuing progression of the AI industry means there are lots of reasons to be positive:

  • As machine learning becomes increasingly normalized and approachable, it will be easier for companies to form judgements on what can and can’t be achieved, and what specific solutions are most appropriate.
  • Market forces, expanding toolsets and automation should help with the shortage of talent in the field.
  • Technological improvements will drive the industry forward, allowing greater processing speeds that can accelerate training periods.
  • Innovative techniques and technology will reduce the need for massive bespoke datasets.
  • Results will become easier to explain, and there will be increasing acceptance that not every decision made by an algorithm needs to be fully understood.

It’s exciting to see the field of machine learning developing rapidly, as more and more companies and universities build out the infrastructure and theory. As nearly every industry can find a use case for machine learning, various solutions will start to feed into each other, and more advanced models will be created as a result. It’s a truly disruptive technology that is set to reach into many aspects of our lives and shape our future.