In my last post but one, where I harped on about diversity, I also mentioned symbiosis. I might not have used the word, but the concept was hard to miss. This post expands upon it.
Interestingly (to me :-)), this was never my intention.
I wanted to address a very important, timely (and actually quite popular) topic: that of how horribly easy it is to stuff up the development and application of “intelligent” technology when it comes to the crunch of solving real-world problems. Crucially of course, I would also like to provide a spate of practical (concrete, nugget-like, roll-up-your-sleeves-and-click-buy-now) insights into how we can rise above that problem (as opposed to postulatory, bell-clanging, suck-on-your-pipe-and-click-like insights).
It just so happened that, while writing this, I realized that a significant tranche of this insight boils down to observing the importance of symbiosis. This time I do not mean symbiosis between organisms in the natural world; nor symbiosis between diverse algorithms in an artificial system (you can read about those here). But symbiosis between us and the technologies we have at our disposal.
Without further ado, the two fundamental problems with machine learning and predictive analytics at present:
1) THE TECHNOLOGY IS IMMATURE
Analytics and machine learning technologies (as commonly churned out and oh-so-eagerly consumed) are hugely immature. A central philosophy here at Algorithm1 is that most things which can be done by a human can, in principle, be done better by a machine. That doesn’t mean they can all be done yet, and most almost certainly can’t be done by the impressively-named technology that is being waved in front of your nose this week, regardless of how easy its buttons are to push.
2) OUR ABILITY TO USE THE TECHNOLOGY IS (BY DEFINITION) EVEN MORE IMMATURE
The smartest and most conscientious among us attempt to make up for problem 1 by being… well… smart and conscientious in their application of the available technology. Make no mistake though, being that smart and conscientious is a full-time job. Consider how infamous the field of Statistics has become for the mis-comprehension, mis-application, and sometimes outright abuse of its methods. Even expert mathematicians can get into heated arguments about the most basic laws of probability. The techniques applied in the fields of machine learning and predictive analytics are orders of magnitude more complicated than those of standard textbook statistics, and many thorny concepts such as overfitting and regularization are unique to this domain.
Unlike with motor vehicles, which serve a fairly narrow purpose and must nonetheless meet safety standards, the law does not require developers of algorithms and analytics software to make a statement (let alone apply for and pass a test) declaring the robustness, fitness-for-purpose, or safe operating parameters of their products. And unsurprisingly (however surprising that should be), the law also does not currently require those doing Data Science to hold a licence to drive them.
The key to building good models that make good predictions in real-world use lies in two inseparable ingredients:
- Powerful technology.
- Appropriate selection and application of that technology, by smart, conscientious, knowledgeable people.
2 without 1 is a non-starter. 1 without 2 is liable to result in a train wreck. Ergo, symbiosis is the solution.
THREE EXEMPLARY CHALLENGES
EXEMPLARY CHALLENGE #1
- Intelligent algorithms struggle to deal with large volumes of data, and vice-versa. For all the harping on about Big Data, your choice, presently, is:
A) Big Data, primitive intelligence; or
B) small data, high intelligence.
The state of the technology does not presently allow you to have the best of both worlds (and it may never do, for both things are proceeding at pace). Advances are being made, but be prepared to make compromises, and be prepared to be skeptical. Do not accept a proprietary solution – unless accompanied by a very good explanation – that simultaneously alleges to process the biggest data and offer the deepest analysis, at anything like the most affordable cost.
WHAT CAN YOU DO?
- Find somebody who understands and acknowledges these trade-offs.
- If the size of the data presents a problem, then you (or they) will likely want to consider using something like streaming or parallel algorithms, sampling, or both (somebody in the pipeline should be able to explain the pros and cons of the available approaches; see the sampling sketch after this list). Here your friends are technologies like MOA, Mahout, Hadoop, and their many peers and offspring.
- If the subtlety or structural complexity of the data is your primary problem, or you have addressed your data size problem by sampling, then you may well want to consider advanced machine learning techniques such as Deep Learning, semi-supervised learning, ensemble learning etc. “Big Data” technologies are of limited use to you here.
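To make the sampling option concrete, here is a minimal sketch (in Python; the stream and sample size are invented for illustration) of reservoir sampling, a classic streaming technique that keeps a uniform random sample of a stream in a single pass, without ever holding the whole dataset in memory:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each new item replaces a reservoir slot with probability k/(i+1),
            # which keeps every item seen so far equally likely to be sampled.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Illustrative usage: sample 1,000 lines from a large file without loading it all.
# with open("events.log") as f:
#     sample = reservoir_sample(f, 1000)
```

The memory footprint depends only on the sample size, never on the size of the stream: exactly the sort of compromise the trade-off above forces on you.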
EXEMPLARY CHALLENGE #2
- Machine learning algorithms work with limited kinds of data. The vast majority expect data that is numeric or categorical, and neatly filed into a rectangular spreadsheet having a fixed number of columns. Therefore, even putting the problem of “Big” Data aside, in most cases substantial data preparation is required before machine learning and analytics tools can be applied. Getting arbitrary problems into this form requires key data to be identified, re-structured, aggregated and so on – a process broadly known as feature engineering. This data preparation is invariably a manual and often an exploratory task, usually with no right answer, and with a lot of opportunity for poor decisions to be made. Increasingly, algorithms are being developed which can tackle this part of the problem, but there is a long way to go.
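To make “feature engineering” less abstract, here is a toy sketch using pandas; the event log, column names and aggregations are all invented for illustration. It turns each customer’s variable-length transaction history into one fixed-width row of the kind of rectangular table most algorithms demand:

```python
import pandas as pd

# Hypothetical raw event log: one row per customer transaction.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [9.99, 24.50, 5.00, 5.00, 80.00, 12.00],
    "channel":     ["web", "store", "web", "web", "store", "web"],
})

# Feature engineering: aggregate the variable-length history of each customer
# into one fixed-width row of numeric features.
features = events.groupby("customer_id").agg(
    n_transactions=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    web_share=("channel", lambda s: (s == "web").mean()),
)
print(features)
```

Note how many judgement calls are packed into even this tiny example: which aggregations, over which columns, at which granularity. That is where the “no right answer” problem bites.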
WHAT CAN YOU DO?
- Solving these messy problems is the art of Data Science. Make sure whoever is addressing your problem is qualified to call themselves a Data Scientist or a Data Miner. Database experts, statisticians, mathematicians and programmers all have invaluable skills, and if you can afford their dedicated services you will be in a better position, but none of these are sufficient in isolation. More than any software package, a good Data Scientist is the Swiss Army Knife of analytics.
- If the data you wish to analyse consists of anything more than numbers and categories (for example text, images or audio), then you need some combination of special tech and special expertise. The fields of Natural Language Processing, Signal Processing and Image Recognition are vibrant research fields for a reason. Do not accept the services of an individual, a company or a technical solution that doesn’t have something to say about these problems specifically. Moreover, depending on what your text and images represent, and what kind of information you need to extract from them, different approaches will apply: so beware of solutions whose claims go only as far as “handling text” or “doing NLP”.
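As one deliberately simple example of such special tech, here is a hedged sketch using scikit-learn’s TfidfVectorizer, a bag-of-words representation that turns free text into the numeric columns downstream algorithms expect (the documents are invented; this is a common starting point, not the right answer for every text problem, which is rather the point of the warning above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents; in reality these might be support tickets, reviews, etc.
docs = [
    "the delivery arrived late and damaged",
    "great product, fast delivery",
    "refund requested, product damaged",
]

# TF-IDF bag-of-words: each document becomes a row, each word a numeric column.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                                 # (3 documents, vocabulary-sized columns)
print(vectorizer.get_feature_names_out()[:5])  # the words behind the first columns
```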
EXEMPLARY CHALLENGE #3
- While more sophisticated in many ways, machine learning solutions lag very far behind textbook statistics in the guarantees they can make about their predictions, or in what they can do when there is uncertainty in the data that is fed to them. Statisticians have long since acknowledged that numbers and outcomes can have errors or uncertainties attached. They have long since figured out that these things can be precisely quantified, and that they really should be before making any kind of deduction or choosing any “best” course of action. They have furthermore long since established that in order for these various things to hold true, certain assumptions – from a now quite well documented list of assumptions – must be declared about a given problem; and they know that if those assumptions are inappropriate, or are broken, then any outcomes (along with any cunningly calculated certainties attached to them) cannot be trusted for toffee. People may have been incorrectly imprisoned for smaller mistakes.
Now consider that the vast majority of machine learning algorithms have none of this due-diligence paraphernalia, and that we gladly plug them together like Lego.
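A toy sketch makes the point. Below, a perfectly standard model (the data and the choice of model are invented for illustration) is trained on one regime of data and then queried on inputs from a regime it has never seen. It answers with near-total confidence, and nothing in the API so much as hints that its working assumptions no longer hold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train on data from one regime: two features centred near zero.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# "Production" inputs drawn from a regime the model has never seen.
X_drifted = rng.normal(loc=5.0, scale=1.0, size=(3, 2))

# The model emits near-certain probabilities for these points, yet they are
# pure extrapolation: nothing checks, or even records, the assumptions under
# which those numbers would be trustworthy.
print(model.predict_proba(X_drifted).round(3))
```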
WHAT CAN YOU DO?
Depending on how hands-on you want to get:
- Stay away from autonomous drones.
- Make sure that your Data Science team includes a statistician.
- Use appropriate tools: don’t use Naive Bayes when input features are not independent; don’t use probabilistic (e.g. any Bayesian) methods when the training data is not guaranteed to be both randomly sampled and representative of the conditions in which the system will be used; instead favour statistically robust methods like SVM. Don’t use things because they “seem to work”: understand the appropriateness of the techniques. Where that is not possible (because nobody saw fit to put the information on the tin) – and in all cases regardless – run bullet-proof evaluations (rigorous cross-validation should be the starting point; thereafter throw some random noise in; dig out or concoct some unusual outlying cases – see the evaluation sketch after this list).
- Don’t trust “accuracy” figures: in 90% of cases they are meaningless (see what I did there?). Look at precision and recall per class, balanced accuracy, and AUC (a sketch of these also follows this list). What are the baselines? What is the actual cost to your organization of a false negative versus a false positive? For regression models, look at the error profile across the value range, not just a measure of average error like RMSE. Consider what you are actually trying to optimize for, and measure your success accordingly: do you actually want to predict which way share prices are going to move, or do you really want to know which shares represent the best trades?
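On the evaluation point above, here is a minimal sketch (Python with scikit-learn; the synthetic dataset and noise level are invented for illustration) of that starting point: rigorous cross-validation, followed by injecting some random noise to see how gracefully performance degrades:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Illustrative data; swap in your own features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = SVC()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: rigorous (stratified, shuffled) cross-validation.
print("clean CV scores:", cross_val_score(model, X, y, cv=cv))

# Stress test: inject random noise into the features and watch how
# gracefully (or not) performance degrades.
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.5, size=X.shape)
print("noisy CV scores:", cross_val_score(model, X_noisy, y, cv=cv))
```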
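And on the accuracy point, this sketch (again with invented numbers) shows how a do-nothing model scores 90% “accuracy” on imbalanced data, while per-class precision and recall, balanced accuracy, and AUC tell the real story:

```python
from sklearn.metrics import (balanced_accuracy_score,
                             classification_report, roc_auc_score)

# Hypothetical imbalanced results: 90% of cases are negative, so a model
# that says "negative" every time scores 90% "accuracy" while being useless.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100           # the do-nothing model's predictions
y_score = [0.1] * 100        # its (uninformative) confidence scores

print(classification_report(y_true, y_pred, zero_division=0))          # per-class precision/recall
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))   # 0.5, not 0.9
print("AUC:", roc_auc_score(y_true, y_score))                          # 0.5: no discrimination at all
```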