How Does a GPU Database Play in Your Machine Learning Stack?
Machine learning (ML) has become one of the hottest areas in data, with computational systems now able to learn patterns in data and act on that information. The applications are wide-ranging, from autonomous robots and image recognition to drug discovery and fraud detection.
At the cutting edge is deep learning, which draws its inspiration from the networks of neurons that comprise the cerebral cortex. These networks are massively parallel. As such, it’s no surprise that an increasing number of ML approaches are turning to graphics processing units (GPUs), a key hardware component for general-purpose parallel computation.
Kinetica has been leveraging GPUs for massively parallel data analysis since 2012. As an in-memory analytical database, Kinetica is able to utilize multiple GPUs across many nodes to perform massively parallel statistical and analytical queries. Users can also apply custom code for analytical processing by leveraging user-defined functions, allowing Kinetica to integrate with a growing number of GPU-accelerated ML libraries, such as TensorFlow, Caffe, Torch, and BIDMach.
But this raises the question: if your ML library is already leveraging GPUs, what does Kinetica add to the ML stack?
Data, Tightly Coupled to the Model
Kinetica is tried and tested in large-scale enterprises, with production clusters deployed over dozens of nodes. At this scale most ML models are trained on subsets of the raw data, and most do not actually retain this raw data. Instead, they use the raw data to learn a state (e.g., the strengths of various network connections) before disposing of it—or siloing it in a data warehouse, never to be seen again.
With Kinetica, data can be stored in memory and rapidly accessed by the ML model as needed. One key advantage of having the data this closely integrated is that the user can always go back and refit their model as necessary.
Consider an example using time series data. By learning the data in two passes, first forwards in real time and then again backwards over the stored history, you will generally achieve a better overall fit to the entire dataset (this is the difference between Kalman smoothing and Kalman filtering).
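To make the two-pass idea concrete, here is a minimal sketch in Python with NumPy (not Kinetica-specific code): a one-dimensional Kalman filter followed by a Rauch-Tung-Striebel smoothing pass over the stored estimates. The model, noise parameters, and data are all invented for illustration.

```python
import numpy as np

def kalman_filter(y, q=1e-2, r=1.0):
    """Forward pass: each estimate uses only past observations."""
    n = len(y)
    x, p = np.zeros(n), np.zeros(n)            # filtered means and variances
    x_preds, p_preds = np.zeros(n), np.zeros(n)  # one-step predictions
    x_pred, p_pred = 0.0, 1.0                   # prior before any data
    for t in range(n):
        x_preds[t], p_preds[t] = x_pred, p_pred
        k = p_pred / (p_pred + r)               # Kalman gain
        x[t] = x_pred + k * (y[t] - x_pred)
        p[t] = (1 - k) * p_pred
        x_pred, p_pred = x[t], p[t] + q         # random-walk prediction
    return x, p, x_preds, p_preds

def rts_smoother(x, p, x_preds, p_preds):
    """Backward pass: each estimate now uses the entire series."""
    n = len(x)
    xs, ps = x.copy(), p.copy()
    for t in range(n - 2, -1, -1):
        c = p[t] / p_preds[t + 1]
        xs[t] = x[t] + c * (xs[t + 1] - x_preds[t + 1])
        ps[t] = p[t] + c**2 * (ps[t + 1] - p_preds[t + 1])
    return xs, ps

# Noisy observations of a slowly drifting signal
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(0, 0.1, 200))
obs = truth + rng.normal(0, 1.0, 200)

x, p, xp, pp = kalman_filter(obs)
xs, _ = rts_smoother(x, p, xp, pp)
print("filter RMSE:  ", np.sqrt(np.mean((x - truth) ** 2)))
print("smoother RMSE:", np.sqrt(np.mean((xs - truth) ** 2)))
```

On data like this, the backward pass typically cuts the error of the forward-only estimates noticeably, which is exactly why keeping the raw data close to the model pays off.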
To return to the neuroscience analogy, there is a close parallel in the wake-sleep cycle of animals. The networks of the brain are thought to learn online throughout the course of the day, but require a period of sleep in which these models are refit to stored memories, most famously in the auto-associative networks of the hippocampus.
Feature Selection
Theorists in machine learning have long been aware of the No Free Lunch Theorem. Simply put, there is no magic algorithm that can perform any better than any other in general — that is, when averaged over all conceivable inputs. What this means is that ML models can only succeed to the extent they are well-constructed for the problem at hand. A model that has been developed for image recognition is unlikely to do well when applied to credit card fraud.
This is true even of deep learning. It is often asserted that deep learning is a fundamentally new innovation that solves the feature selection problem: the claim goes that deep learning learns features directly from raw data, obviating the need for manual feature selection. Unfortunately, there is no getting around the No Free Lunch Theorem.
Let’s again consider the cerebral cortex. It is certainly true that the cortex is capable of selecting and refining features via feedback, such as in the early visual cortex. But note that before visual information even arrives in the cortex, it has been extensively filtered, for example by the complex circuitry of the human retina. And most of this is fairly hard-wired: if the rules of physics suddenly changed, your eyes would probably not be of much use.
What this means for ML is that models can benefit enormously from incorporating domain expertise and the discovered insights of data scientists.
Here Kinetica is an invaluable addition to your machine learning stack. Kinetica will leverage GPUs to select user-defined features from the raw data to pass to your ML model. Further, Kinetica does not depend on intricate indexing and schema design, allowing the user to explore what combination of features improves their model’s performance. Feature selection can be performed by the data scientist, not preprogrammed by a database architect.
These features can take the form of statistical summaries, aggregations, and filters (i.e., standard database operations). Kinetica also allows the user to apply arbitrary functions to the raw data, including GPU programming via CUDA.
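As a rough illustration of this kind of feature selection, consider the following sketch in Python with pandas. The table, columns, and thresholds are all invented; in Kinetica, the equivalent work would run against in-memory tables via SQL or a UDF rather than a local DataFrame.

```python
import pandas as pd

# Hypothetical raw transactions table; in Kinetica this would be an
# in-memory table queried via SQL or a UDF rather than a DataFrame.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [25.0, 310.0, 12.5, 90.0, 47.0, 800.0],
    "hour": [9, 23, 14, 15, 2, 11],
})

# Filter: isolate late-night activity before computing features.
night = tx[tx["hour"].between(0, 5) | (tx["hour"] >= 22)]

# Aggregations / statistical summaries as model-ready features.
features = tx.groupby("customer_id")["amount"].agg(
    total_spend="sum", mean_spend="mean", spend_std="std"
)
features["night_tx_count"] = night.groupby("customer_id").size()
features["night_tx_count"] = features["night_tx_count"].fillna(0)

print(features)  # one feature row per customer, ready for the ML model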
Let’s look at a couple examples of feature selection:
Dimensionality Reduction
Most often, feature selection takes the form of dimensionality reduction. Dimensions are simply the inputs to your model, and for any problem of moderate complexity, throwing all of your data at your model is generally a bad idea.
This owes to what is known as the curse of dimensionality: simply put, the more inputs given to a model, the longer that model will take to train. And lower training time often directly leads to improved performance, because the faster you can train and evaluate a model, the more models you can test and improve.
A corollary of this is that a large, complex model (or brain) takes significantly longer to do anything valuable than a less complex model (or brain).
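Here is a hedged sketch of the payoff, in Python with NumPy and scikit-learn: projecting a wide synthetic dataset down to a handful of dimensions with PCA before fitting a model. The data and dimension counts are invented for illustration.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic data: 500 columns, of which only a few carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 500))
X[:, :5] *= 3.0  # a handful of informative, high-variance columns
y = (X[:, :5].sum(axis=1) > 0).astype(int)

for n_dims in (500, 20):
    # PCA keeps the high-variance (informative) directions.
    Xr = X if n_dims == 500 else PCA(n_components=n_dims).fit_transform(X)
    start = time.perf_counter()
    model = LogisticRegression(max_iter=1000).fit(Xr, y)
    elapsed = time.perf_counter() - start
    acc = model.score(Xr, y)  # in-sample accuracy, for simplicity
    print(f"{n_dims:>3} dims: trained in {elapsed:.2f}s, accuracy {acc:.3f}")
```

The reduced model trains in a fraction of the time with little or no loss in accuracy, which is what lets a data scientist iterate through many more candidate models.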
Dimensionality Expansion
Feature selection can also be used for dimensionality expansion. Suppose you had only three dimensions (e.g., customer age, gender, location). It may then prove useful to expand all of your customers into a combinatorial number of categories (e.g., 20-something male in NY) and to train your model on that basis.
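Here is a minimal sketch of that combinatorial expansion in Python with pandas, using invented column names: the three raw dimensions are crossed into a single category and then expanded into one indicator column per combination.

```python
import pandas as pd

# Hypothetical customer table with three raw dimensions.
customers = pd.DataFrame({
    "age": [23, 57, 31, 24],
    "gender": ["M", "F", "F", "M"],
    "state": ["NY", "CA", "NY", "TX"],
})

# Cross the raw dimensions into a single combined category,
# e.g. "20s/M/NY", then expand into one indicator column each.
decade = (customers["age"] // 10 * 10).astype(str) + "s"
combo = decade + "/" + customers["gender"] + "/" + customers["state"]
expanded = pd.get_dummies(combo)

print(expanded)  # one column per (age-decade, gender, state) combination
```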
This kind of expansion is what happens in the cerebellum, a highly regular network that actually contains around 80% of the neurons in the human brain. The cerebellum is thought to fit a very simple model to each of a great many inputs, in contrast to the cortex, which fits a very complex model to fewer inputs.
Whether reducing or expanding dimensionality, both of these operations can be performed in Kinetica using UDFs. And in the case of expansion, all of this massive data can be stored in Kinetica, so that it doesn’t have to be recomputed every time the model is used. This brings us to the next major benefit of integrating Kinetica.
Storing ML Output
All ML models can be seen as generating predictions (or fits). Each prediction is then compared to the observed data, and if the prediction is wrong (or even if it is right), the model’s parameters are updated.
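That loop of predict, compare, and update is the heart of nearly every training algorithm. As a minimal, library-agnostic illustration (the data and learning rate are invented), here it is as stochastic gradient descent on a one-parameter linear model:

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.normal(size=1000)
ys = 3.0 * xs + rng.normal(0, 0.1, 1000)  # true slope is 3.0

w, lr = 0.0, 0.05  # model parameter and learning rate
for x, y in zip(xs, ys):
    pred = w * x    # generate a prediction
    err = pred - y  # compare to the observed value
    w -= lr * err * x  # update the parameter
print(f"learned slope: {w:.3f}")
```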
The following are a few reasons why you might want to hold on to these predictive outputs somewhere you can easily access them:
Improving your model. As mentioned above, you can generally achieve better performance if you go back and refit your model periodically. This process is much easier if model outputs and errors are retained: your data scientists may want to go back and examine how various models performed, and on which subclasses of problems. For this, Kinetica (and its UDFs) is an excellent fit; a generic sketch of this logging pattern follows this list.
Unknown or delayed input. What if your prediction isn’t for something that will happen immediately, but rather something that may happen in months, or possibly never? These predictions need to be stored somewhere, and preferably somewhere reliable.
Uncertain input and data correction. Most input is not known exactly. In fact, the way we make observations in our brain is essentially Bayesian: that is, we don’t simply accept sensory observations at face value, but rather we combine them with our prior expectations to form our beliefs. Predictions, or priors, can thus be important in data interpretation.
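Here is the logging pattern referenced above, sketched in Python against SQLite purely as a generic stand-in; with Kinetica, the same inserts and audit queries would target an in-memory table instead. The table and column names are invented.

```python
import sqlite3
import time

# Generic stand-in for a database of model outputs; with Kinetica the
# same pattern would target an in-memory table instead of SQLite.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE predictions (
    model_version TEXT, input_id INTEGER,
    predicted REAL, observed REAL, created_at REAL)""")

def log_prediction(model_version, input_id, predicted, observed=None):
    """Store every prediction; 'observed' may arrive months later or never."""
    db.execute("INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
               (model_version, input_id, predicted, observed, time.time()))

log_prediction("v1", 42, predicted=0.87)                # outcome not yet known
log_prediction("v1", 43, predicted=0.12, observed=0.0)  # outcome known

# Later: audit how a model version performed on resolved cases.
rows = db.execute("""SELECT AVG(ABS(predicted - observed)) FROM predictions
                     WHERE model_version = 'v1' AND observed IS NOT NULL""")
print("mean abs error:", rows.fetchone()[0])
```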
Faster Training -> Higher Accuracy
Creating a strongly predictive ML model requires a great deal of iteration and testing — rarely do we know which features will be valuable ahead of time. And often those features we assume to be valuable turn out to be redundant and unnecessary.
The quicker your data scientists can train a model, the quicker they can refine and improve it, and the more likely it is to be accurate by the time you need it.
Compared to CPU-based systems, Kinetica proves to be 10 to 100 times faster on the vector and matrix operations that comprise ML algorithms. This means much faster training, significantly reducing the time required to meet the needs of the enterprise.
So there you have it. In upcoming blog posts we’ll be sharing technical demonstrations and examples of how Kinetica is being used to augment machine learning. But if you’d like to get an early preview, contact us, and we’ll be happy to give you a demo.