
Start Getting Your Feet Wet in Open Source Machine and Deep Learning

 

 

At H2O.ai we see a world where all software will incorporate AI, and we’re focused on bringing AI to business through software. H2O.ai is the maker behind H2O, the leading open source machine and deep learning platform for smarter applications and data products. H2O operationalizes data science by developing and deploying algorithms and models for R, Python and the Sparkling Water API for Spark.

 

 


Speakers:

  • Desmond Chan, Senior Director, Marketing, H2O.ai

  • Ian Gomez, Community Manager, H2O.ai

  • Amy Wang, Data Scientist / Sales Engineer, H2O.ai

Read the Full Transcript

Ian Gomez:

 

Great, we'll get started. Hi everybody, my name is Ian Gomez, Community Manager at H2O. Thanks for joining our Start Getting Your Feet Wet in Open Source Machine and Deep Learning webinar. Today we have two great speakers for our webinar: Desmond Chan, our Senior Director of Marketing at H2O, and Amy Wang, our Sales Engineer and Data Scientist at H2O. Here are some logistics regarding the webinar. Please submit your questions throughout the session via the questions tab in your console. This webinar is also being recorded. A copy of the webinar and the slides will be available after the presentation is over. Without further ado, I'll hand it over to Desmond.

 

Desmond Chan:

 

Perfect. Thank you so much, Ian. Again, my name is Desmond Chan. I'm a Senior Director of Marketing here at H2O. Welcome to today's webinar. This is the first of a series of H2O webinars, and today's webinar is meant to cover introductory material. If you are an advanced user of H2O, this may not be for you. But later webinars in this series will cover more advanced techniques in H2O and some other use cases.

 

What H2O is All About?

 

Here's the rundown for today's introductory webinar. I will go through our H2O company introduction for just a few minutes. Then Amy Wang, our speaker, will go into the H2O introduction and demo. She will go into detail about how to install H2O and a use case for flight delay prediction. She'll go over the use case, the data set, how to do data munging, and how to create a model in H2O. We'll leave some time at the end for Q&A. Today we'll go over the AI platform at H2O, the core platform, which is the product on the far left of the screen. Our core platform is an in-memory distributed machine learning platform with visual intelligence. We are very much emphasizing our visual intelligence for analytics.

 

In addition to our core platform, we also have Deep Water, which is our integration of all the popular deep learning frameworks, including TensorFlow, MXNet, and Caffe. Also, we have Sparkling Water; Sparkling Water is our H2O integration with Spark. Last but not least, we also have Steam, which is our software for operationalizing data science in your enterprise infrastructure. You can compare models and deploy models from within Steam.

 

I mentioned that we are open source, and this nature of ours actually drives a lot of community adoption. As you can see from the screen, our latest statistics show that more than 9,000 organizations and more than 83,000 individuals are using H2O. This is really tremendous for our company. We are well recognized by the press. Most recently, we were recognized as a Visionary in the Gartner Magic Quadrant and also as a Strong Performer in the Forrester Wave. In the last year, we got 46 press mentions in popular press such as the Wall Street Journal, TechCrunch, and so on. Out of the Fortune 500 companies, 169 of them really love us. They are using us on a regular basis, and you can see that banks, telecom companies, and health insurance companies are using us as well.

 

I want to give you some anecdotes on why we were chosen for the Gartner MQ, the Magic Quadrant, and also the Forrester Wave. For example, in the Gartner Magic Quadrant, you can see that we have very high regard from our customers. Gartner also thinks that we are especially suited to IoT edge and device scenarios. Forrester, again, recognizes that we have significant adoption by large enterprises. When big data was just on the rise, H2O was best known for developing open source, cluster-distributed machine learning algorithms, and at that time no one else had them. Also, Forbes recognized H2O as one of the top 10 AI technologies alongside other companies in the ecosystem (not competitors, really) such as Nvidia, Microsoft, Google, and so on.

 

If you're interested, you can actually go on our website, www.H2O.ai, to find more video talks from our customers from different verticals, such as auto insurance, commercial insurance, financial services, and retail, and so on. There are many of those videos on our website. If you are so inclined to know about our use cases, please do so. Again just as a reminder, we will have more webinars in the future about use cases in different verticals. That said, I want to hand it over to our speaker, Amy Wang, and she'll go into the technical details.

 

What is H2O?

 

Amy Wang:

 

Hi guys, this is Amy from H2O. As mentioned before, I'm a Sales Engineer at H2O. My job has primarily been about installing H2O in big data environments as well as working through data science problems with customers. Please send in questions as I talk, and we'll try to see how many we can get to. First, we're going to go over the overview of what H2O is and some of the features available out of the box with H2O core. Then I'll go into an airlines demo showing you some of the features on a multi-node H2O cluster.

 

H2O is an open source, in-memory computation engine, essentially. The entire platform is built to fit into a JAR file that you can launch as a JVM, and you can launch many JVMs so that you have a distributed platform. In each individual node, the computation is also parallelized. On top of that platform, we built a few machine learning algorithms, including GLM, Random Forest, and GBM. Deep Learning is one of the open source neural network algorithms out there that's free and easy to use from the H2O platform, so please try that. On top of that, we have an API built in, the REST API, on which we built the R and Python client interfaces.

 

Instead of having to write Java code, which is what core H2O is written in, you can write R and Python code communicating what you want to do to the H2O cluster. I'll show you the R client in the demo later. One of the benefits of using H2O, in particular, is that you're able to use all your data, as long as you can scale out the resources that you have in your environment. You can fit more and more data into that environment. It really scales as your data grows or as you have more use cases. It's faster to iterate through all those different models and all the different techniques and get to the best model, with the best predictions and the fastest predictions in production as well.

 

H2O Algorithms

 

These are a few of the algorithms I just mentioned. Basically, supervised learning is when you know the answers to your questions and you know the target that you want to predict, be that binary, such as whether a flight is delayed or not, or a regression, such as how many flights we should have in one day to meet demand. These are the types of questions that you can answer with models like GLM. Naive Bayes is used for binary text classification. For tree-based models, we have Distributed Random Forest as well as Gradient Boosting Machine. For neural networks, as I mentioned before, we have Deep Learning.

 

Then for the unsupervised learning methods, this is typically clustering, doing dimensionality reduction on a dataset, or detecting anomalies. These are cases where you don't have a target or response column to predict. You just want to gain insight on the overall dataset. For clustering, we have K-means, and for dimensionality reduction, we have PCA, principal component analysis. We also have something that's not really available in any other platform called Generalized Low Rank Models, or GLRM for short. This is a general convex optimization method that can be used to do clustering and dimensionality reduction. You can use archetypes in GLRM to partition your data set into something that's similar to the K-means method. You can also use them to reduce the dimensions of your data set. Then finally, we have Autoencoders in Deep Learning, which is a simple one-layer network that basically tries to replicate the data onto itself. It's a nonlinear transformation back on itself, and it can be used to do anomaly detection.

 

Accuracy with Speed and Scale

 

Here's an architecture diagram to look over our stack. We're able to sit on top of any sort of commodity hardware, via Hadoop or on any regular machines where we launch as JVMs. If you launch on top of Hadoop, for example, we have built-in readers to read in data from HDFS. If you launch on EC2 instances or you plug it into S3, we can read in data from S3 buckets. We also have a JDBC connector that will allow you to query simple SQL tables into the in-memory H2O framework. This is what you'll see is the core of H2O. We have an in-memory framework. Once you read data into H2O, everything sits completely inside of H2O. All the work is done inside H2O.

 

We have a MapReduce framework that's not Hadoop MapReduce, but our own flavor of MapReduce that's written in Java and is extremely fast. When you read in data, the data itself is also stored in a compressed columnar format, so it's expressed in a way that saves quite a lot of memory in the cluster. You have more memory to do your work in. It's our special format for expanding or contracting those data formats. Then once you have the data in H2O, you can do a variety of things: feature engineering, munging the data. You can do simple things like unary or binary operations on a column in an H2O frame. Or you can do feature engineering. You can apply something like Deep Learning with deep features, which will extract features out of the neural network. There are quite a lot of different methods that you can use in H2O.
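
As a rough illustration of the map and reduce steps described above, here is a minimal Python sketch that computes a column mean from per-chunk partial results. It is not H2O's Java implementation; the function names, the chunk size, and the sample delay values are all assumptions made for the example.

```python
from functools import reduce


def split_into_chunks(column, chunk_size):
    """Split one column of values into fixed-size chunks."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]


def map_chunk(chunk):
    """Map step: each chunk reports a partial (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial results into one."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(column, chunk_size=4):
    """Mean of a non-empty column, computed only from per-chunk partials."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(column, chunk_size)]
    total, count = reduce(combine, partials, (0, 0))
    return total / count


# Made-up delay values in minutes, for illustration only.
delays = [5, 0, 12, 30, 0, 45, 7, 0, 3, 18]
print(distributed_mean(delays))
```

Because each chunk only reports a small partial result, the map step can run wherever the chunk lives and only the partials need to be combined.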

 

Then finally, there are the traditional machine learning algorithms you can build on top of the data that you read in. All of this you can do, again, as I mentioned before, via the REST API. We built the R package really early in H2O's infancy so that our users, who are primarily data scientists, could give commands to H2O using R syntax that they're familiar with, without having to write Java code. I'll show you a diagram of that in a little bit.

 

Of course, we added Python to that stack about a year or two ago as well. We have a Python module that you can just pip install, which is really easy to do. Then, of course, we have H2O Flow, which is our web-based UI. It comes out of the box with H2O, and you can point and click through and manipulate H2O in a non-programmatic fashion. I'll show you a little bit of that as well. Then of course, there are Java and Scala, which are the native languages that we plug into. While you're writing this code on the front end, what's being output in the backend is Java code. It's a Plain Old Java Object (POJO), which is our scoring engine. This scoring engine can be taken completely outside of H2O into any production environment that can read Java code, or we can write a C-interpreter on top of that as well. We can get into that in a separate webinar if you have questions.

 

Reading Data Into H2O with R

 

As an example of how the REST API works, if you were an R user trying to import a dataset into H2O, just to look at the data, you would do a simple R function call, h2o.importFile. Instead of read.csv, you would call h2o.importFile and, again, supply the path of the data to the function call. What happens is this R function call is carried over HTTP via the REST API to the H2O node that it's connected to. The H2O node will parse that information as a command to import data from HDFS with the path that you've supplied. It'll tell all the other nodes in the H2O cluster to do the same. All the nodes simultaneously hit HDFS to request the same piece of data.

 

As we're reading in data, we're reading it in as chunks. Chunks of data are being read in parallel into each of the nodes until there are no more chunks to share across nodes, essentially. Once we've imported that data set as a parallel and distributed data frame, what's returned to the R user is not the entire frame. We only return a summary of that distributed frame. We return that summary to the R user so they can look at it like it's a regular data frame, truncated a little bit, but they understand what the dataset looks and feels like, and they can always query the entire dataset, which is available in the H2O cluster.
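
The idea of spreading chunks over nodes and handing the client only a summary can be sketched in a few lines of Python. The talk does not say how H2O decides which node gets which chunk, so the round-robin placement below is purely an assumption, as are the function names and the six-row preview size.

```python
def assign_chunks(num_chunks, num_nodes):
    """Round-robin placement of chunk ids onto nodes (an assumed strategy)."""
    placement = {node: [] for node in range(num_nodes)}
    for chunk_id in range(num_chunks):
        placement[chunk_id % num_nodes].append(chunk_id)
    return placement


def summarize(rows, preview=6):
    """What a client might get back: dimensions plus the first few rows."""
    return {
        "nrows": len(rows),
        "ncols": len(rows[0]) if rows else 0,
        "head": rows[:preview],
    }


# Five chunks shared across a two-node cluster.
print(assign_chunks(5, 2))

# A made-up frame of ten rows by three columns; only six rows are previewed.
frame = [[i, i * 2, i * 3] for i in range(10)]
print(summarize(frame)["nrows"], len(summarize(frame)["head"]))
```

The full data stays where the chunks live; the client works with the small summary and sends further commands to the cluster when it needs more.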

 

Data Munging in R

 

These are the functionalities that you would expect to have parity between H2O and R. This works for Python as well. Instead of using read.csv in R, you would use h2o.importFile in H2O in R. For summary, we actually overload the summary method, but for things like cbind and rbind on a data frame, we've prepended h2o., so you have to do h2o.cbind or h2o.rbind on an H2O frame. Unary and binary operations are also overloaded, so we'll detect whether it's a data frame or an H2O frame and use the appropriate methods. Then for the algorithms, instead of using glm or glmnet, you have h2o.glm. Instead of using randomForest, you have h2o.randomForest.
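
Since the same parity is said to hold for Python, here is a minimal sketch of the equivalent Python client calls. It assumes the `h2o` package is installed and a cluster is already running; the default address and the helper names `load_and_describe` and `stack_frames` are assumptions for this example, and any path passed in would be your own.

```python
def load_and_describe(path, ip="localhost", port=54321):
    """Connect to a running H2O cluster, import a file, and print a summary."""
    import h2o  # imported lazily so the sketch can be read without a cluster

    h2o.connect(ip=ip, port=port)  # attach to an existing cluster
    frame = h2o.import_file(path)  # parsed inside the cluster, not locally
    frame.describe()               # per-column statistics and missing counts
    return frame


def stack_frames(top, bottom, extra_columns):
    """Row-bind two H2O frames, then column-bind a third onto the result."""
    return top.rbind(bottom).cbind(extra_columns)
```

In the Python client, `rbind` and `cbind` are methods on the frame rather than `h2o.`-prefixed functions as in R.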

 

These are analogous to a lot of the algorithms that are currently available in base R. Again, a lot of the functions are just prepended with h2o., so predict becomes h2o.predict. There are functions that you would have in other R packages that are not supported by H2O. There are ways to bring back snippets of the data set that can fit into R and use whatever R packages you want. I can show you how to move data between standard R and H2O in R, but for the most part, you want to do as much of your big data work as possible in syntax that's built for H2O in R.

 

Just before we get started, if you guys are interested during this call or after the call, it's really easy to install in R and Python because we've submitted the package to CRAN and PyPI. In R, you can just do install.packages("h2o"), and if you're using Python, for example, you can do a pip install of h2o. If you have Conda, there is a conda install of h2o as well. There's also a webpage with more instructions on how you can install in a multi-node situation rather than just on your desktop as a trial.

 

Demo

 

I'm going to go into my demo. I'm going to exit my presentation here, and I'm going to pull up my R session. For this particular demo, I'm going to use an airline data set based out of Chicago, and I'm going to use Chicago weather. What we want to predict is weather-based airline delay, so whether or not a flight will be delayed based on the weather. We don't care about whether a flight might be delayed because of, let's say, a logistics issue or a bad global event. This is strictly dependent on weather, so that you narrow down the scope of your problem to something that's solvable. We'll start with that. We start by loading the H2O library. I have an H2O cluster up that's only two nodes, 10 gigabytes on each one, and I can connect to the remote H2O cluster from my local R session. Right here, when I click connect, you can see that it's not a single-node H2O instance. You can see 2 nodes, about 20 gigabytes across the entire cluster, and about 64 cores that are available for the H2O instance to use. It's a fairly hefty cluster. What I'm going to do here is define some HDFS paths that I've specified, and then I'm going to read the data in with h2o.importFile.

 

That's going to import relatively quickly. I think it'll take about two seconds here. While that's importing, I'm going to bring you over to Flow, which is our web-based UI, to show you that even though we're doing the work from my local machine and you're looking at it through R, there are different interfaces that you can hit, because everything goes through the REST API. I just imported a data set. I can hit get frames in the web UI, and then you can see the two frames that I just imported. This flight data set is about 1 gigabyte, 1.2 gigabytes, and read into memory we've compressed it down to about 373 megabytes. You can see it's about 12 million rows by 31 columns, and the weather data set is smaller. But we will join the two to essentially grow the original data set to have more features.

 

You can click on the individual frames and the columns to see the distributions of the columns, and you can convert some of the numeric values to enums or factors and vice versa. There are ways to build models straight from Flow instead of R. You can hit build model here, or you can split the dataset right from the web UI. But today, we're going to go through the R interface. The same way we can describe the data in Flow, you can also call h2o.describe and look at the min, max, and mean of that dataset, the cardinality of enumerated columns, how many missing values there are in a particular column, and how many values are zeros. Those things could affect which algorithm you choose and which parameters you choose for each of the algorithms.

 

There are also some simple plotting functions like a histogram. We can bin some of the information from, let's say, the year column. Now you have all the flights per year that are in the data set. You can see a bit of a gap in the middle here because we took out one year. We can do some other plotting functions where I'm munging an H2O frame. This is a huge H2O frame that is truncated. You see only the first six rows by all the columns, but it looks and feels like a data frame. You know it's an H2O frame when you see it's truncated like this. I'm just going to do some simple work to define whether or not a flight is delayed based on weather. I'm going to do a group by, which is a really fast, ddply-like H2O function that allows you to do aggregations.

 

You can see that the aggregation across 12 million rows with 20 different subsets took about one second to do. What's returned is a relatively small H2O frame. It's 20 rows by four columns. This is what I mentioned before: something like this that you've finished aggregating, you can bring back into R. You can call as.data.frame to translate this H2O frame back into an R data frame. I'm just going to do that. From there, I can do whatever I want. That's a typical R command. I'm plotting the results of that because I can translate it. You can see the blue line here is the total number of flights. The red lines are the total number of delays. All the way at the end here, there's a little purple bar that shows weather-based delays. You can see that a lot of the earlier years didn't even track whether a flight was delayed based on weather. We're just going to subset that, and we're only going to take the years from 2003 to 2008, which actually have weather-based delays. We're going to take everything in the blue bar and everything in the purple bar and try to predict for the little purple bar, essentially.
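
To make the per-year aggregation concrete, here is a minimal plain-Python sketch of a group-by that counts total flights, delayed flights, and weather-delayed flights per year. It only mirrors the logic, not H2O's distributed implementation; the field names and the sample rows are made up for illustration.

```python
from collections import defaultdict


def flights_by_year(rows):
    """Count flights, delays, and weather delays for each year."""
    totals = defaultdict(lambda: {"flights": 0, "delayed": 0, "weather_delayed": 0})
    for row in rows:
        bucket = totals[row["year"]]
        bucket["flights"] += 1
        bucket["delayed"] += int(row["delayed"])
        bucket["weather_delayed"] += int(row["weather_delayed"])
    return dict(totals)


# Made-up rows; 2002 stands in for a year with no weather tracking.
rows = [
    {"year": 2002, "delayed": True, "weather_delayed": False},
    {"year": 2003, "delayed": True, "weather_delayed": True},
    {"year": 2003, "delayed": False, "weather_delayed": False},
]

for year, counts in sorted(flights_by_year(rows).items()):
    print(year, counts)
```

The result has one small record per group, which is why an aggregate like this is cheap to pull back into a local session for plotting.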

 

Here we're doing some simple filters. If you're an R user, you'll understand the syntax. You have a frame, you filter by a column and the value that you want to filter by, and you reassign it back to the original variable. The syntax itself, if this were a data frame, would be exactly the same for the same type of operation. It's really easy to get started. We've just munged the data set. Then we have an overview: after we filter it, we have about 82,000 weather-based delays out of 2 million flights, and we want to predict for that and get a decent ROC curve and AUC value. You can see I have a few parameter creation functions here.

 

This is basically taking an H2O frame. I'm going to do a modulo (mod), and then we can create a new time variable that we append back to the original flights hex frame. The flights hex frame originally had about 31 columns. You can see I have appended three new columns here, and then finally, travel time here. Then finally, we're going to build our machine learning model. I'm going to set my Y response and my X predictor variables, and then I'm going to predict whether the arrival is delayed based on the year, the month, the day of the month, the day of the week; which carrier, United versus Delta versus JetBlue; where you're flying from (in our case, all the flights are flying to Chicago, so we didn't put the destination in); and the travel time, how long you spend in the air, or the distance of that flight, essentially.
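
The talk does not show the exact expressions used, so the following is only a sketch of how integer division and modulo can derive time features. It assumes that scheduled times are stored as HHMM integers (for example 1430 for 2:30 pm), which is an assumption about the data layout, and both function names are hypothetical.

```python
def split_hhmm(hhmm):
    """Split an HHMM integer such as 1430 into (hour, minute)."""
    hour = (hhmm // 100) % 24  # 2400 wraps around to hour 0
    minute = hhmm % 100
    return hour, minute


def travel_time_minutes(dep_hhmm, arr_hhmm):
    """Minutes between departure and arrival, wrapping past midnight."""
    dep_hour, dep_minute = split_hhmm(dep_hhmm)
    arr_hour, arr_minute = split_hhmm(arr_hhmm)
    elapsed = (arr_hour * 60 + arr_minute) - (dep_hour * 60 + dep_minute)
    return elapsed % (24 * 60)


print(split_hhmm(1430))
print(travel_time_minutes(1430, 1615))
print(travel_time_minutes(2350, 115))
```

This sketch ignores time zones, so it is only meaningful when both times are expressed in the same zone.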

 

We'll do a test train split so that we can verify the error on the training set is representative of new data coming in on the validation set. Then we're just going to call h2o.glm right here with X, Y-variables. The training set, the validation set. This is a binary classification problem. The family is binomial. We have some regularization here as well as turning on lambda search to search through different lambda values for that regularization path. While I'm doing this, let me pull up Flow again and I'm going to pull up the water meter. You see each box is one machine on the H2O cluster, and the individual blue line is a core on that machine.

 

What you should see when I hit execute to build the GLM model is both boxes spiking up in green, meaning that the cores are all busy at the same time and reducing at the same time. You should see that on the side here. We're going to build a GLM model, and then we're going to build a GBM model with about 50 trees with the same parameters. This is an easy way to try all the different algorithms with different parameters and see which one works best for your dataset. It depends on how sparse your dataset is and whether it is categorical versus numerical. One algorithm might work better than another. An easy way to find out the behavior of the algorithms with the dataset is simply to try it.
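
The demo runs in R, but the same steps can be sketched with the H2O Python client. This is a minimal sketch, assuming the `h2o` package is installed and `flights` is an H2O frame already loaded in a running cluster; the column names, the 80/20 split, the `alpha` value, and the seed are assumptions, not values taken from the demo.

```python
# Hypothetical column names; the demo's actual names are not shown in the talk.
PREDICTORS = ["Year", "Month", "DayofMonth", "DayOfWeek",
              "UniqueCarrier", "Origin", "Distance"]
RESPONSE = "IsArrDelayed"


def train_glm_and_gbm(flights, predictors=PREDICTORS, response=RESPONSE, seed=42):
    """Split an H2O frame, fit a GLM and a GBM, and return their AUCs."""
    from h2o.estimators import (H2OGeneralizedLinearEstimator,
                                H2OGradientBoostingEstimator)

    # A binomial model needs a categorical response.
    flights[response] = flights[response].asfactor()
    train, valid = flights.split_frame(ratios=[0.8], seed=seed)

    # Regularized logistic regression with a search over lambda values.
    glm = H2OGeneralizedLinearEstimator(family="binomial", alpha=0.5,
                                        lambda_search=True)
    glm.train(x=predictors, y=response,
              training_frame=train, validation_frame=valid)

    # Gradient boosting with 50 trees, as in the demo.
    gbm = H2OGradientBoostingEstimator(ntrees=50, seed=seed)
    gbm.train(x=predictors, y=response,
              training_frame=train, validation_frame=valid)

    # (training AUC, validation AUC) for each model.
    return {name: (model.auc(train=True), model.auc(valid=True))
            for name, model in (("glm", glm), ("gbm", gbm))}
```

Comparing the training and validation AUC side by side for each model is the same check for overfitting that the demo does by eye.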

 

With H2O, you can basically try GLM, GBM, Deep Learning, and Random Forest in a matter of seconds and see generally, out of the box, which one does better. I'm going to report back the AUC on the GLM and GBM models. You can see that GLM didn't do as well as GBM. There's no overfitting in GBM's case either, because you can see that the AUC values on the training set and the validation set are fairly similar. The validation AUC is a little lower, but we expect that from the validation set. We want to know if we can improve upon it. This is just the regular data set without weather data. We want to see whether merging the weather information with the original data can improve this original AUC value.

 

Here we have h2o.merge. This is written by Matt Dowle, who wrote data.table, and it is a very fast merge that you can use on multiple nodes or a single node. We're just going to merge by the time of the flight: day, month, year. We can look at the merged data, and you can see that instead of the original 32 columns, or the weather data set, which is 27 columns, it's going to be the combination of the two, with 48 columns and all the original 2.2 million flights that are in our original training set.
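
What the merge does can be sketched in plain Python as a left join on the date keys: every flight is kept, and the weather columns for the matching day are attached. This is not how h2o.merge is implemented; the field names and sample values are made up for illustration.

```python
def left_merge(flights, weather, keys):
    """Attach the weather record with matching key values to each flight."""
    index = {tuple(w[k] for k in keys): w for w in weather}
    merged = []
    for flight in flights:
        match = index.get(tuple(flight[k] for k in keys), {})
        # Flight values win if both sides share a non-key column name.
        merged.append({**match, **flight})
    return merged


keys = ["year", "month", "day"]
flights = [
    {"year": 2005, "month": 1, "day": 3, "carrier": "UA"},
    {"year": 2005, "month": 1, "day": 4, "carrier": "DL"},
]
weather = [
    {"year": 2005, "month": 1, "day": 3, "precip_inches": 0.4},
]

for row in left_merge(flights, weather, keys):
    print(row)
```

The row count stays equal to the number of flights, which matches the demo keeping all of the original flights after the merge.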

 

Of course, you can view it from Flow as well. You can see the new frames that you created as well as the models that you built, and I'll show you in a little bit, after I've built the weather-based GLM and GBM models. I'm going to split the data again. This time, it's the new frame that we created. Again, we define our response variable to be the arrival delay. But the predictor variables will include new incoming data from the weather hex, essentially. You can see the new feature set is much larger than the original. There are about 31 predictors. Then we're going to build a GLM model and a GBM model. Again, you can see the cores working away.

 

Funny enough, while this is working, I can go into H2O and grab the model that is actually currently being built. GLM is being built right here. It hasn't finished the full regularization path, so you don't see it yet, but you can already see in this current iteration what is the coefficient of that model. If you like what you see, you can also actually stop the model build in the middle, and the model is ready for you to take into production with Java code that's also created on the fly. The GLM model, and I can refresh it, and you should have the ROC curve. You should have the ROC curve on the training set, which you can already see it's better than the 0.7 that we got before. It's now 0.8 with the new features.

 

Again, on the validation set, you see a similar AUC value, and going down some more, you have the standardized coefficient magnitudes for that model. There are a few more metrics that you can play around with, and again, the POJO is available for you. If you want to see really quickly, I can get models again. You can look at the GBM model that we just finished building as well. It looks slightly different, but it has the ROC curves again, and you can see the AUC value in this particular case is much higher than for the original GBM model as well. You also get a scoring history, which is particularly useful if you're worried about things like overfitting on your dataset.

 

You can see how well a model did on the training set and the validation set over the number of trees. Let's say you built a thousand trees, but you noticed that it starts to overfit at around 50 trees. You can actually prune that model early and take that model out instead of just letting it continue building and start to overfit, essentially. It's a good tool to combat or diagnose overfitting issues in tree-based models like GBM, or in Deep Learning, which could also have an overfitting issue.
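
Reading a scoring history this way boils down to finding the point where the validation metric stops improving. Here is a minimal sketch, assuming the history is available as a list of validation losses, one per tree added; the function name and the numbers are made up.

```python
def best_tree_count(validation_loss):
    """Return the 1-based number of trees with the lowest validation loss."""
    best_index = min(range(len(validation_loss)), key=validation_loss.__getitem__)
    return best_index + 1


# Made-up history: the loss improves, then creeps back up as the model overfits.
history = [0.69, 0.52, 0.44, 0.41, 0.43, 0.47]
print(best_tree_count(history))
```

If several tree counts tie for the lowest loss, this picks the smallest one, which favors the simpler model.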

 

I'm going to report back the AUC in a new table so we can look at it and compare the AUC tables. You can see here that on the original dataset, the GLM has 0.74. Now, that's 0.85. GBM originally had 0.84. Now, it's almost 0.91. You can see that it's done significantly better with more features added in. In particular, as a data scientist who needs to translate these models into business value, you want to be able to interpret the model. You can say that the AUC represents how accurate this model is in general. It's a value between zero and one, so something like 70%, 80%, 85% to 90%, which looks good. But interpreting the model itself is a little more difficult.
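
Strictly speaking, AUC is a ranking measure rather than a plain accuracy: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made up, and the pairwise loop is chosen for readability, not speed.

```python
def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = weather-delayed, 0 = not delayed.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why moving from 0.74 to 0.85 is a meaningful gain.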

 

We can do that easily with, let's say, a GLM model, which, if we go into a GLM model, you see the coefficients for each of the variables that we use. If you see right here, it's sorted by most important or the most effective to the least. You can see a categorical column with the values, rain, hail, and thunderstorm would obviously have a positive coefficient that's relatively high in this dataset. Which means as you have this variable, if it's raining and hailing and thunderstorming, you can expect a weather based delay on your flight. These are relatively easy to interpret. Something like a GBM model, for example, what you have is variable importance.
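
To see why a positive coefficient means a higher chance of delay, recall that a binomial GLM passes a weighted sum of the features through the logistic function. The sketch below uses made-up coefficient values, not the ones from the demo model, and the feature names are hypothetical.

```python
import math


def delay_probability(intercept, coefficients, features):
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values for illustration only.
intercept = -3.0
coefficients = {"rain_hail_thunderstorm": 2.5, "visibility_miles": -0.1}

clear_day = {"rain_hail_thunderstorm": 0, "visibility_miles": 10}
stormy_day = {"rain_hail_thunderstorm": 1, "visibility_miles": 2}

print(delay_probability(intercept, coefficients, clear_day))
print(delay_probability(intercept, coefficients, stormy_day))
```

Turning on a feature with a positive coefficient raises the weighted sum and therefore the predicted probability, which is what makes GLM coefficients comparatively easy to read.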

 

You can see how important each of the variables were in terms of improving the accuracy on the model. You can see that origin was split by multiple times more frequently than maybe other variables. Each time it was split by the origin point where you're flying from, you get a decrease in your loss function, which is an improvement on your model. Another way to interpret it in H2O is we have something called partial dependence plots. You can grab the variable importance and look at the most important variables, and then choose a few columns to calculate partial dependence plots. Here I picked visibility by miles, precipitation in inches, the max temperature, or the mean wind speed in miles per hour.

 

If I calculate the partial dependence plots, you can see all the jobs are happening in Flow as well. Let me see if I can grab that. You can see the partial dependence plot being calculated here. It's still running, and you can refresh it as it goes along. This is for four features. I think it will take about 20 to 30 seconds because it's scoring repeatedly over the same dataset with different values for each of the columns. It's a little more computationally intensive. Once we finish that, what you have are plots that you can view. For example, we have the mean response versus the visibility in miles. As the visibility in front of the plane increases, you should see that the response for arrival delay decreases. You're less likely to be delayed if you can see what's in front of you. You can also see that if it's raining more and more, you're going to be more likely to have weather-based delays. This is another way to interpret machine learning models that are traditionally not easy to interpret. This is an area that we're working really heavily to improve. You can interpret things like Deep Learning models that have really complicated networks. You can explain a GBM model and the features going into a GBM model a little bit, even though this is going to be 50 or 100 shallow trees, which are usually not easy to interpret. With that, I think I am done with my R and Flow demo.
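
The repeated scoring described above is the core of a partial dependence calculation: force one feature to a fixed value in every row, score the whole dataset, average the predictions, and repeat for each value on a grid. Here is a minimal sketch with a hypothetical stand-in model; the feature names, grid, and rows are made up.

```python
def partial_dependence(model, rows, feature, grid):
    """Mean prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        predictions = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_model(row):
    """Hypothetical scorer: delay risk falls with visibility, rises with rain."""
    return 0.8 / (1.0 + row["visibility_miles"]) + 0.1 * row["precip_inches"]


rows = [
    {"visibility_miles": 2, "precip_inches": 0.0},
    {"visibility_miles": 8, "precip_inches": 1.0},
]

for value, mean_response in partial_dependence(toy_model, rows,
                                               "visibility_miles", [0, 1, 3]):
    print(value, round(mean_response, 3))
```

Each grid value requires a full pass over the data, which is why the calculation takes noticeably longer than a single scoring run.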

 

Q&A With H2O Team

 

Ian:

 

Let's go ahead and transition into Q&A. We have some great questions in the questions window. Let me go through them real fast. The first one is, what is your business model?

 

Desmond Chan:

 

Ian, let me take that. Our business model is that we are an open source company, and we offer enterprise support. The measurement of usage is based on the terabytes of data that you have, and we also offer different types of SLAs according to the tier of support that you get. If you have any more questions about our enterprise support, please just reach out to our sales team. The email address is sales@h2o.ai.

 

Ian:

 

Thanks, Desmond. Here's another question. What does your roadmap look like?

 

Desmond Chan:

 

Again, let me take that. There are many exciting things going on. Just to give you a preview, in the near future we will have more announcements about our GPU enablement, meaning the machine learning algorithms being tightly integrated with GPUs and running accurately and quickly on GPUs. Also, for Steam, we are working on securing the Hadoop infrastructure for Steam users. That is another thing that you might want to look out for. Going forward, our ultimate vision is all about accuracy, speed, and model interpretability. You may have seen our blog on model interpretability by our coworkers Patrick Hall and Wen Phan and also our CEO, Sri Ambati. If you're interested in model interpretability, that is also one direction that we're going in. Hopefully, that gives you a preview of our roadmap. Go on with the next question, please.

 

Ian:

 

Thank you so much, Desmond. Let me filter through. These are some good ones. Do you hold actual face-to-face meetups? We do; we are growing our meetup communities throughout the world. I know that I posted this in various cities in the US. We are looking for community champions in those cities, so feel free to directly email me at ian@h2o.ai if you're from one of those cities that we posted this meetup webinar at. Thank you for that question. Next one. All right, so does H2O have support for SVM?

 

Amy Wang:

 

H2O out of the box has the core algorithms we mentioned before. SVM is not one of the algorithms out of the box in H2O, but it is available with Spark. You can download the H2O Sparkling Water installation, and you'll be able to use SVM in Sparkling Water.

 

Ian:

 

Awesome. Next question, what is the open source licensing of what was demonstrated?

 

Amy Wang:

 

It's just an Apache V2 license. Tom, do you have anything to add to that?

 

Tom:

 

No, Apache V2.

 

Ian:

 

Let's go on to our next one. Do you have support for CNNs (convolutional neural networks)?

 

Amy Wang:

 

Again, H2O's Deep Learning is just a deep feedforward neural network, but there is a project, Deep Water, that is currently being worked on that will allow you to use other deep learning or neural network platforms, which will include convolutional neural networks. You can use MXNet, for example.

 

Tom:

 

MXNet. TensorFlow. There were a number of questions about that and also about GPU support in general. Just keep an eye out for a future webinar, which will get into that in more detail. You can check out the projects that are being worked on in GitHub. The Deep Water project is also open source. It's on GitHub. You could go take a look at it. There's an Amazon machine image that you can use, which is probably the easiest way to consume it today. If anybody is interested in that, I'd encourage you to go look at that as our way of delivering that to you.

 

Ian:

 

Great. I'll get to another question; we have some time.

 

Tom:

 

There was one here about a particular regression model that I'd never heard of, LOESS. I apologize, I don't know what that is. I guess that means we don't support it, most likely. That would be something that we'd have to look into. To the person that asked about that: if that's really interesting to you, then write to us and we can follow up on that.

 

Ian:

 

I have another question. Is there functionality for running streaming predictions with live data?

 

Tom:

 

That's a good question. Yes, you can take the H2O models in the POJO or the MOJO form and actually embed them in lots of different environments. An example of a streaming environment would be either Spark Streaming or Storm (a Storm bolt). You can take H2O models and very, very nicely embed them into environments like that because the final POJO or MOJO model has very little software stack. It's an extremely, extremely thin stack. It does not require the full H2O to be running to make predictions in a streaming environment. Those are great deployment options, especially for real-time use cases.
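Editor's illustration: the idea of scoring live data with a standalone model can be sketched in a few lines of Python. This is a minimal sketch, not H2O's API. The `toy_model` function is a hypothetical stand-in for an exported model's predict call, and the `dep_delay_minutes` field is an invented example feature. The point is only that the scoring loop needs nothing but a function from a row to a prediction, with no cluster running.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Row = Dict[str, float]


def score_stream(
    events: Iterable[Row], predict: Callable[[Row], float]
) -> Iterator[Tuple[Row, float]]:
    """Score each event as it arrives, one row at a time."""
    for event in events:
        yield event, predict(event)


def toy_model(row: Row) -> float:
    # Hypothetical stand-in for an exported model; a real deployment
    # would call the generated POJO or MOJO scoring code here.
    return 0.8 if row["dep_delay_minutes"] > 15 else 0.1


# A list stands in for the live feed (for example, messages from a queue).
events = [{"dep_delay_minutes": 3.0}, {"dep_delay_minutes": 42.0}]
scores = [prediction for _, prediction in score_stream(events, toy_model)]
```

Because `score_stream` is a generator, it holds only one event in memory at a time, which is the property that makes a thin scoring artifact a good fit for a streaming framework.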

 

Ian:

 

Awesome. Thanks, Tom. Let's go through some more questions.

 

Tom:

 

Can H2O get data from MySQL? Amy, do you want to take that?

 

Amy Wang:

 

There is a JDBC connector. If you take the MySQL driver and launch H2O with that driver, you should be able to do that. I didn't demonstrate this, but there is a function call in R and Python, h2o.import_sql_table or h2o.import_sql_select, that will allow you to query a table out of a database and into H2O as an H2O frame in memory.
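Editor's illustration: a minimal Python sketch of the SQL import described above, using the `h2o.import_sql_table` function from the H2O Python client. The host, port, database, and table names are hypothetical placeholders, and the sketch assumes H2O was launched with the MySQL JDBC driver on its classpath. The H2O call is wrapped in a function so that nothing connects to a cluster until you call it.

```python
def mysql_jdbc_url(host: str, port: int, database: str) -> str:
    """Build a JDBC connection URL for a MySQL database."""
    return f"jdbc:mysql://{host}:{port}/{database}"


def load_table(connection_url: str, table: str, username: str, password: str):
    """Import a whole SQL table into H2O as an in-memory H2OFrame.

    Assumes the h2o package is installed and that the H2O process was
    started with the JDBC driver JAR on its classpath.
    """
    import h2o  # imported lazily so the URL helper works without h2o

    h2o.init()  # connect to (or start) a local H2O instance
    return h2o.import_sql_table(connection_url, table, username, password)


# Hypothetical connection details, for illustration only.
url = mysql_jdbc_url("localhost", 3306, "flights_db")
# frame = load_table(url, "flights", "my_user", "my_password")
```

To pull the result of a query rather than a whole table, the client also provides `h2o.import_sql_select`, which takes a SELECT statement in place of the table name.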

 

Tom:

 

There were a couple of questions here about Azure. Do you have an Azure image? This one's interesting. Is H2O like MS Azure, but with clusters? Do you want to say a few words about Azure, Desmond?

 

Desmond Chan:

 

With Azure, we do have a partnership with them, and we do have our H2O deployed in a container on Azure.

 

Tom:

 

In the Azure cloud. H2O is just a Java process. It can really run anywhere. It can run on all of the different cloud frameworks. It can run on EC2. It can run on Azure. It can run on all of them. It is possible to run there, and a convenient way of doing that is actually in the works and might even be available. We'd have to check on whether it's available for everybody today. If it's not available today, it's one of those things that's actually under development. Stay tuned for that.

 

How to import R libraries into H2O? That's an interesting question. What is probably the right way to answer that is to say what we've done with H2O is we've taken a set of core algorithms and rewritten them in Java to be parallel and distributed. What we don't do is take all of the thousands of CRAN packages and magically run them in H2O in a parallel and distributed way. We've taken very specific ones like GLM, Random Forest, and Gradient Boosting and decided to re-implement those in Java in a scalable way and focus on those. But what's nice is you can also move data back and forth between your desktop environment and the H2O cluster environment. You can use those different packages side by side. You can actually use the R packages that you're used to using side by side with the H2O package. It's only the H2O package's actual capabilities, though, that will be parallel and distributed on the H2O cluster.

 

Amy Wang:

 

Just to add to that, the unary and binary functionalities are there. If you have something simple that you want to implement, you can actually write R user defined functions using H2O on the backend. You just have to be sort of clever about it, but it is doable, and you can sort of re-implement a lot of the functionalities out there in R.

 

Tom:

 

Here's a fun question. What does MOJO stand for? My understanding is that, mostly, it's just that it sounds cool. I think it actually does stand for model Java, model object, optimized. Is that right?

 

Amy Wang:

 

Model object, Java. Yeah, I don't know where the Java sticks in there.

 

Tom:

 

I think it's model object optimized. But honestly, mostly, it's just because it sounded cool.

 

The difference between POJOs and MOJOs is probably worth spending a moment on, just because we want to differentiate between the two. POJOs are generated code representations of models, and MOJOs are generated data representations of models. You can look at the POJO for, say, a tree model, like a gradient boosted tree model. It's quite easy to see the different decisions being made because it's literally if-then-else code that's generated for the implementation of the model. With the MOJO, instead of generating code for the model, we generate a data representation of the model, and then there's a little interpreter that knows how to walk the representation and take a new row of data and make a prediction for it. The two different model implementations, POJO and MOJO, have different trade-offs and different strengths and weaknesses, and it just sort of depends on your environment. But they're both very good for real-time use cases. They're both excellent for that.
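Editor's illustration: the code-versus-data distinction can be shown with a toy decision tree in Python. This is a conceptual sketch only. Real POJOs are generated Java source and real MOJOs use H2O's own serialized format; the feature names (`distance`, `dep_hour`), thresholds, and leaf values below are invented for illustration.

```python
# "POJO-style": the model is emitted as source code, so every split
# in the tree becomes a literal if/else branch.
def pojo_style_predict(row):
    if row["distance"] < 500.0:
        if row["dep_hour"] < 17:
            return 0.10
        else:
            return 0.30
    else:
        return 0.20


# "MOJO-style": the same tree stored as data. A split node is
# (feature, threshold, left_index, right_index); a leaf is ("leaf", value).
TREE = [
    ("distance", 500.0, 1, 2),  # node 0: root split
    ("dep_hour", 17, 3, 4),     # node 1: second split
    ("leaf", 0.20),             # node 2
    ("leaf", 0.10),             # node 3
    ("leaf", 0.30),             # node 4
]


def mojo_style_predict(tree, row):
    """A small interpreter that walks the data representation."""
    node = tree[0]
    while node[0] != "leaf":
        feature, threshold, left, right = node
        node = tree[left] if row[feature] < threshold else tree[right]
    return node[1]


row = {"distance": 300.0, "dep_hour": 20}
same = pojo_style_predict(row) == mojo_style_predict(TREE, row)
```

Both functions return the same prediction for any row. The difference is that the first grows as source code with the size of the model, while the second keeps the interpreter fixed and only the data grows.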

 

Ian:

The other question is, do you have use cases for MOJOs to do anomaly detection and data quality management?

 

Amy Wang:

 

Just more questions about MOJOs. We have the MOJO right now for all the algorithms in particular. I don't know if we have built the anomaly detection, the autoencoders from Deep Learning, into the MOJO as well.

 

Tom:

 

I don't know if it's there actually today; we'd have to look. It's definitely in the POJO, though. If you want to use that, then that's definitely available. The intent is that, eventually, we'll put everything into MOJO. We just sort of do it according to who's asking for what and when. We absolutely intend to make everything available in the MOJO form factor.

 

Amy Wang:

 

It's just that we work on these demands based on the frequency of the requests as well as who makes them: whether a customer needs it and it's a blocker to going into production, versus an open source request from the community.

 

Tom:

 

The short answer, though, is we will absolutely have that. That's absolutely in the plan.

 

Here's a question. Do you have support for the Lua programming language? And we do not. That's not something that we have today. We've got R, we've got Python, we've got Java, Scala, and those are the primary set of languages.

 

Are we planning to have a physical meetup in location X? There were a few of these. Maybe the right way to answer that is to say on our website, there's an events page and that has a list of where the upcoming meetups are. I think Ian might have touched on that, and I'm just reiterating.

 

Could you point to some resources for learning further about H2O? The best resource to start at is the H2O documentation page. Actually, it might even be a good thing to bring up and to show everybody. At the docs.h2o.ai page, you will see an overview of all of the documentation, and then you can drill into various parts depending on what you're interested in. You can zoom in on the algos. You can zoom in on the H2O user guide. That's actually the quote-unquote main H2O documentation, the H2O user guide. We've been putting in a lot of effort to try to make this more and more comprehensive over time. The most recent addition to this has been extremely detailed parameter documentation for gradient boosting, for example.

 

You can find a very, very thorough parameter-by-parameter example for gradient boosting, where there's a code snippet for each parameter as well as a description of what it does. I would say this is a great resource. If we just back up to the main page again, there are other high-level topics, like grouping by language and grouping by tutorials. Many of the conference presentations and meetup videos are available here under the presentations link. For developers, there's developer documentation, and then architecture, security, and productionizing recipes for enterprise folks. This is the place that I would start if you're looking for H2O documentation all in one place. It's docs.h2o.ai.

 

Amy Wang:

 

On that note, I think we got a question about how it is different to run H2O online or on your desktop. On your desktop, you have your laptop resources; the amount of memory and CPU is limited to what a laptop can bring. Online or on a Hadoop cluster, you usually have more resources. You have more nodes, more CPUs. You have someone to do security around your data file system. You have things like YARN to manage those resources. Typically, we try to plug into wherever you have a big data environment. To get started, if I didn't mention it before, there is also an h2o.ai/download page that you can hit. If you go into the latest stable release, to get started you might want to download just a JAR file, or you might want to just install in R or Python on your desktop to try the functionalities. But eventually, if you want to install on Hadoop, for instance, there's a different way to install. The functionalities will essentially be the same.

 

Tom:

 

Here's a question. Does H2O have support for neural networks? I'm sure that Amy mentioned that, and yes, there is Deep Learning inside core H2O. Core H2O has a feedforward neural network framework with backpropagation. It's a CPU-only framework. That's going to take advantage of a distributed Hadoop environment, for example, if you've got one. You can run Deep Learning in just regular core H2O on the CPUs in your cluster. From the standpoint of our next generation Deep Learning, and taking advantage of GPUs and doing things like CNNs, for example, that's in Deep Water. You can go check that out and we'll put that in. I would say that it belongs in its own webinar in the future. That's a very good topic to cover. I think we're getting close to running out of time here.

 

Ian:

 

I guess that will wrap up our webinar today. I know a lot of you have asked if the slides and recording are going to be available. Yes, they will be on YouTube in about 24 hours, and the slides will be posted to the respective meetup groups as well. We'll follow up with a survey just to get your feedback on our webinar and future topics. Special thanks to Amy and Desmond for presenting today, and Tom. If we weren't able to answer your question, please ask on stackoverflow.com. If you haven't tried H2O yet, feel free to download it from our website at h2o.ai. Thanks again, everybody. Have a good day.