Our goal at Fivetran is to make data science more accessible by automating the grunt work of data engineering. Our counterparts at Google and Looker have made it even easier to ascend the hierarchy of data needs by providing machine learning through BigQuery and LookML.
As Abhishek Kashyap noted at Cloud Next ‘18, data scientists are scarce and moving data at scale in and out of BigQuery (or any other data warehouse) is a pain. BigQuery ML allows you to build machine learning models in SQL and keep them in the same environment as the original data. Likewise, Looker allows you to keep all of your analytics code on one platform.
While you won’t be able to run convolutional neural networks or other forms of deep learning using BigQuery ML, you can create simple prediction (linear regression) and classification (binary and multiclass logistic regression) models, as well as k-means clustering models. Regression models are commonly used in both industry and academia because of their interpretability and fast execution.
The Basic Machine Learning Workflow
Your basic machine learning workflow looks something like this:
- Collect data
- Generate a hypothesis and define features
- Prepare a training set
- Assemble a model
- Validate the model
- Export the model
The majority of a data scientist’s time is spent acquiring and wrangling data; this includes both steps 1 and 3. With Fivetran, BigQuery ML and LookML in the mix, steps 1 and 3 become greatly simplified. Step 6 becomes wholly unnecessary, as you can build, test and save your models on the same platform as the data.
Let’s walk through an example.
It can be hard to come up with a hypothesis in a vacuum, and “common sense” will only take you so far. Your goal is to see how some important metric, called the dependent variable or target — say, revenue, renewal rates, page views or sign-ups — appears to relate to other factors, called independent variables or features, so that you know what to influence in order to achieve a positive outcome.
Choose an explore and thoroughly explore it for associations between variables. The most straightforward approach is to perform pairwise comparisons between every pair of variables that plausibly shares some kind of relationship. Scatterplots are the simplest way to visually explore these relationships:
Positive, zero and negative correlations, courtesy of Free Code Camp.
In a Looker Explore, you can add trendlines to your visualizations by using the EDIT button in the upper right of the VISUALIZATION dropdown, selecting the Y tab, and scrolling down. Note that in order to draw trendlines through scatterplots, you must have the appropriate admin privileges. One of the assumptions of multiple linear regression is that the associations between the target and features are, in fact, linear, so it helps to confirm that they can be reasonably described using straight trendlines.
As another basic sniff test, you can get correlation coefficients using table calculations in your explore. Click the Calculations button in the DATA dropdown and use the correl function. The rule of thumb for correlations is that: