Snowflake Beginners Guide

Snowflake publishes a cloud data platform that brings together all the data produced by a company's various information systems into a single source of truth, which can then be used for analytics or to train machine learning models.

Founded in 2012, Snowflake was born in San Mateo, California, thanks to Benoît Dageville and Thierry Cruanes, both former Oracle employees, and Marcin Zukowski, co-founder of the Vectorwise database.

After two years of developing the product in Thierry Cruanes' apartment in San Mateo, Snowflake launched in October 2014 as the first Data-Warehouse-as-a-Service offering, shaking up an aging market of on-premise solutions and getting ahead of AWS, GCP and Azure, since at that time none of the cloud giants had invested in this segment of the cloud.

Regardless of their size, companies have understood that the data they generate can hide relevant information that could help them better understand their customers.

Big data is a vast ecosystem, with numerous technologies making it possible to exploit several terabytes of data. It is an opportunity best seized by large companies or scale-ups that generate a significant volume of data and have a substantial budget to invest in it.

When should a company be interested in Big Data?

Any business creates data as it operates. It starts with lists of customers and prospects, then quotes and invoices, to which are added transactions, email exchanges, and interactions on the various social media platforms.

As the company grows, the sources of data will multiply and the formats will be more and more varied.

We talk about Big Data when the volume, variety and velocity of data become so large that traditional software can no longer keep up with it.

Below a terabyte of data, a typical PostgreSQL relational database or MongoDB database is sufficient to meet most business needs.

On the other hand, if the volume of data grows and the business needs more analysis or data processing, it becomes necessary to separate the databases that keep the company's information system running from the data needed for analysis.

In this case, it will be necessary to consider a Data Lake or a Data Warehouse.

Between business needs and analytical needs

Depending on the company’s activity, the execution needs of the business may impose technical constraints. The different database technologies are divided into two large families according to their main missions.

On one side are databases intended to store data in real time, such as MySQL or MongoDB. They are categorized as OLTP (OnLine Transaction Processing) databases.

Other technologies have data analytics at their core. These are OLAP (OnLine Analytical Processing) databases. They are able to process large volumes of data quickly in order to produce various reports. For example, imagine a database that centralizes all the sales recorded by every checkout on Amazon.com. If you want to sum, per day, all the revenue generated by computers and accessories, OLAP technologies can perform this processing up to 1,000 times faster than an OLTP technology.

Snowflake, and Data Warehouses in general, are part of these OLAP technologies. They are intended to allow different applications to quickly access and manipulate certain data via SQL queries, or even Python, R or other languages.
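
To make the OLTP/OLAP distinction concrete, here is a minimal sketch of the aggregation described above, run against Snowflake with the snowflake-connector-python package. The connection parameters and the `sales` table with its columns are hypothetical placeholders, not part of any real schema.

```python
# Hedged sketch: daily revenue for one category, a typical OLAP aggregation.
# Account, credentials, warehouse and the `sales` table are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.eu-west-1",  # hypothetical account identifier
    user="ANALYST",
    password="***",
    warehouse="ANALYTICS_WH",
    database="RETAIL",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    # Scan and aggregate many rows at once: the workload OLAP engines excel at.
    cur.execute("""
        SELECT DATE_TRUNC('day', sold_at) AS sale_day,
               SUM(amount)                AS daily_revenue
        FROM sales
        WHERE category = 'computers_and_accessories'
        GROUP BY sale_day
        ORDER BY sale_day
    """)
    for sale_day, daily_revenue in cur.fetchall():
        print(sale_day, daily_revenue)
finally:
    conn.close()
```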

Data Lake or Data Warehouse

When an organization decides to invest in big data, one of the first steps is to define an infrastructure to collect the data. The two main forms of big data storage are Data Lakes and Data Warehouses.

To distinguish these two types of data repositories, it is necessary to understand the usefulness of each of them as well as their users.

The Data Lake is a place where the various applications of the company’s information system will dump the data that they can collect or produce with very little upstream processing.

The data present in a Data Lake is most often flat, that is to say without joins or references, coming from various sources, and can be structured, semi-structured or unstructured.

This infrastructure is often the easiest to set up but the most difficult to operate. Indeed, the highly variable data formats require processing before the data can be exploited. Without this design and processing work, a Data Lake can turn into a Data Swamp, rendering the data it contains unusable.

The Data Warehouse is a more elaborate infrastructure model. Data entering the warehouse needs to be processed before being stored there. The Data Warehouse bases the exploitation of the data it contains on an ETL process (Extract, Transform, Load) that loads data from the various applications.

To ensure the proper functioning of the Data Warehouse, the data collected will be:

  • Subject-oriented: data is organized by topics, allowing all of the company’s data to be categorized in a single warehouse
  • Integrated: the data must be processed and then integrated into the data warehouse, regardless of their origin. They can come from traditional OLTP databases, such as a relational database used in an IS application, or from an external source such as analytical data or from third-party sources such as Facebook or any other social network.
  • Time-variant: the stored data is immutable. Unlike a traditional database, which modifies a row in place if necessary, the objects stored in a data warehouse remain immutable and each modification adds a new row. This provides a complete history for each object (see the sketch after this list).
  • Non-volatile: once stored, data cannot be erased.
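
As a small illustration of the time-variant and non-volatile points above, here is a hedged sketch of appending each change as a new row instead of overwriting the previous one. The connection parameters and the `dim_customers` table are hypothetical.

```python
# Hedged sketch: append-only history instead of in-place updates.
# Connection details and the `dim_customers` table are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ETL", password="***",
    warehouse="LOAD_WH", database="DWH", schema="PUBLIC",
)
cur = conn.cursor()

# An OLTP system would typically run:
#   UPDATE customers SET email = 'new@mail.com' WHERE id = 42;
# In the warehouse, the change is appended as a new version instead:
cur.execute("""
    INSERT INTO dim_customers (customer_id, email, valid_from)
    VALUES (42, 'new@mail.com', CURRENT_TIMESTAMP())
""")

# The current state is the most recent row per customer_id;
# older rows keep the complete history of the object.
conn.close()
```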

Snowflake: What is it?

When launched in October 2014, Snowflake was the first data warehouse solution designed to be delivered on a cloud. The commercial name chosen for the solution was Snowflake Elastic Data Warehouse.

The idea behind Snowflake Elastic Data Warehouse was to offer users a cloud solution that would bring together all the data and processing in a single data warehouse, while guaranteeing good data-processing performance, flexibility in storage, and ease of use.

Snowflake started with the following value propositions:

  • Data warehousing as a service. Thanks to the Cloud, Snowflake eliminates the problems of infrastructure administration and database management. As with a DBaaS, users can focus on processing and storing data. By getting rid of physical infrastructure, the cost of a data warehouse becomes variable and can be adapted to the size and power the customer requires.
  • Multidimensional elasticity. Unlike products on the market at the time, Snowflake had the ability to scale storage space and computing power independently for each user. Thus, it was possible to load data while running queries without having to sacrifice performance, because resources are dynamically allocated according to the needs of the moment.
  • Single storage destination for all data. Snowflake allows all of the company’s structured and semi-structured data to be stored centrally. Analysts wishing to manipulate this data will be able to access it in a single system without the need for processing before they can do their analytical work.

The Unique Architecture of Snowflake

A hybrid architecture between Shared-Disk and Shared-Nothing

Snowflake makes large-scale data storage and analysis possible with its innovative architecture. Being exclusively a cloud product, Snowflake relies on virtual computing instances, such as Elastic Compute Cloud (EC2) at AWS, for compute and analysis operations, in addition to a storage service, such as Simple Storage Service (S3), to persist the data in the Data Warehouse.

As with any database, a Snowflake cluster has storage resources (or disk memory), RAM, and CPU computing power. Snowflake is based on a hybrid architecture, mixing a shared disk approach (Shared-Disk architecture) with an isolated one (Shared-Nothing architecture).

All the data stored in the Snowflake Data Warehouse is gathered in a single central repository, as in shared-disk architectures, and is accessible by all the compute nodes in the cluster.

On the other hand, queries run on Snowflake are processed by MPP (Massively Parallel Processing) compute clusters, in which each node holds only a portion of the data present in the Data Warehouse.

By mixing the two approaches, Snowflake offers the simplicity of data management of a centralized data space, combined with the query performance of a Shared-Nothing architecture on the data the warehouse contains.

The Three Layers of Snowflake

The Snowflake Data Warehouse is based on three layers:

  • Data storage
  • Query processing
  • Cloud services

When data is inserted into your Snowflake warehouse, Snowflake compresses it, reorganizes it into its columnar format and enriches it with metadata and statistics. The raw data is no longer directly accessible, only through queries (in SQL, or via R or Python connectors) made through Snowflake.

Snowflake also has a processing layer to handle queries on the data. Queries are executed on “virtual warehouses”. Each virtual warehouse is an MPP cluster based on a Shared-Nothing architecture, with several nodes, each holding only part of the Data Warehouse's data.

Each virtual warehouse is capable of processing many simultaneous queries, and its compute cluster can grow or shrink depending on the workload at any given time. The different virtual warehouses share no resources, neither compute, memory, nor storage, so each warehouse has no resource conflicts or competing requests for the same data.
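
As an illustration of this elasticity, here is a hedged sketch of creating and resizing a virtual warehouse with standard Snowflake SQL commands sent through the Python connector; the warehouse name and sizes are illustrative.

```python
# Hedged sketch: create an XS virtual warehouse, then scale it up.
# Account details and the warehouse name are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(account="xy12345", user="ADMIN", password="***")
cur = conn.cursor()

# An XS warehouse that suspends after 60 idle seconds and resumes
# automatically when a query arrives, so idle time is not billed.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS analytics_wh
    WITH WAREHOUSE_SIZE = 'XSMALL'
         AUTO_SUSPEND   = 60
         AUTO_RESUME    = TRUE
""")

# Scale the same warehouse up for a heavier workload, without touching
# storage or any other warehouse.
cur.execute("ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM'")

conn.close()
```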

Finally, cloud services form the top layer of the Snowflake infrastructure, coordinating activity across the Data Warehouse. These services allow users to authenticate, launch or optimize data queries, administer clusters and access many other features.

Data Protection in Snowflake

Snowflake ensures the integrity of the data it hosts via two features, Time-Travel and Fail-Safe.

Time Travel keeps the previous state of data whenever it is modified, for the entire configured duration. Limited to a single day of history in the Standard edition, Time Travel can be configured for up to 90 days with the Snowflake Enterprise license and allows reverting to a previous state of a table, a schema or an entire database.
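
Below is a short, hedged sketch of Time Travel in practice, using Snowflake SQL sent via the Python connector; the table name and retention period are illustrative.

```python
# Hedged sketch: configure and use Time Travel on a table.
# Connection details and the `orders` table are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ADMIN", password="***",
    warehouse="ANALYTICS_WH", database="DWH", schema="PUBLIC",
)
cur = conn.cursor()

# Extend the retention window (values above 1 day require Enterprise edition).
cur.execute("ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 30")

# Query the table as it was one hour ago.
cur.execute("SELECT COUNT(*) FROM orders AT(OFFSET => -3600)")
print(cur.fetchone())

# A table dropped by mistake can be restored within the retention window:
#   UNDROP TABLE orders;

conn.close()
```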

The Fail-Safe feature offers a 7-day backup after the end of the Time Travel period, in order to recover data that may have been corrupted by errors during operations.

Both of these features are themselves data creators and count toward the billed storage space of the Snowflake cluster.

Pricing: How much does a Snowflake cluster cost?

Since Snowflake is available exclusively in the cloud, its price varies according to your usage and is driven by three parameters underlying the operation of the service: storage costs, compute costs, and the desired level of service.

Storage costs are very simple and explicit on Snowflake: $23 per terabyte per month if you choose to pay in advance for a year of use. If you prefer to pay monthly, the price rises to $40 per terabyte.

Compute costs are more complex to understand, since they vary depending on the choice of cloud provider (Azure, AWS or GCP) as well as the geographical region where you want to host your cluster.

Snowflake measures compute capacity in a unit it calls a credit. One credit represents one full hour of an XS-size compute warehouse. Snowflake charges per second of use of your warehouse, with a minimum billing of one minute.

Depending on the support, features and standards you want for your warehouse, you will need to choose a Snowflake license. The same credit will have a different cost depending on the license you choose.
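
As a rough illustration of how these parameters combine, here is a back-of-the-envelope estimate in Python. The per-credit price is an assumed example value, since actual prices depend on the license, cloud provider and region; the credits-per-hour figures reflect each warehouse size doubling the consumption of the previous one.

```python
# Back-of-the-envelope Snowflake cost estimate.
# CREDIT_PRICE_USD is an assumed example value, not an official price.
CREDIT_PRICE_USD = 3.00            # assumed; varies by license, cloud and region
STORAGE_USD_PER_TB_MONTH = 23.0    # pre-paid storage price quoted above

# Credits consumed per hour roughly double with each warehouse size.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def monthly_compute_cost(size: str, hours_per_day: float, days: int = 30) -> float:
    """Cost of one warehouse of `size` running `hours_per_day` for `days` days."""
    credits = CREDITS_PER_HOUR[size] * hours_per_day * days
    return credits * CREDIT_PRICE_USD

def monthly_storage_cost(terabytes: float) -> float:
    return terabytes * STORAGE_USD_PER_TB_MONTH

# Example: a Medium warehouse running 8 hours a day, 20 days a month,
# plus 2 TB of stored data.
compute = monthly_compute_cost("M", hours_per_day=8, days=20)
storage = monthly_storage_cost(2)
print(f"compute ~ ${compute:,.0f}/month, storage ~ ${storage:,.0f}/month")
```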

For an SME that wants to get into big data and is setting up its first Data Warehouse, the budget for a Snowflake cluster will be around €7,000 per year with prepayment.

For a larger company, ingesting a lot of data and requiring large processing and calculation capacities, the invoices can go beyond €500,000 annually.

The Snowflake pricing page does not provide a simulation of cluster costs, but there are unofficial calculators to help you estimate your annual costs for a Snowflake Data Warehouse.

Benefits of using Snowflake

Today, more than 6000 companies have chosen Snowflake, collectively paying more than a billion dollars to the American firm.

  • Its hybrid architecture is its real strength, giving users the ability to run complex queries on a large volume of data efficiently and simultaneously.
  • Snowflake also offers multi-cloud and multi-region availability. Whether your IS is based on AWS, GCP or Azure, Snowflake can be deployed alongside it.
  • The adaptability to structured data (organized in columns, as in SQL tables) and semi-structured data (XML, JSON, Avro, Parquet…) is another strong point of Snowflake. It can ingest this type of data and reorganize it internally so as to answer future queries involving it.
  • External applications or partners of the company can also rely on the warehouse's data volume and processing performance, with restricted rights limiting their access to the Warehouse's data.

Push data to Snowflake

The challenge of a Data Warehouse is being able to push data into it, both data produced internally by the various applications of the company's information system and external data that can be useful for analysis work.

To load data into a Data Warehouse such as Snowflake, it is essential to go through an ETL tool: software that Extracts, Transforms and then Loads the data from its source into the warehouse, while respecting the formatting rules imposed by the data schemas.
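
As a minimal sketch of such a load, the example below stages a local CSV file and copies it into a Snowflake table using the Python connector. The file path, table name and file-format options are illustrative, and a real ETL pipeline would add transformation, error handling and scheduling.

```python
# Hedged sketch: stage a local CSV file, then COPY it into a table.
# Connection details, the file path and the `orders` table are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="ETL", password="***",
    warehouse="LOAD_WH", database="DWH", schema="PUBLIC",
)
cur = conn.cursor()

# Upload the file to the table's internal stage (compressed automatically).
cur.execute("PUT file:///tmp/orders_2024_01.csv @%orders")

# Load the staged file into the table, skipping the header row.
cur.execute("""
    COPY INTO orders
    FROM @%orders
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

conn.close()
```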

Use connectors for market tools

For mainstream business applications, such as Salesforce, Shopify, SAP, Workday and many others, there are many connectors to plug the application directly into Snowflake.

Stitch and Fivetran are two examples of ETL (Extract, Transform, Load) services offering a catalog of connectors, ranging from business applications to cloud or on-premise databases.

Snowflake for Machine Learning

Business Intelligence projects are the primary consumers of a Data Warehouse like Snowflake.

But since the democratization of Machine Learning and the development of products such as TensorFlow, or ready-to-use ML APIs, Data Warehouse providers are seeing a new type of customer connecting to their data source.

Snowflake has provided Data Scientists with connectors to Machine Learning products such as AWS SageMaker or Dataiku.
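
A typical workflow is to pull a feature table out of Snowflake into a pandas DataFrame before handing it to an ML framework. The sketch below assumes the connector is installed with its pandas extra (snowflake-connector-python[pandas]); the table and columns are hypothetical.

```python
# Hedged sketch: fetch a training set from Snowflake into pandas.
# Connection details and the `customer_features` table are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="DS", password="***",
    warehouse="ML_WH", database="DWH", schema="PUBLIC",
)
cur = conn.cursor()

cur.execute("SELECT basket_value, delivery_minutes, churned FROM customer_features")
df = cur.fetch_pandas_all()  # returns a pandas DataFrame

# Snowflake returns unquoted column names in upper case.
X = df[["BASKET_VALUE", "DELIVERY_MINUTES"]]
y = df["CHURNED"]
# X and y can now be fed to scikit-learn, SageMaker, TensorFlow, etc.

conn.close()
```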

What are the alternatives to Snowflake?

Although Snowflake was the first player to define itself as a provider of a Data Warehouse on the cloud, several products from the Big Cloud providers themselves have emerged to compete with it.

Amazon Redshift

Redshift is the Data Warehouse product offered by Amazon Web Services. Like Snowflake, Redshift is designed to be fast and scalable.

However, it is based solely on a Shared-Nothing architecture, unlike Snowflake, which combines Shared-Disk and Shared-Nothing.

In terms of pricing, Redshift bundles storage and computing power, but also allows users to scale computing power on its own.

Customers opt for a Storage & Computing Power package that corresponds to their needs and can exceed this package by paying a supplement per use if a specific need arises.

Redshift is significantly less expensive than Snowflake and even offers significant discounts for a one-year or three-year commitment and prepayment.

Google BigQuery

BigQuery is the Data Warehouse offering from Google Cloud Platform.

Like Google Cloud Functions, BigQuery is based on a serverless architecture, so its users do not have to worry about the computing power or the storage space of their cluster: these adapt automatically to the queries and data sent to it.

Oracle Autonomous Data Warehouse

The Oracle Autonomous Data Warehouse offering, like Google BigQuery, is intended to be fully automated, from deployment to data security, including connections to data sources and scaling up if necessary.

Unlike Snowflake and the competitors listed above, OADW is the only one available On-Premise for customers wishing to continue operating such infrastructure.

How is Snowflake positioned in the Data Warehouse market?

Although it pioneered the cloud Data Warehouse, Snowflake is not the market leader.

According to a study by enlyft, SAP Business Warehouse, an on-premise solution, is the leader with nearly 35% market share.

Then come the cloud players, led by Redshift. The AWS solution is appealing due to its low advertised price and its easy integration with the entire AWS ecosystem.

Snowflake comes after Redshift with 14% market share, ahead of Google BigQuery, which holds only 12%.

How do companies use Snowflake?

The whole point of Snowflake, and of a Data Warehouse project in general, is to be able to use the data to produce conclusions about the business, formulate hypotheses and allow the company to act on them.

Snowflake helps Deliveroo offer more choice to its customers

If you don't already know them, Deliveroo is a British startup that lets consumers have meals delivered to their homes from their favorite restaurants. Facing extremely tough competition from Uber Eats, Deliveroo chose to use Snowflake to make strategic decisions in response.

In 2017, Deliveroo wanted to increase the choice available to its customers by providing restaurateurs with kitchens to prepare meals closer to customers. For the investment to be profitable, both for the restaurateur and for Deliveroo, it was necessary to be able to identify the areas where the demand was strong for a certain type of dish in delivery and where the current supply of restaurants was insufficient.

Thanks to Snowflake, Deliveroo was able to use internal data to find out which dishes were already available in a given area. They were also able to capitalize on external data that they dumped into their Data Warehouse, to get more information on the demographics and road traffic in the same area.

Deliveroo, which previously used an AWS Redshift Data Warehouse, loaded its Snowflake data warehouse with over 35GB of data in less than 3 months. This migration to Snowflake was carried out to meet a growing need for simultaneous queries and peak loads from analysts and various Business Intelligence applications.

Hubspot

Hubspot is a SaaS publisher that provides its customers with a CRM to help them grow their business by allowing them to attract prospects via promotional content.

Before using Snowflake, analytical queries ran on the same machines as data ingestion, which caused resource access conflicts.

Because their analysis needs are highly variable, the required computing power was not used evenly over time. With their traditional database systems, Hubspot could not reduce the size of their cluster without several manual actions, and ended up paying for a much larger cluster than they needed, just in anticipation of the occasional case where they would actually need all the available computing power.

Hubspot was won over by Snowflake because it now has adaptive computing power that can scale up when the need arises and does not have to be paid for when it is not used.