I don’t care who doubts our ability to succeed, as long as it’s nobody on my team.
– Kobe Bryant, Los Angeles Lakers Guard
Prepare For Takeoff
Everyone, these days, is jettisoning on premise storage and sending their data to the cloud. The reasons are varied, but generally come down to two factors: time and cost. Cloud storage, from any of the major providers like Amazon or Microsoft, can cost less than $0.02 per GB per month. Compare that to Apple’s revolutionary magnetic hard drive that debuted in 1981. It had 5MB of storage and cost $3,500, which is over $700,000 per GB. Ok, there was no monthly fee, but I digress. 😉 Time is usually how long it takes to get a new server, file share, or document repository installed in your corporate headquarters vs simply storing new data in Amazon S3 or Microsoft Azure. Or perhaps it’s the amount of IT resources that are needed to keep depreciating and out classed data centers up and running.
There are many advantages to cloud storage, which won’t be rehashed here. If you need a refresher (or convincing), this site may come in handy: http://www.enterprisestorageforum.com/storage-services/cloud-storage-vs.-on-premise-11-reasons-to-choose-the-public-cloud-1.html
For the moment, let’s assume that you have decided to move your data to the cloud. This article will help you decide where to move it, how best to do so, and an ideal way to keep it updated and fresh.
Where To Go?
In today’s cloud landscape, there are two players: Amazon and Microsoft. There are others, such as Google, but Amazon Web Services (AWS) and Microsoft Azure hold the keys. In addition to storage, they both offer services such as Virtual Machines, Caching, Load Balancing, REST interfaces, Web hosting and more, which can handle your other applications, should you need to migrate them to the cloud in the future. There are pros and cons to each, but both will handle your data securely, provide timely and cost effective access, and transparently maintain ready to use backups in case of unforeseen events. Let’s break them both down:
AWS S3 (Simple Storage Service) is, as the name states, pretty simple. It has a free tier with 5GB of data and then breaks down into 3 categories – Standard, Infrequent Access (IA), and Glacier. If you just need to stash old data in the cloud and have no idea how it will be used in the future, use IA or Glacier for extremely cheap storage. Glacier is only $0.004 per GB vs Standard at $0.023 per GB per month (US West Region). The trade of with Glacier and IA is that it takes a little longer to get at the data you want to use, anywhere from a few minutes to several hours. Data can be moved up down from Standard, IA, and Glacier tiers so, for instance, those old application logs that no one was using can quickly be made available for reporting when needed.
Standard storage is what most people use for quick access to data. For the first 50TB per month, the price is $0.023 per month (US West Region). Anything can be stored here, such as images, binary files, text data, et. AWS S3 uses “buckets” to contain data, which can have an unlimited number of object. Each object within a bucket is limited to 5TB in size. For a breakdown on AWS S3 pricing, go here: https://aws.amazon.com/s3/pricing/.
We’ll discuss how to migrate data to S3 a bit later. For now, know that access to your S3 data is through the AWS web console and a variety of APIs, such as the AWS S3 API, the s3:// or s3n:// file protocols, et. AWS S3 is secure by default, with only the bucket / object creators having initial access to the data. Permission is granted via IAM roles, secret / access keys, and other methods that are out of scope for today. A good primer for S3, including security, can be found at the S3 FAQ: http://aws.amazon.com/s3/faqs/.
Azure Storage has a lot more options that AWS S3. This can be confusing at first but also offers the most in terms of flexibility and redundancy for your data. There is a small free tier, and as a counterpart to AWS Glacier, Azure Storage offers “Azure Cool Blob Storage” for your archival, compliance, or other “don’t use but can’t throw away” data. Prices are usually less than $0.01 per GB per month in some regions.
Unlike S3, Azure Storage comes in several flavors of redundancy, so one can choose how many backups of their data exist and how easily they are accessed. If you have easily replaceable data, say from a 3rd party API or data source, then choose the cheaper LRS (Locally Redundant Storage) option which will keep local copies of your data in a single Azure data center. Need a durable, always available, “a crater can hit my data center yet I’m still OK” option? Then RA-GRS (Read-Access Geographically Redundant Storage) is the preferred option. This will ensure that copies of your data are also maintained at a second data center hundreds of miles away, yet always available for easy access. Middle ground hot and cool options exist as well. For a breakdown of Azure Storage pricing, please visit here: http://azure.microsoft.com/en-us/pricing/details/storage/blobs/.
Note: AWS S3 is functionally equivalent to Azure Storage GRS (Geographically Redundant Storage), so use this option when comparing prices.
Azure Storage uses “containers”, instead of buckets like S3, and containers can contain blobs, tables, or queues (discussed below). There is no limit to object size, yet each container can “only” hold 500TB. However, there can be multiple containers per storage account. Access to data is through the Azure Portal, Azure Storage Explorer, PowerShell, the Azure CLI (command line interface), and other APIs and file protocols.
Aside from regular blobs, there are a couple different types of data that can be stored in containers: tables and queues. Think of both as convenient layers laid over the raw data, to ease read and write access for different applications and scenarios.
An Azure Storage (AST) is essentially a NoSQL key-value store. If you’re not sure what that is, then you likely don’t need it. 🙂 NoSQL data stores support massive scalability yet the dataset and server sharding that is normally necessary (and a headache) for this is handled for you. AST, like other NoSQL datasets, supports a flexible schema model which allows one to keep customer data, application logs, web logs, and more – all with different schemas – in the same table. Learn more here: http://azure.microsoft.com/en-us/services/storage/tables/.
Azure Storage Queue (ASQ) provides cloud-based messaging between application components. Having a central messaging queue is critical for different applications and parts of applications that are often decoupled and need to scale independently of one another. This would only likely be needed if you have applications which currently store message data on-premises, but needs to be migrated to the cloud. The size of each message is limited to 64K, but there can be an almost unlimited number of messages (up to the container limit). Learn more here: https://azure.microsoft.com/en-us/services/storage/queues/.
Another option unique to Azure in general, is the ability to link your corporate Windows Active Directory or Office 365 Active Directory with the Azure Active Directory (Azure AD). This feature, called Azure AD Connect, allows SSO (single sign on) between on-prem and cloud based applications and services. It is easy, for example, to quickly setup permissions and roles that manage access to essential services and storage across your organization.
For a rundown on security and encryption options, please visit: http://docs.microsoft.com/en-us/azure/storage/storage-security-guide.
This is great for raw storage, but now what about my DATABASE?
Almost every organization has relational data. While this can be extracted and placed into raw storage, it’s often easier to just lift it entirely into the cloud and go from there. Both Amazon and Azure platforms support numerous relational database hosting options, from Azure’s SQL Database to AWS’s Relational Database Service and many options in between. We’ll look at some of them here:
AWS Relational Database Service (RDS) allows easily deploying and scaling relational databases in the cloud. It frees one from the hassle of managing servers, patching, clustering and other IT heavy tasks. Also, it supports six different flavors of database: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server. One unique option is the ability to offload read traffic to one or more “Read Replicas” and thus increase availability and performance of your primary database instance. Database security differs depending upon your database flavor (some support encryption at rest, et) but the AWS RDS itself can be secured by being deployed within an organization’s AWS VPC (Virtual Private Cloud). In my opinion, AWS as a service has a simpler approach to security than Azure, because more AWS services can be setup behind the VPC, which acts as a gateway to sensitive data and applications. Learn more about AWS RDS here: http://aws.amazon.com/rds/.
AWS Redshift is Amazon’s data warehouse in the sky. Essentially a supersized PostgreSQL, it provides scalable, cost-effective SQL based storage that includes queries that can run both on S3 (via Redshift Spectrum) and Redshift. It stores data in a columnar-based fashion, giving fast query times over massive amounts of data. It might be overkill if your dataset is small, but if you have petabytes (or exabytes) of structured data that need analyzing quickly, Redshift can likely handle it. Start with AWS Redshift’s home page here: http://aws.amazon.com/redshift.
AWS Athena is a new service which attempts to blend the raw and relational data worlds together. Simply point it at an S3 bucket, define a schema, write your SQL query, and go. You only pay for the queries run on the raw data and the schema definition can be reused with other queries, modified for another run, or simply tossed away when finished. Also, Athena can turn around and store the results back into AWS S3 or be used by another workflow to furnish data. By not having a permanent relational layer, data workflows and ETLs have less steps and less points of failure. Learn more about AWS Athena here: http://aws.amazon.com/athena/.
Azure SQL Database is as simple as it gets: An SQL Server database as a service. No Virtual Machines or licenses to manage, no patching, lots of redundancy, and fast performance. One can pay per database or in “elastic pool database units” (EPDUs) which can be spread resources across many databases. Azure SQL database is meant for small to medium size databases (up to 50GB) that can operate independently from one another, as in a multi-tenant application or reporting solution. If you have a lot of SQL data to migrate, it is a good idea to break it up and store it along date or business lines or make the jump to Azure SQL Data Warehouse, a larger service meant for enterprise intentions (see below). Connections are made through a standard ODBC / JDBC connection string, with the URI as the service endpoint for your database.
Keep in mind, that this is not the same as full SQL Server. Since there is no real “server” involved, most system stored procedures (DBCC, et) and distributed transactions won’t work and SQL Server peripheral services, such as SQL Server Reporting Services (SSRS) and SQL Server Integration Services (SSIS) are not included. These voids can be filled by other services in the Azure stack; however, or by running a full copy of SQL Server in an Azure Virtual Machine (see below). Although a look at Azure Analytics is out of scope here, you should know that Azure supports an entire range of analytical services which can consume data from Azure SQL databases. Learn more about Azure SQL Database here: http://azure.microsoft.com/en-us/services/sql-database.
Azure SQL Data Warehouse (ASDW) is for enterprise grade data warehouse deployments. Like AWS Redshift, one can scale compute and storage independently and prioritize analytical workloads or long term storage. It also lets you pause the compute side entirely, turning ASDW into an archival data store when analytics aren’t needed. It also leverages Microsoft’s PolyBase service, which allows queries to execute against other data sources, such as Azure Data Lake, Azure HDInsight, or even another on-prem SQL Server data warehouse. Unlike Azure SQL database, Azure SQL data warehouse stores data in a columnar-based format, for maximum performance and scalability. Learn more about the Azure SQL Data Warehouse here – http://azure.microsoft.com/en-us/services/sql-data-warehouse.
Rolling Your Own – AWS and Azure Virtual Machines
Of course, if you need advanced SQL Server features, such as SQL Server Analysis Services (SSAS), or want to run a completely different database type, you can always spin up a Virtual Machine (VM) and install the relational database software there. Many VM images from both AWS and Azure come with SQL Server, Oracle, or other software preinstalled, so all you need is your licensing information. Also, some images include the cost of the database software, effectively renting the license to users for a monthly fee. This can be useful, for example, if you would like to try out the features of SQL Server Enterprise before making a full purchase. Virtual Machines are also useful when ETLs and data workflows need to also be migrated to the cloud, as the VM can simply host the software required to run it.
NOTE: When you go the VM route, you are usually responsible for hardware provisioning/formatting, software patches, service upgrades, and maintaining secure access (through firewall rules, etc) to your system. A good pro/con for evaluating Azure SQL Database vs SQL Server on Azure VMs can be found here: http://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas-vs-sql-server-iaas.
Now, how do I get it up there?
Alright, you’ve chosen your cloud platform, you know what data to move… or do you? How do you prioritize what goes and what stays? Stay tuned for The Great Migration – Part II, where I’ll cover next steps in how to lift your data into the cloud.
Hope this helps and happy migrating! Feel free to email me at jasonc@esagegroup.com with any additional questions!
Sincerely, J’son