Posted:
Wow, we had a great response to our Data Warehouses pricing post - thanks for your comments and feedback. One of the more consistent messages we heard was - “More please,” so we’re listening to what you’ve asked and are here again with comparisons to more data warehouses.

As a brief reminder, a data warehouse is a tool that allows you to store large amounts of data and let you run analytics and reports on it. We looked at two data warehouses previously - Google BigQuery and a generic commercial data warehouse. Both of these can run in a public cloud, store large amounts of data, and run reports and analytics. If you would like a refresher on how we defined a data warehouse, or what makes running a data warehouse in the cloud different, please see the first section of our previous post. Many of you asked if we could provide more concrete examples - so we’re going to do that here.

As with our previous pricing post, we hope to show you why Google BigQuery is still an excellent choice for a data warehouse in the cloud.

An Example Cluster

In our last example, we looked at data warehouses storing 1 petabyte of data. To be consistent with that comparison, we will use that same size today. Today we will compare Google BigQuery, Amazon Redshift, and an open source query engine on top of Apache Hadoop. Why include Hadoop? HDFS run by Hadoop is capable of storing huge amounts of data and there are several tools capable of analyzing data, which we will discuss below. Once again, there are more we could discuss, but this now includes products that are used by a wide variety of customers and also represent varied product types that are commonly labeled “Data Warehouse.” For all of these except Amazon Redshift, which is only available on Amazon Web Services, we will look at the prices of running these on Google Cloud Platform (again, check out our pricing post for why this can benefit you). These three systems also meet our basic requirements that we established in the previous post - store large amounts of data; run analytics and reports on them; and run full versions in a public cloud.

Once again, let’s look at how these data warehouses are priced. As a reminder, BigQuery has a pure consumption-based model - you are charged only for what you consume, both for storage and compute. The price for storage is $0.02/GB/month, and the price for data analyzed is $5/TB analyzed at the time of this post.

Amazon Redshift is priced for the amount of data you store, by number of nodes, with certain node types having a certain amount of storage. There are some more complexities in how you can price a Redshift cluster, but we'll go into more detail when we discuss pricing below.

Finally, we have Apache Hadoop. Hadoop is open source software, so there is no software license cost (unless you use a managed Hadoop provider such as Cloudera or Hortonworks which we will not examine here). However, you will need to pay for the infrastructure to run the Hadoop cluster, and we will look at what that cost would be as well. The query engine run on top of the Hadoop cluster will also be open source. We won’t go into detail about which one you could use, as it doesn’t affect the pricing, but there are a number of options, such as Apache Hive, Cloudera Impala, Apache Spark SQL (formerly Shark), or Presto.

Sizing Our Clusters

We will use the same parameters to price BigQuery as we did in our previous post. That is we will estimate 100 users, 40 queries each per day, with 100 GB average query size. That comes out to 4000 queries per day and 12,000 TB data analyzed per month (simplifying to 30 day month).

In order to correctly price Redshift, we need to take a few factors into consideration. First, there are 4 current types of Redshift nodes: ds2.xlarge with 2 TB and ds2.8xlarge with 16 TB of spinning disk storage; and the dc1.large with 0.16 TB and the dc1.8xlarge with 2.56 TB of SSD storage. You can only create clusters of 1 node type. Next, you can only create clusters of up to 32 nodes of each of the smaller node types (dc1.large and ds2.xlarge). The larger node types (ds2.8xlarge and dc1.8xlarge) allow you to create clusters of up to 128 nodes.

Additionally, each of these nodes can be run at “on demand pricing” or with “Reserved Instance pricing” (see our previous pricing post for a more detailed explanation on Reserved Instance pricing). Reserved Instances are an alternate payment model, where you can make a commitment to run infrastructure for a longer period of time (either 1 or 3 years), and opt to pay some portion or all of the costs up front. When you purchase a Reserved Instance, you pay a lower “hourly” cost even though you’re committed to paying for the full time duration. We will examine the pricing for the ds2.8xlarge (spinning disk), as well as comparing the price on demand, with 1 year Reserved Instances, and 3 year Reserved Instances. For reference, according to Redshift’s FAQs, the ds2.8xlarge has 36 virtual cores and 260 GB of RAM. In order to have 1 PB of storage, a ds2.8xlarge cluster will need 63 instances, which is just over 1 PB of storage.

We did look at the pricing for the SSD dc1.8xlarge nodes, but we won’t include those numbers for 3 reasons here: 1) the SSD clusters have a maximum size of 328 TB, well below our example 1 PB data warehouse; 2) we believe an SSD based data warehouse is generally used for more specialized use cases that involve smaller data sets; 3) finally, the purpose of this post is to clarify pricing in cloud data warehouses. The way Redshift prices its nodes is the same for all node types so we’ll only need to examine one node type.

In creating a Hadoop cluster, we have the ability to create clusters of any size, using any type of instance. However, to keep the comparison consistent with Redshift, we’ll use the n1-highmem-32 instance. Again, for the same reasons as with the generic data warehouse and Redshift, we will limit our comparison to just a spinning disk based cluster, rather than price an SSD cluster as well.

Pricing

Please note we completed these calculations on June 22, 2015, and have included the output prices in this post. Any discrepancies are likely due to pricing or calculator changes following the publishing of this post.

Additionally, Amazon recently changed the pricing structure of Redshift and as of writing, has not updated the calculator to reflect all of these changes. In any places where the calculator can’t be used, we have included the calculations we used to arrive at any numbers.

BigQuery estimate: Monthly: $79,999.50
This includes $20,000 for storage and $59,999.50 for query costs.


Redshift on demand estimate: Monthly: $315,588.80

Already you can see BigQuery on demand is much cheaper than Redshift running on demand. If you don’t want to make any long term commitments to AWS, you’ll pay nearly four times the cost!

Note, even though the calculator estimate is for the previous generation node type (dw1.8xlarge), the pricing for on demand is the same as the on demand rate for the ds2.8xlarge.

Pricing a Hadoop cluster, given our assumptions, is relatively straightforward, as it’s exactly the same as the generic data warehouse estimate, except without the license costs.

Hadoop estimate: Monthly: $143,017.60

This price makes running Hadoop cheaper than Redshift on demand. This is actually partially due to the Sustained Usage Discounts you get running on Google Compute Engine. This discount is something GCE will automatically do for you as you run your virtual machines. However, what you gain in pricing and lack of a license fee, you lose in operational cost. A Hadoop cluster will require a hands on operational team to maintain, unlike a managed solution.

We also examined how much it would cost if you were to purchase Reserved Instances for Redshift. Here are the costs for our Redshift data warehouse with Amazon’s Reserved Instance pricing:

Redshift 1 year, no upfront, Reserved Instance estimate:
Upfront: $0.00
Monthly: $248,346.00 (3,942 * 63)
Effective monthly: $248,346.00

Redshift 1 year, partial upfront, Reserved Instance estimate:
Upfront: $1,260,000.00 (20,000 * 63)
Monthly: $79,128.00 (1,256 * 63)
Effective monthly: $184,128.00 ((1,260,000 / 12) + 79,128)

Redshift 1 year, all upfront, Reserved Instance estimate:
Upfront: $2,164,680.00 (34,360 * 63)
Monthly: $0.00
Effective monthly: $180,390.00 (2,164,680 / 12)

All of these options drop the monthly cost. In the case of the latter two options, significantly over the on demand rate, with the full upfront payment option being slightly more than half the effective monthly cost. Since the AWS calculator doesn’t have all of these payment options yet, we’ve included the calculations. However, even with spending more than two million dollars up front, it is still more than double the cost of running BigQuery.

Redshift 3 year, partial upfront, Reserved Instance estimate:
Upfront: $1,512,000.00 (24000 * 63)
Monthly: $41,958.00 (666 * 63)
Effective monthly: $83,958.00 ((1,512,000 / 36) + 41,958)

Redshift 3 year, all upfront, Reserved Instance estimate:
Upfront: $2,824,920.00 (44,840 * 63)
Monthly: $0.00
Effective monthly: $78,470.00 (2,824,920 / 36)

The price drops quite a bit more with the 3 year Reserved Instances - less than half of what the 1 year all up front Reserved Instance costs. Again, the AWS calculator doesn’t have these payment options yet, so we’re including our calculations. And, as it turns out, the 3 year Reserved Instance, paying all upfront, is actually about $1500 cheaper than BigQuery per month.

Using Reserved Instances brings the prices down closer to BigQuery, but it still can be more expensive and brings a number of disadvantages to your business:
  1. Even for the cheapest option (3 years, all upfront), you would have to pay nearly three million dollars up front. As discussed in our previous pricing post, capital costs are expensive. Even if you can afford it, most businesses pay ~7% per year cost of capital. This would add an additional $197,744 to the 3 year upfront cost of the Reserved Instances. Factoring that in to the monthly cost of Redshift would bring the effective monthly cost to nearly $84,000 - more expensive than BigQuery ((2,824,920 * 1.07) / 36 = $83,962.90).
  2. Redshift is less than 3 years old and has already introduced two instance types (the dc1 and the ds2 instances). That’s 3 instance families in 3 years. Redshift Reserved Instances lock you into current pricing, technologies, and instance types. You are unable to take advantage of future price cuts, new instance types, or new technologies. You could mitigate this somewhat by locking in for 1 year Reserved Instance but pay more than double the price of the 3 year Reserved Instance and nearly four times the price of BigQuery.
  3. Purchasing a Reserved Instance is a manual process. In order to maintain your price advantages with Reserved Instances, a person must manually make sure to renew the Reserved Instance purchases at the right time, or risk paying the more expensive on demand rate - nearly double the 1 year Reserved Instance cost and nearly quadruple the 3 year Reserved Instance cost. So if you want to have a fully automated, fully managed product, you have a choice: you can either pay the expensive on demand rate, or break your automation by requiring a manual step to purchase and maintain Reserved Instances. This would mean Redshift couldn’t ever be a fully managed product for anyone who wanted to gain a price advantage. Increasing automation and taking advantage of fully managed services are among the most attractive benefits for businesses moving to the cloud.

Even with the 3 year Reserved Instance, including capital costs, Redshift is still 5% more expensive than BigQuery. And this is with paying more than $2,800,000 up front. Over the course of a year it will cost more than $47,000 over BigQuery and over 3 years it will be more than $142,000 more expensive than BigQuery. Additionally, as BigQuery increases performance you won’t be locked in to a lower performance tier.

Summary

Once again, we hope we took some of the mystery and difficulty out of pricing a data warehouse in the cloud. Even in comparison to two different data warehouses, you can see how much money BigQuery can save, while still scaling to whatever your needs are, as we showed in our first post.

Do you want more BigQuery? Try the demo, read our extensive documentation, or read an excellent whitepaper, written by Kazunori Sato.

At Google, we firmly believe that you can have the best products for our customers while continuing to aggressively drive down the costs of the cloud. We recently announced price drops for our virtual machines and announced a class of virtual machines - Preemptible Virtual Machines - that let you take advantage of 70% price savings for periodic workloads. This is in addition to the previous posts in this series on Virtual Compute, Local SSD, and our first Data Warehouses pricing post. Once again, if have any questions or comments on this topic or any other pricing topics in the cloud, let us know! Reach out!

- Posted by Peter-Mark Verwoerd, Solutions Architect

Posted:
Unit testing your code is a best practice in software development. By running small automated tests to individually verify each unit of code, such as a module, class or function, you can catch and debug errors early in your development process.

Google App Engine provides strong support for unit testing with Local Unit Testing tools, currently available for Python, Java, and Go.  With local unit testing, you run the unit tests within your development environment, without calling any remote components. The Local Unit Testing tools offer service stubs to simulate many App Engine services. You can use stubs as needed to  exercise your application code in local unit tests. You can also use open source packages like NoseGAE to further simplify the process of writing local App Engine unit tests.

Local unit tests using service stubs, however, handle routing and login constraints differently than code that calls the services directly. You have to keep these differences in mind when you design your tests.

When a customer had problems unit testing an App Engine cron handler, I was, of course, eager to help.  The cron handler was defined by the following  entry in app.yaml:


- url: /crontask.*
 script: thecron.app
 login: admin_only


The App Engine Cron Service recommends that you limit access to URLs used by scheduled tasks to administrator accounts. That’s what the login: admin_only setting does.  

The customer wanted to check that non-admin users would indeed be blocked from those URLs.

The customer started with a placeholder implementation for the cron handler, thecron.py:


import webapp2
import time


class TestCronHandler(webapp2.RequestHandler):
   def get(self):
       self.response.headers['Content-Type'] = 'text/plain'
       self.response.write('Cron: {}'.format(time.time()))


app = webapp2.WSGIApplication([
   ('/crontask/test', TestCronHandler),
], debug=True)


Then they wrote a unit test , ut.py.
Note: the following test code uses the mock module. Run pip install mock to make the mock module available to your local Python 2.7 installation.


import sys


# configure unit testing for the case
# where the App Engine SDK is installed in
# /usr/local/google_appengine
sdk_path = '/usr/local/google_appengine'
sys.path.insert(0, sdk_path)
import dev_appserver
dev_appserver.fix_sys_path()


import mock
import unittest


import webapp2
from google.appengine.ext import testbed


import thecron

class CronTestCase(unittest.TestCase):


 def setUp(self):
     self.testbed = testbed.Testbed()
     self.testbed.activate()
     self.testbed.init_user_stub()


 def _aux(self, is_admin):
     self.testbed.setup_env(
         USER_EMAIL = 'test@example.com',
         USER_ID = '123',
         USER_IS_ADMIN = str(int(bool(is_admin))),
         overwrite = True)
     request = webapp2.Request.blank('/crontask/test')
     with mock.patch.object(thecron.time,
                            'time', return_value=12345678):
         response = request.get_response(thecron.app)
     return response


 def testAdminWorks(self):
     response = self._aux(True)
     self.assertEqual(response.status_int, 200)
     self.assertEqual(response.body, 'Cron: 12345678')

if __name__ == '__main__':
   unittest.main()


This first test, testAdminWorks, passed with flying colors, so the user added a second one:


 def testNonAdminFails(self):
     response = self._aux(False)
     self.assertEqual(response.status_int, 401)


But the new test failed--the status was 200 (success), not 401 (forbidden) as expected.

The problem was that unit tests do not go all the way back to app.yaml to get the complete routing and login constraints as would an application calling the services, not the unit-testing stubs. This test reached into the secondary routing in thecron.py, which doesn’t impose login constraints. The app.yaml routing and everything else, such as mime-types, login constraints, etc., get tested only by tests calling the services instead of the unit-testing stubs.

So the customer added an admin-checking decorator, needs_admin, to thecron.py:


def needs_admin(func):
   def inner(self, *args, **kwargs):
       if users.get_current_user():
           if not users.is_current_user_admin():
               webapp2.abort(401)
           return func(self, *args, **kwargs)
       return self.redirect(users.create_login_url(request.url))
   return inner


Then they decorated CronHandler.get with it:


class CronHandler(webapp2.RequestHandler):
   @needs_admin
   def get(self):  # etc, as before


Now the local unit tests calling the stub version of the cron service work, but an end-to-end test using the actual App Engine Cron Service functionality fails with status 401. What’s going on?

Long story short -- the App Engine Cron Service doesn’t log-in any user as it visits the appointed URLs -- therefore, in the handler, users.get_current_user() returns None.  Instead, the Cron Service sets a special request header -- X-AppEngine-Cron: true. This is a header that application code can fully trust, since App Engine removes such headers if they’re set in an external request.

All that the customer needed, to get their unit tests, end-to-end tests, local development application, and deployed application, working, was a slight modification to their needs_admin decorator:


def needs_admin(func):
  def inner(self, *args, **kwargs):
       if self.request.headers.get('X-AppEngine-Cron') == 'true':
           return func(self, *args, **kwargs)
       if users.get_current_user():
           if not users.is_current_user_admin():
               webapp2.abort(401)
           return func(self, *args, **kwargs)
       return self.redirect(users.create_login_url(request.url))
   return inner


The new if statement handles the cron job case, mocked or not. The second if, as before, ensures that non-admin (or non-logged-in) users are blocked.

The moral is, do invest time and care in unit-testing.  It ensures the present and ongoing quality of your code. The App Engine Local Unit Testing tools and open-source add-ons like NoseGAE simplify the process of writing and running unit tests.

- Posted By Alex Martelli, Cloud Technical Support