Posted:
Many applications need to communicate reliably, and in real time, with a variety of other systems, which are often distributed. To make sure nothing is lost in translation, you need a flexible communication model that can deliver messages to multiple systems simultaneously.

That’s why we’re making the beta release of Google Cloud Pub/Sub available today, as a way to connect applications and services, whether they're hosted on Google Cloud Platform or on-premises. The Google Cloud Pub/Sub API provides:

  • Scale: offering all customers, by default, up to 10,000 topics and 10,000 messages per second
  • Global deployment: dedicated resources in every Google Cloud Platform region enhance availability without increasing latency
  • Performance: sub-second notification even when tested at over 1 million messages per second

We designed Google Cloud Pub/Sub to deliver real-time and reliable messaging, in one global, managed service that helps developers create simpler, more reliable, and more flexible applications. It's been tested extensively, supporting critical applications like Google Cloud Monitoring and Snapchat's new Discover feature. Some common use cases include:

  • Integrated messaging between components of an application. For example, when processing an office transfer in an HR system, developers need to control the distribution of updates to the company directory, security badging, the moving company, payroll, and many other services.
  • Robust data collection from smart devices, such as mobile device endpoints. Developers can integrate sensor data from the endpoints with real-time data analysis pipelines, automatically routing the data streams to an application.
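
To make the model concrete, here is a minimal sketch of the publish and subscribe sides using the Pub/Sub client library for Python. The project, topic, and subscription names are hypothetical, and this illustrates the messaging model rather than the exact beta API surface.

```python
import concurrent.futures

from google.cloud import pubsub_v1

# Publish an update from the HR system (all names below are illustrative).
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "hr-transfers")
future = publisher.publish(topic_path, b"office transfer", employee_id="1234")
print("published message", future.result())  # blocks until the server acks

# Each downstream service (badging, payroll, ...) reads from its own subscription.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "badging-service")

def callback(message):
    print("received:", message.data, dict(message.attributes))
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # process messages for up to 30 seconds
except concurrent.futures.TimeoutError:
    streaming_pull.cancel()
```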

You can activate Google Cloud Pub/Sub today from the APIs & auth section of the Google Developers Console and monitor key metrics with Google Cloud Monitoring dashboards. Please share your feedback directly or join our mailing list for updates and discussions.

-Posted by Rohit Khare, Product Manager

Posted:

Introduction

What’s remarkable about April 7th, 2014 isn’t what happened that day. It’s what didn’t.

That was the day the Heartbleed bug was revealed, and people around the globe scrambled to patch their systems against this zero-day issue, which came with already-proven exploits. On other public cloud platforms, customers were impacted by rolling restarts because VMs had to be rebooted. At Google, we quickly rolled out the fix to all our servers, including those that host Google Compute Engine. And none of you, our customers, noticed. Here’s why.

We introduced transparent maintenance for Google Compute Engine in December 2013, and since then we’ve kept customer VMs up and running as we rolled out software updates, fixed hardware problems, and recovered from unexpected issues that have arisen. Through a combination of data center topology innovations and live migration technology, we now move our customers’ running VMs out of the way of planned hardware and software maintenance events, so we can keep the infrastructure protected and reliable, without your VMs, applications or workloads noticing that anything happened.

The Benefits of Transparent Maintenance

Our goal for live migration is to keep hardware and software updated across all our data centers without restarting customers’ VMs. Many of these maintenance events are disruptive: they require us to reboot the host machine, which, in the absence of transparent maintenance, would mean downtime for customers’ VMs.

Here are a few of the issues we expected to address with live migration (and we have since encountered all of them):

  • Regular infrastructure maintenance and upgrades
  • Network and power grid maintenance in the data centers
  • Bricked memory, disk drives, and machines
  • Host OS and BIOS upgrades
  • Security-related updates, with the need to respond quickly
  • System configuration changes, including changing the size of the host root partition, for storage of the host image and packages

We were pleasantly surprised to discover that live migration helped us deliver a better customer experience in the face of a much broader array of issues. In fact, our Site Reliability Engineers started using migration as a tool even before it was generally enabled; they found they could easily work around or mitigate potential breakages occurring in production.

Here are some of the unexpected issues that we encountered and worked around with live migration without impacting the running guests:

  • Flapping network cards — Network cards were intermittently failing. We were able to retry the migrations until they succeeded, and this worked even with partially failing NICs.
  • Cascading battery/power supply issues — Overheating batteries were heating up the neighboring machines. We were able to migrate the VMs away before bringing down the machines to swap out their batteries.
  • A buggy update pushed to production — We halted the rollout, but not before it reached some of our production machines (it didn’t manifest in our canary environment). The buggy software would’ve crashed VMs within a week. Instead, we migrated the VMs on the affected machines to other hosts that didn’t have the buggy software.
  • Unexpected host memory consumption — One of our backend components consumed more memory than we had allocated and threatened to OOM (out of memory) the VMs. We migrated some VMs away from the overloaded machines and avoided the OOM failures while patching the backend system to ensure it could not overrun its allocation.

Transparent Maintenance in Action

We’ve done hundreds of thousands of migrations since introducing this functionality. Many VMs have been up since migration was introduced and all of them have been migrated multiple times.

The response from our customers has been very positive. During the early testing for migration, we engaged with RightScale to see the impact of migrations. After we migrated all their VMs twice, they reported:

“We took a look at our log files and all the data in the database and we saw…nothing unusual. In other words, if Google hadn’t told us that our instances had been migrated, we would have never known. All our logs and data looked normal, and we saw no changes in the RightScale Cloud Management dashboard to any of our resources, including the zone, instance sizes, and IP addresses.”

We worked with David Mytton at ServerDensity to live migrate a replicated MongoDB deployment. When the migration was done, David tweeted:

“Just tested @googlecloud live migration of a @MongoDB replica set - no impact. None of the nodes noticed the primary was moved!”

In fact, Google has performed host kernel upgrades and security patches across its entire fleet without losing a single VM. This is quite a feat given the number of components involved, and given that any one of them or their dependencies can fail or disappear at any point. During the migration, many of the components that comprise the VM (the disks, network, management software and so on) are duplicated on the source and target host machines. If any one of them fails at any point in the migration, either actively (e.g. by crashing) or passively (e.g. by disappearing), we back out of the migration cleanly without affecting the running VM.

How It Works

When migrating a running VM from one host to another, you need to move all of its state from the source to the destination in a way that is transparent to the guest VM and to anyone communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are described below.

The process begins with a notification that VMs need to be evicted from their current host machine. The notification might start with a file change (e.g. a release engineer indicating that a new BIOS is available), with Hardware Operations scheduling maintenance, with an automatic signal from an impending hardware failure, and so on.

Our cluster management software constantly watches for such events and schedules them based on policies controlling the data centers (e.g. capacity utilization rates) and jobs (e.g. number of VMs for a single customer that could be migrated at once).

Once a VM is selected for migration, we provide a notification to the guest that a migration is imminent. After a waiting period, a target host is selected and the host is asked to set up a new, empty “target” VM to receive the migrating “source” VM. Authentication is used to establish a connection between the source and target.
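
A guest can even watch for that notification itself. Here is a minimal sketch that polls the Compute Engine metadata server’s maintenance-event value; treat the exact value strings as an assumption and check the current documentation for your environment.

```python
import urllib.request

# Compute Engine metadata key that reports impending host maintenance.
URL = ("http://metadata.google.internal/computeMetadata/v1/"
       "instance/maintenance-event?wait_for_change=true")

def watch_maintenance_events():
    """Block until the maintenance-event value changes, then report it."""
    while True:
        req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
        event = urllib.request.urlopen(req).read().decode()
        if event != "NONE":  # value strings here are an assumption
            # e.g. flush caches, checkpoint in-memory state, drain connections
            print("maintenance imminent:", event)
```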

There are three stages involved in the VM’s migration (a simplified code sketch follows the list):

  1. During pre-migration brownout, the VM is still executing on the source, while most state is sent from the source to the target. For instance, we copy all the guest memory to the target, while tracking the pages that have been re-dirtied on the source. The time spent in pre-migration brownout is a function of the size of the guest memory and the rate at which pages are being dirtied.
  2. During blackout, the VM is paused for a very brief moment during which it is not running anywhere, and all the remaining state required to begin running the VM on the target is sent. We go into blackout when sending state during pre-migration brownout reaches a point of diminishing returns. We use an algorithm that balances the number of bytes of memory being sent against the rate at which the guest VM is dirtying pages, amongst other things.
  3. During post-migration brownout, the VM executes on the target. The source VM is present, and may be providing supporting functionality for the target. For instance, until the network fabric has caught up with the new location of the VM, the source VM provides forwarding services for packets to and from the target VM.
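
Putting the stages together, here is a deliberately simplified sketch of the pre-copy loop described above. This is pseudocode-level Python, not Google’s implementation, and every name in it is hypothetical.

```python
def live_migrate(source, target, blackout_threshold, max_rounds=30):
    """Simplified pre-copy live migration: stream memory while the VM runs,
    then pause it briefly (blackout) to send whatever is still dirty."""
    # Pre-migration brownout: the VM keeps executing on the source.
    dirty = source.all_memory_pages()
    for _ in range(max_rounds):
        source.start_dirty_page_tracking()
        target.receive_pages(dirty)                   # copy the current dirty set
        dirty = source.stop_dirty_page_tracking()     # pages re-dirtied meanwhile
        if dirty.total_bytes() < blackout_threshold:  # diminishing returns reached
            break

    # Blackout: the VM runs nowhere for a very brief moment.
    source.pause()
    target.receive_pages(dirty)
    target.receive_state(source.device_and_cpu_state())
    target.resume()

    # Post-migration brownout: the source forwards stray packets until the
    # network fabric has learned the VM's new location.
    source.enter_packet_forwarding_mode()
```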

Finally, the migration is complete, and the system deletes the source VM. Customers can see that the migration took place in their logs.

Our goal for every transparent maintenance event is that not a single VM is killed. To meet that bar, we test live migration with a very high level of rigor. We use fault injection to trigger failures at all the interesting points in the migration algorithm, generating both active and passive failures for each component. At the peak of development testing, we were doing tens of thousands of migrations every day, for months.

Achieving this complex, multi-faceted process requires deep integration throughout the infrastructure and a powerful set of scheduling, orchestration and automation processes.

Conclusion

Live migration technology lets us maintain our infrastructure in top shape without impacting our guest VMs. One of our reviewers even claimed we’ve granted VMs immortality. We’re able to keep our VMs running for long periods of time in the face of regular and unplanned maintenance requirements and in spite of the many issues that arise requiring the reboot of physical machines.

We’re fortunate that some of the recent security issues that have affected other cloud providers haven’t affected us, but if and when a new vulnerability affects our stack, we’ll be able to help keep Compute Engine protected without affecting our customers’ VMs.

-Posted by Miche Baker-Harvey, Tech Lead/Manager, VM Migration

Posted:
Back in November, at Google Cloud Platform Live, we released the beta of Google Cloud Debugger with support for Managed VM-based projects. Today, we’re expanding support to Google Compute Engine-based projects. Now you can simply set a snapshot on a line of code, and Cloud Debugger will return local variables and a full stack trace from the next request that executes that line. Say goodbye to littering your code with logging statements.

Setting up Cloud Debugger on Compute Engine is easy using the Cloud Debugger agent and bootstrap script – try it for yourself.

Cloud Debugger is available on both production and staging instances of your application, and it adds zero overhead to services that aren’t being actively debugged. The debugger adds less than 10ms to request latency when capturing application state, and it doesn’t block or halt execution of your application.

Stay tuned for support for other programming languages and environments. As always, we’d love direct feedback and will be monitoring Stack Overflow for issues and suggestions.

-Posted by Keith Smith, Product Manager

Posted:
We know that even the smallest service disruptions can cause inconveniences on your end, and not being able to find information about what’s happening can be even more frustrating. We do our best to make sure outages don’t happen, but when they do, we think it’s important to be transparent and communicate those issues.

Starting today, you’ll be able to receive status updates for Google Cloud Platform services on the Google Cloud Platform Status Dashboard. This augments the other tools and services, including Google Cloud Monitoring, that monitor your service’s health. We hope that these services will make disruptions a bit more bearable by surfacing the latest information and helping you quickly find workarounds.

As disruptions in service occur (and we’re working very hard to ensure they don’t!), they’re reported on the dashboard with a red bar, which persists until the disruption is resolved. The example below shows a Google App Engine service disruption that occurred last month, on January 28th.

The current status of Cloud Platform services is shown by the column of indicators on the right side of the graph. In the example above, the indicators are all green, which means that all services were operating at normal levels at the time the screenshot was taken. You can also check out the status history of each service: the dashboard always shows the last seven days, and clicking “View Summary and History” shows any of the incidents reported over the past 90 days.

Click on any incident to get a more detailed explanation of what happened. The screenshot below details the above incident, which began at 2015-01-28 17:01 and lasted 26 minutes. As you can see, the green indicators next to the dates and times mean the incident has since been resolved.

Stay up to date by subscribing to the Status Dashboard RSS feed, available from the link at the bottom of the Status Dashboard page. We have also integrated reporting of Cloud Platform service incidents into your Cloud Monitoring events log, enabling you to view incidents alongside your other dashboards and monitoring data.
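
If you’d rather consume the feed programmatically, here is a minimal sketch using the feedparser library. The feed URL below is an assumption; take the canonical one from the link on the dashboard page.

```python
import feedparser  # pip install feedparser

# Assumed feed location; use the link at the bottom of the Status Dashboard page.
FEED_URL = "https://status.cloud.google.com/feed.atom"

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:5]:  # the five most recent incident updates
    print(entry.updated, "-", entry.title)
```
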
While in beta, we’re reporting status for seven services, aggregated across all regions. The Status Dashboard does not replace any of our other channels for communicating outages.

We’d love to hear your feedback and suggestions. Please submit your feedback by clicking the link at the bottom right of the dashboard page.

- Posted by Amir Hermelin, Product Manager

Posted:
In recent weeks we’ve published a number of posts as part of our series on Kubernetes and Google Container Engine. If this is your first foray into these blogs, we suggest you check out past Kubernetes blog posts.

Containers are emerging as a new layer of abstraction that makes it easier to get the most out of your VM infrastructure. In this post, we’ll take a look at the implications of running container-based applications on fleets of VMs, and we’ll talk about why container clusters reduce deployment risk, foster more modular apps, and encourage sharing of resources.

Container Images
The first building block of a containerized application is the container image: a self-contained, runnable artifact that brings with it all of the dependencies necessary to run a particular application component. The VM analogy is an ISO image, which usually contains an entire operating system and everything else installed on the machine. Unlike an ISO, a container image holds only a single application component and can be booted as a running container that shares an OS and host machine with other containers. The same app packaged as containers can be several gigabytes smaller, depending on your Linux distro and the number of VMs it replaces. That means faster deployments and easier management.


Reducing Deployment Risk
You may have experienced deploying your application onto VMs in production, only to find that something has gone horribly wrong and you need to roll back quickly. The code may have worked on the developer’s machine, but once you run your deployment process, you discover that an installation is failing for some unknown reason.

With container images, you can run an offline process (meaning not during deployment) that produces a reusable artifact that can be deployed to a container cluster. In this model, issues that would affect your deployment (like installation failures) are caught earlier, out of the critical path to production. This means you have more time to react and correct any issues, and rolling back is easier and less risky: just replace the container image with the previous version.

Modular App Components
As you’re designing and building your application, it’s tempting to just add more pieces onto your existing VMs. The hard part is unwinding these pieces into modular chunks that can be scaled independently. If you suddenly run out of VM capacity, you can’t deliver a reliable service to your users, so it’s important to be able to add resources quickly without re-architecting.

When you create a Kubernetes container cluster (for example, via Google Container Engine) you’re giving your app logical compute, memory, and storage resources. And it’s really easy to add more. Since your application components don’t care where they run, you have two independent tasks to complete:
  1. Create a fleet of VMs to host your containers
  2. Create and run containers on your fleet of virtual machines
Using containers for your application components and using Kubernetes as an abstraction layer makes your app naturally more modular, as the sketch below shows. Of course, it’s possible to have modularity on VMs with well-designed scripts, but with containers it’s hard not to design modular applications!
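
As a sketch of that separation, the Kubernetes Python client (one of several ways to talk to the API; kubectl works just as well) can list the two layers independently. This assumes an existing cluster and a configured kubeconfig.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # assumes kubectl is already set up for your cluster
v1 = client.CoreV1Api()

# Task 1: the fleet of VMs ("nodes") that hosts your containers.
for node in v1.list_node().items:
    print("node:", node.metadata.name)

# Task 2: the containers (grouped into pods) scheduled onto that fleet.
for pod in v1.list_pod_for_all_namespaces().items:
    print("pod:", pod.metadata.namespace + "/" + pod.metadata.name)
```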

Shared Resources and Forecasting
For your application containers to run together on arbitrary computers, they need an agreement about what they’re allowed to do. Container clusters establish a declarative contract between resource needs and resource availability. We don’t recommend that you use containers as secure trust boundaries, but running trusted containers together and relying on VMs for isolation lets you get the most utilization within your existing VM boundaries.

Another problem you may face is how to forecast capacity across multiple people and applications. Your team can use Kubernetes to share machines while being protected from noisy neighbors via resource isolation in the kernel. Now you can see resources across your teams and apps, aggregating numerous usage signals that might be misleading on their own. You can forecast this aggregate trend into the future for more cost effective use of hardware resources.
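
To illustrate that declarative contract, here is a hypothetical pod spec built with the Kubernetes Python client: the container declares what it needs (requests) and the most it may use (limits), and kernel-level resource isolation enforces the ceiling. All names and numbers are illustrative.

```python
from kubernetes import client  # pip install kubernetes

web = client.V1Container(
    name="web",
    image="gcr.io/my-project/web:v1",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "128Mi"},  # what the scheduler reserves
        limits={"cpu": "500m", "memory": "256Mi"},    # enforced per-container ceiling
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="web", labels={"app": "web"}),
    spec=client.V1PodSpec(containers=[web]),
)
```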

Conclusion
The decoupling of applications from the container cluster separates the operational tasks of managing applications and the underlying machines. Modular applications are easier to scale and maintain. Building container images before deployment reduces the risk that you’ll discover installation problems when it’s too late. And sharing resources leads to better utilization and forecasting, making your cloud smarter.

If you’d like to take container clusters for a spin yourself, sign up for a free trial and head on over to Google Container Engine.

-Posted by Brendan Burns, Software Engineer

Posted:
Many developers containerize their applications so that they can run on any infrastructure; however, it’s still too hard to run containers on a private cloud. Together with Mirantis, we’ve integrated Kubernetes, our open source container manager, into OpenStack. This integration will make it easier to run your apps on a private cloud, while enabling new “hybrid” cloud possibilities. To learn more, sign up for the waiting list.

New “Hybrid” Possibilities
If your company is bigger than a startup, you probably have both on-premises and public cloud infrastructure to host your portfolio of apps. This “hybrid” approach is great in theory: your on-premises infrastructure offers control and you can scale to the public cloud when necessary. Unfortunately, it’s not always easy to take advantage of this flexibility—it’s still too hard to move workloads between infrastructures.

With Kubernetes powering both your private and public clouds, you’ll be able to unlock the power of a hybrid infrastructure. For example, you might run a primary instance of your application in a private cloud, and then replicate other instances to Google Container Engine in geographies where you don’t have on-premises infrastructure.

Learn More
To learn more about how we’re working together with Mirantis, read their blog post. And feel free to stop by the Kubernetes Gathering on February 25th in San Francisco to see Mirantis give a full demo.

-Posted by Kit Merker, Product Manager, Google Cloud Platform

Posted:
Deploying a new build is a thrill, but every release should be scanned for security vulnerabilities. And while web application security scanners have existed for years, they’re not always well-suited for Google App Engine developers. They’re often difficult to set up, prone to over-reporting issues (false positives)—which can be time-consuming to filter and triage—and built for security professionals, not developers.

Today, we’re releasing Google Cloud Security Scanner in beta. If you’re using App Engine, you can easily scan your application for two very common vulnerabilities: cross-site scripting (XSS) and mixed content.

While designing Cloud Security Scanner we had three goals:
  1. Make the tool easy to set up and use
  2. Detect the most common issues App Engine developers face with minimal false positives
  3. Support scanning rich, JavaScript-heavy web applications

To try it for yourself, select Compute > App Engine > Security scans in the Google Developers Console to run your first scan, or learn more here.



So How Does It Work?
Crawling and testing modern HTML5, JavaScript-heavy applications with rich multi-step user interfaces is considerably more challenging than scanning a basic HTML page. There are two general approaches to this problem:

  1. Parse the HTML and emulate a browser. This is fast, however, it comes at the cost of missing site actions that require a full DOM or complex JavaScript operations.
  2. Use a real browser. This approach avoids the parser coverage gap and most closely simulates the site experience. However, it can be slow due to event firing, dynamic execution, and time needed for the DOM to settle.
Cloud Security Scanner addresses the weaknesses of both approaches with a multi-stage pipeline. First, the scanner makes a high-speed pass, crawling and parsing the HTML. It then executes a slow and thorough full-page render to find the more complex sections of your site.
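
Here is a toy sketch of that two-pass idea, not the scanner’s actual code: a fast pass with Python’s built-in HTML parser, plus a slow pass with a real (headless) browser via Selenium for pages whose links only appear after JavaScript runs.

```python
import urllib.request
from html.parser import HTMLParser

from selenium import webdriver  # pip install selenium; needs Chrome + chromedriver

class LinkParser(HTMLParser):
    """Fast pass: pull hrefs out of static HTML without executing JavaScript."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def fast_crawl(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    return parser.links

def full_render_crawl(url):
    """Slow pass: a real browser, so links created by JavaScript are visible too."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return [a.get_attribute("href")
                for a in driver.find_elements("tag name", "a")]
    finally:
        driver.quit()
```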

While faster than a real browser crawl alone, this process is still too slow, so we scale horizontally. Using Google Compute Engine, we dynamically create a botnet of hundreds of virtual Chrome workers to scan your site. Don’t worry: each scan is limited to 20 requests per second or lower.
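
Capping a crawler at a fixed request rate is straightforward in principle; here is a toy token-bucket limiter as an illustration (our actual enforcement mechanism isn’t described here).

```python
import time

class TokenBucket:
    """Toy limiter: allow at most `rate` requests per second, with small bursts."""
    def __init__(self, rate):
        self.rate = float(rate)
        self.tokens = self.rate
        self.last = time.monotonic()

    def acquire(self):
        """Block until a request may be sent, then consume one token."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

limiter = TokenBucket(rate=20)  # the per-scan cap mentioned above
# call limiter.acquire() before each request a scan worker sends
```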

Then we attack your site (again, don’t worry)! When testing for XSS, we use a completely benign payload that relies on Chrome DevTools to execute the debugger. Once the debugger fires, we know we have JavaScript code execution, so false positives are (almost) non-existent. While this approach comes at the cost of missing some bugs due to application specifics, we think that most developers will appreciate a low effort, low noise experience when checking for security issues—we know Google developers do!

As with all dynamic vulnerability scanners, a clean scan does not necessarily mean you’re security bug free. We still recommend a manual security review by your friendly web app security professional.

Ready to get started? Learn more here. Cloud Security Scanner is currently in beta with many more features to come, and we’d love to hear your feedback. Simply click the “Feedback” button directly within the tool.

-Posted by Rob Mann, Security Engineering Manager