Resilience for NoSQL - Meet Datos IO

The growth of Cloud, social, and mobile technology is driving increased use of NoSQL databases like Apache Cassandra, MongoDB, Amazon DynamoDB, and Google Bigtable. A recent study by ESG revealed that 56% of companies have NoSQL databases in production or deployed as part of a pilot/proof of concept. A similar study by Couchbase revealed that 90% of companies consider NoSQL important or critical to their business. Once the province of Internet innovators like Google, Amazon, Facebook, LinkedIn, and Twitter, NoSQL databases are now used widely across every major industry. Modern applications create new demands that developers and DBAs must address, such as:

  • Huge volumes of rapidly changing data, including semi-structured, unstructured, and polymorphic data.
  • Applications delivered as services that must be always on, accessible from mobile devices, and scaled globally to millions of users.
  • Agile development methods in which small teams work in sprints, iterate quickly, and release code every few weeks, sometimes daily.
  • Scale-out architectures built on open source software, commodity servers, and Cloud computing.

The choice for many of these applications is a distributed (NoSQL) database that stores portions of the data on multiple computers (nodes) within the network. These databases scale rapidly by adding nodes to a cluster, making them effective for both Cloud and on-premise use. Their schema-less design allows data of different formats to be added, accommodating the large amount of unstructured data (documents, audio, video, images) in use today. In-memory processing and the use of direct-attached storage can provide fast query processing and support real-time analytics.
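
To make the schema-less point concrete, here is a minimal sketch of adding differently shaped records to the same collection. It assumes a local MongoDB instance reachable through the standard pymongo driver; the database and collection names are illustrative.

```python
from pymongo import MongoClient

# Minimal sketch: a schema-less document store accepts records with different
# shapes in the same collection. Assumes MongoDB is running on localhost;
# "demo_db" and "media_assets" are illustrative names, not a real deployment.
client = MongoClient("mongodb://localhost:27017")
media = client["demo_db"]["media_assets"]

media.insert_one({"type": "document", "title": "Q3 report", "pages": 42})
media.insert_one({"type": "video", "title": "Launch keynote",
                  "duration_sec": 1800, "resolution": "1080p"})

# Both records live side by side -- no schema migration was required.
for doc in media.find({}, {"_id": 0}):
    print(doc)
```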

The performance, cost, and agility innovations found in NoSQL databases come at a price. These open source databases lack the robust internal tools necessary for effective data protection and disaster recovery (DR), and they are not as operationally mature as their relational counterparts. It took many years for IBM, Oracle, and Microsoft to build robust data management and protection capabilities into their products. Low cost and high performance don’t matter very much if databases are offline or contain corrupted data. Historical approaches and the mainstream data protection solutions widely used today are not suitable for NoSQL data protection, disaster recovery, and copy management. Given the need for always-on, resilient systems, NoSQL applications will require better tools for data protection, compliance, and disaster recovery. The issues include:

  • NoSQL databases are eventually consistent and scale out using locally attached storage (DAS). LUN-based snapshots do not produce application-consistent copies of the database suitable for restore or use in DR (the sketch after this list illustrates the problem).
  • Stopping database writes to allow the system to catch up and produce an application-consistent snapshot is not practical.
  • Traditional backups of cluster nodes don’t describe the dependencies between VMs and their applications. A detailed understanding of application configurations is required to restore a NoSQL cluster from VM backups.
  • NoSQL nodes can be dynamically added (and removed), causing data to move between nodes. Backups using traditional products don’t reflect the current state of NoSQL nodes, making them unusable for DR.
  • Taking application-inconsistent snapshots and running a database repair (replay) process can take hours to days, which is not practical for low-RPO DR.
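
To see why per-node copies are hard to line up, here is a toy simulation of the first point above. It is not any vendor's algorithm, just a model of a three-node, eventually consistent cluster in which each replica lags the write stream by a small amount, so snapshots taken at "the same moment" still capture different states.

```python
import random

# Toy model: 100 sequential writes stream into a 3-node cluster. Each replica
# has applied the stream only up to some high-water mark, because replication
# lags by a small, random amount per node (eventual consistency).
random.seed(7)
total_writes = 100
applied = {name: total_writes - random.randint(0, 5)
           for name in ("node1", "node2", "node3")}

# Take "simultaneous" node-level snapshots: each captures a different prefix
# of the write stream.
snapshots = {name: set(range(1, count + 1)) for name, count in applied.items()}
print({name: len(snap) for name, snap in snapshots.items()})

# The copies disagree on which writes they contain, so together they do not
# form a single point-in-time view of the cluster -- which is why LUN- or
# node-level snapshots need repair or replay before they are usable for DR.
```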

The main tools used for NoSQL today fall short of providing a robust, efficient solution for data protection and DR. Scripting is a common approach to backup and recovery; however, it is labor-intensive and error-prone because configurations constantly change. Replication is meant to address availability and is often not effective for data backup and DR: if the database becomes corrupted, the bad data simply replicates itself across nodes. Keeping multiple replicas to support DR wastes storage space, creates management issues, and requires some method of deduplicating the data.
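
The sketch below shows the kind of hand-rolled, per-node snapshot script this refers to. It is deliberately simplistic: the node list is hard-coded, a single failed host leaves a partial set of snapshots, and nothing checks cluster-wide consistency. The host names, keyspace, and the ssh/nodetool invocation are illustrative assumptions, not a recommended procedure.

```python
import subprocess
from datetime import datetime

# Hypothetical per-node Cassandra snapshot script of the sort described above.
NODES = ["cass-node-1", "cass-node-2", "cass-node-3"]   # hard-coded cluster view
KEYSPACE = "orders"                                      # illustrative keyspace
TAG = "nightly-" + datetime.utcnow().strftime("%Y%m%d")

for host in NODES:
    cmd = ["ssh", host, "nodetool", "snapshot", "-t", TAG, KEYSPACE]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # No retry, no rollback, no cross-node consistency check -- the backup
        # set is now partial and someone has to notice and intervene.
        print(f"snapshot failed on {host}: {result.stderr.strip()}")
```

Every node added to or removed from the cluster means editing the script, and the snapshots it produces still suffer from the consistency problems listed earlier.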

Given the importance of NoSQL to mission-critical applications, it is not surprising that new data protection and DR solutions have come to market. One such solution is from Datos IO (http://www.datos.io). Datos IO is driven by a vision to make data recoverable at scale for next-generation databases. Their product addresses the main issues described above. It creates application-consistent, point-in-time backups (versioning) across all nodes of common NoSQL databases. These consistent copies eliminate manual effort during restores and the need to replay activity logs. Datos IO allows point-in-time versions to be created as frequently as every 15 minutes. Backups can be stored on-premise (in NFS format) or in the public Cloud.

Backup is only part of the Datos IO solution. It also provides near-instantaneous data restores by storing NoSQL backup data in native format. Datos IO backs up the database at granular levels, providing the technology needed to meet low-RTO/RPO scenarios. Backups are incremental forever, limiting the bandwidth required to keep NoSQL backups current and lowering operational costs. The Datos IO backup becomes the single source of truth about the state of a NoSQL database.
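
As a generic illustration of the incremental-forever idea (not Datos IO's implementation), the sketch below copies only files whose content has changed since the last version; unchanged files cost no bandwidth on subsequent runs. The paths and catalog structure are assumptions for the example.

```python
import hashlib
import os
import shutil

def file_digest(path):
    """Content hash used to decide whether a file changed since the last backup."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def incremental_backup(source_dir, backup_dir, catalog):
    """Copy only new or changed files; 'catalog' maps file name -> last known digest."""
    os.makedirs(backup_dir, exist_ok=True)
    copied = 0
    for name in os.listdir(source_dir):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue
        digest = file_digest(src)
        if catalog.get(name) != digest:        # new or changed since the last version
            shutil.copy2(src, os.path.join(backup_dir, name))
            catalog[name] = digest
            copied += 1
    return copied                              # unchanged files are skipped entirely
```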

Disaster recovery can be accomplished with a single click and is configurable to meet a variety of needs. The database can be restored to the same cluster or to an entirely different one, an important feature for Cloud-based DR. This capability also allows Datos IO to support DevOps use cases by rapidly creating test/dev nodes or migrating data across Cloud environments. Datos IO is also space-efficient, performing semantic deduplication of its backups and saving customers up to 70% on the cost of recovery storage.

Data protection and disaster recovery technologies follow new innovations to market. The explosive growth of NoSQL databases demands the operational maturity of their relational counterparts. Datos IO is bringing the robustness of traditional data protection products to modern Cloud, big data, and distributed applications.

To learn more about data protection, Cloud, and trends in resilience, subscribe to my blog here.

The Case for Resilience

The IT analyst firm Gartner predicts that by 2020 there will be over 26 billion devices connected to the Internet. When your alarm clock goes off in the morning, it will notify your coffee maker to begin brewing. Five million new devices are attached to the Internet every day, streaming digital information to be captured, analyzed, and turned into useful information. Technology innovations such as Cloud computing, smartphones, and new distributed database structures (e.g., NoSQL) have replaced legacy IT systems to provide rapid, scalable IT services. The pace of business is accelerating, and our reliance on technology has never been greater. Speaking at a recent conference of business leaders in Davos, Switzerland, John Chambers, former CEO of Cisco, told an audience that “Forty percent of the companies in this room won't exist, in my opinion, in a meaningful way in 10 years unless they change dramatically”.

Today’s economy is increasingly defined by digital technology. Companies have designed IT systems that connect them to their customers, suppliers, and partners in real time. Data from transactions and interactions is captured and analyzed, resulting in faster decisions that reflect current market conditions. The Internet of Things (IoT) is allowing any device with an on-off switch to be connected to the Internet, or to each other. This includes cars, fitness trackers, coffee makers, jet engines, traffic lights, water systems, and more.

As companies race to integrate digital technology, their reliance on IT increases. The loss of IT systems or applications is felt immediately by customers, suppliers, and business partners; in many cases, customers can fire you with two clicks of a mouse. The cost of downtime is rising. A study by IDC revealed that for the Fortune 1000, the average total cost of unplanned application downtime is $1.25 billion to $2.5 billion per year, and the average cost of a critical application failure is $500,000 to $1 million per hour.

Since the 1980s, companies have relied on a centralized IT function to protect information and recover systems if they fail. Over the past 35 years, the disaster recovery industry grew in response to the need for information protection. That industry is now at an inflection point. The role of centralized IT is changing rapidly with the rise of Cloud computing and the proliferation of mobile devices. The ease and speed with which computing power can be purchased and new applications can be composed have complicated IT’s ability to provide reliability and ensure availability of distributed systems and data. Traditional methods for backing up data and providing disaster recovery are often not effective for cloud-native applications.

Consider that several years ago, companies reported a tolerance for downtime of critical systems of 24 to 48 hours. A recent study by a leading IT industry analyst showed that 83% of companies now report a maximum acceptable downtime of 4 hours or less, and an additional 7% reported a downtime tolerance of one hour or less!

Meeting this demand will require a new way of thinking: resilience must be engineered into systems, as opposed to the traditional method of bolting disaster recovery onto the back end. Companies must shift their focus from planning to recover from failures to ensuring that systems keep running when failures occur. This (not subtle) change will require new methods and skills, along with broader executive support from the C-suite and line-of-business leaders. It will also require tremendous new innovation and a rethinking of industry regulations that deal with the protection and preservation of digital records.

Today, over 90% of all corporate applications are being designed for the Cloud and mobile devices. Cisco predicts that Cloud traffic will quadruple between 2014 and 2019. The IoT, connected devices, and advanced analytics may make us all feel smarter; however, they are also creating massive amounts of data that must be protected and new types of systems that must not fail. Ninety percent of the data in the world today was created in the last two years. In the last 30 days, people watched 4 billion hours of YouTube videos, created 30 billion new pieces of content on Facebook, and sent 12 billion tweets.

I have been asked by many to share my thoughts and opinions on the state of disaster recovery. This blog is an attempt to do just that - to share, to hear ideas, to challenge you to think about these issues, and to have you challenge my thinking. I hope you will join me in this new venture, provide me with feedback, and share your thoughts. Together, we will have meaningful discussions about a topic near to our hearts. Welcome!

The Role of Analytics in Disaster Recovery

This is part 1 of a multi-part series on the evolution of analytics in disaster recovery.

It may seem odd to discuss the role of analytics in the field of disaster recovery. These disciplines appear to have little in common. Wikipedia describes Disaster Recovery (DR) as a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Analytics is described as the discovery and communication of meaningful patterns in data.

In this series I'll discuss how analytics will improve resilience, lower risk, and enhance business continuity. I'll explore how analytic DR services could come to market, which parties stand to benefit most, and some of the challenges that lie ahead. Part 1 discusses how analytics will enhance disaster recovery in the near term, and a longer-range vision in which analytics and automation are combined to improve risk management.

The evolution of DR closely follows the development of IT, providing methods, products, and services to recover systems within required time frames and levels of data currency. From the early 1980s until about five years ago, disaster recovery focused mainly on the backup and recovery of physical computer systems. Given the need to recover physical systems to a like environment, vendors aggregated clients with similar IT environments to provide shared DR services. These services made DR more affordable for many companies. This model of recovering physical systems worked well when acceptable downtime for most IT systems could be measured with a calendar.

Today, this is no longer true. Over 90% of all new applications are being developed for the Cloud. Cloud infrastructure, application characteristics and data structures are different. Cloud workloads are deployed in virtual environments, often spread across geographic boundaries. Many companies use combinations of private and public (hybrid) Clouds to run their applications. Cloud resources are dynamically added and removed based on capacity demand. And forget that calendar; downtime tolerance for most Cloud systems is minimal, measured with either a clock or stopwatch. 

By capturing and analyzing metadata stored in the Cloud stack, companies will be able to gain deep insight into data protection and disaster recovery. Analytics can be applied across the IaaS/PaaS layers to help companies better understand data protection and DR functions such as backup, replication, DR testing, and system recovery. It should be noted that some tools used in physical DR setups capture data that can be analyzed to gain insight into discrete functions, e.g., the success rate of data backups. Cloud analytics will allow companies to gather information across the spectrum of data protection and DR functions to gain insight into how DR is working and how Cloud resources can be optimized. Analytic data and algorithms will be used to make recommendations on how DR processes can be improved to produce better outcomes.
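
As a small, concrete example of the kind of insight meant here, the sketch below computes a per-workload backup success rate and average backup window from job metadata. The field names and records are assumptions about what a Cloud or DRaaS platform might expose, not any vendor's actual schema.

```python
import pandas as pd

# Hypothetical backup-job metadata pulled from a Cloud/DRaaS platform.
jobs = pd.DataFrame([
    {"workload": "billing-db", "started": "2016-05-01T02:00", "status": "success", "minutes": 42},
    {"workload": "billing-db", "started": "2016-05-02T02:00", "status": "failed",  "minutes": 7},
    {"workload": "web-tier",   "started": "2016-05-01T03:00", "status": "success", "minutes": 18},
    {"workload": "web-tier",   "started": "2016-05-02T03:00", "status": "success", "minutes": 19},
])

summary = jobs.groupby("workload").agg(
    success_rate=("status", lambda s: (s == "success").mean()),
    avg_minutes=("minutes", "mean"),
)
print(summary)
# Workloads with a low success rate or a growing backup window are the ones
# most likely to miss their RPO, so they get flagged for attention first.
```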

DR analytics will benefit companies and vendors alike. DRaaS vendors will use analytics to optimize DR capacity and costs across Cloud infrastructure. Metadata can be mined across customer segments to produce useful benchmark data, helping customers improve DR and business continuity (BC) management.

The first wave of analytic implementations will be used to help companies improve data protection, monitor compliance, enhance DR testing, and design affordable resilience for critical IT systems. Analytics will also be used to help optimize DR Cloud capacity, costs, performance, and resource allocation.

But the use of analytics will not stop there. Cloud automation, inter-Cloud operability, IoT, and predictive analytics will be combined to usher in a new era that may change how DR is performed today. I define this new era as predictive risk management. Predictive analytics will examine a variety of threat and risk data in real time and determine whether critical Cloud workloads are exposed to unacceptable levels of risk. These analytic models will be combined with Cloud automation to move workloads out of harm's way. This model of resilience will change how companies manage risk and how DRaaS vendors provide service. In future blogs I will discuss how this model might evolve and some of the challenges involved in bringing predictive risk services to market.
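
In sketch form, a predictive risk loop of this kind might look like the example below. The signal names, weights, threshold, and the migrate() stub are invented for illustration; a real DRaaS platform would plug in its own threat feeds and Cloud-automation calls.

```python
# Hypothetical predictive risk management loop: score threat signals for the
# site hosting a workload and trigger a move when risk crosses a threshold.
RISK_THRESHOLD = 0.7
WEIGHTS = {"severe_weather": 0.5, "grid_instability": 0.3, "network_degradation": 0.2}

def risk_score(signals):
    """Weighted sum of normalized (0-1) threat signals for a site."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def migrate(workload, target_region):
    # Placeholder for the Cloud-automation steps (replication failover,
    # DNS cutover, capacity provisioning) a real platform would perform.
    print(f"moving {workload} to {target_region}")

site_signals = {"severe_weather": 0.95, "grid_instability": 0.8, "network_degradation": 0.3}
if risk_score(site_signals) > RISK_THRESHOLD:
    migrate("payments-cluster", "us-west")   # act before the outage, not after it
```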

Disaster recovery techniques and technologies have evolved greatly over the past 30 years. Analytics in DR, together with the rise of Cloud computing, will bring significant benefits, helping companies design truly resilient systems and optimize DR functions in ways never before possible.