Getting Started With Hybrid Cloud Adoption

Dataiku Product, Scaling AI | Joel Belafa

A recent Gartner research survey on cloud adoption revealed that more than 80% of respondents using the public cloud were using more than one cloud service provider (CSP). Therefore, data, analytics, and IT leaders must prepare for a multi-cloud and hybrid cloud world, where data governance, security, compliance, and integration become more complex than ever before.

This blog post will go over what hybrid cloud is, as well as some main pitfalls and considerations to take into account when building a hybrid cloud strategy. 

What is Hybrid Cloud?

Hybrid cloud is a solution that combines on-premise data centers and private cloud with one or more public cloud services, with proprietary software enabling communication between each distinct service. A hybrid cloud strategy provides businesses with greater flexibility by moving workloads between cloud solutions as needs and costs fluctuate. 

It’s also a good solution if an organization needs to offer services both in private data centers and through a cloud subscription. They can then build web applications, services, or machine learning models and use them both on-premise and in the cloud, as well as take advantage of their hybrid architecture to maintain communication between applications and data flows between the cloud and on-premise infrastructures.


Getting Started 

The first important aspect to consider in a hybrid cloud approach is the hybrid and inter-subscription network. The strategy for connecting the network segments of the cloud must be clearly defined, not only between on-premise data centers or between different cloud providers, but also between the various subscriptions, tenants, or projects of the same cloud provider.

There are generally two schools of thought when it comes to subnetwork classification:

  • Architecture Spanning: network segmentation by technology or by application. In an architecture spanning hybrid cloud model, different components of an application architecture may reside on-premises and/or in the cloud. For instance, the DBMS might reside on-premises and the applications that connect to it may reside in the cloud. 
  • Use Case Specific: segmentation by business unit or by project. In a use-case-specific hybrid cloud model, components are segmented by their development lifecycle function. For example, any of the development, test, quality assurance (QA), disaster recovery (DR), or production instances of a DBMS may reside on-premises or in the cloud.
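
To make the contrast concrete, the same components might be grouped as follows under each model. This is only a rough sketch, and all names are purely illustrative.

```python
# Architecture spanning: one network segment per technology layer,
# shared by every lifecycle stage that uses that layer.
architecture_spanning = {
    "subnet-dbms": ["dbms-dev", "dbms-qa", "dbms-prod"],
    "subnet-apps": ["app-dev", "app-qa", "app-prod"],
}

# Use case specific: one segment per project and lifecycle stage,
# each holding every component that stage needs.
use_case_specific = {
    "subnet-churn-dev": ["dbms-dev", "app-dev"],
    "subnet-churn-qa": ["dbms-qa", "app-qa"],
    "subnet-churn-prod": ["dbms-prod", "app-prod"],
}
```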

The first option is generally the most common in legacy data centers, but when you take advantage of the cloud and hosted services, or when you follow the design rules of distributed computing, the second option is by far preferable.

This becomes even more apparent when we look into the current difficulties of hybrid cloud adoption from a business perspective.

Challenges and Pitfalls of Hybrid Cloud Adoption

According to Gartner, the main challenges and pitfalls of hybrid cloud adoption are as follows:

  • Hybrid data and analytics architectures that span cloud and on-premises environments are prevalent. However, architectural considerations for dealing with a hybrid database management system (DBMS) cloud environment are neither inherently obvious nor consistent, which has financial and performance implications for data and analytics leaders.
  • Data and analytics leaders are challenged to understand how data flows, both in volume and direction, and how data location impacts performance, application latency, SLAs, high availability and disaster recovery (HA/DR) strategies, and financial models in hybrid DBMS cloud scenarios.

After observing a number of projects, one can confirm these two problems and also realize that they are in fact two sides of the same coin. The ability to create a sustainable hybrid cloud data platform depends not only on where the data is located but also on the path the data travels during processing.

While this is obvious at the macro scale (that is to say, at the inter-application scale), it is not always obvious at the finer scale of data processing. This is mainly because data science projects traditionally add more and more data sources over time, so for a given use case it’s rare to know in advance what the final data flow will be.

It is therefore a matter of choosing not only the right location but also technologies and storage formats compatible with the operations carried out, in order to maximize productivity during the design phase and to optimize and control costs during the production phase.

 

With Dataiku DSS, a fairly simple matrix of rules can be described as follows:

  • For operations that can be translated into SQL or Spark, priority should be given to the location and the source technology.
  • For other operations (model training, or code-based data transformations in Python, R, or Scala), the data should be stored as files as much as possible, with object storage as the long-term technology.
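
As a minimal illustration of this rule matrix, the decision logic could be expressed along the following lines; the function and its arguments are hypothetical and only meant to make the routing explicit.

```python
def choose_execution(expressible_in_sql: bool, runs_on_spark: bool) -> str:
    """Return the preferred execution and storage strategy for one
    operation, following the rule matrix above."""
    if expressible_in_sql:
        # Push the computation down to the source database: the data
        # never has to leave its current location.
        return "in-database SQL engine, data stays in the source technology"
    if runs_on_spark:
        # Spark reads where the data lives; keep compute close to storage.
        return "Spark engine, colocated with the source data"
    # Code-based transformations (Python, R, Scala) and model training:
    # store the data as files, with object storage as the long-term home.
    return "code recipe over files on object storage"


# Example: a Python-based feature engineering step that cannot be
# expressed in SQL and does not run on Spark.
print(choose_execution(expressible_in_sql=False, runs_on_spark=False))
```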

The real experiences of customers who neglected this phase show degraded performance even when physical resources are sufficient. The impossibility of parallelizing certain operations and data flows that transit through too many intermediary processes are often the source of significant bottlenecks and frustration.

The final item to watch out for is potential lock-in to a storage technology that only exists in the cloud. There are a few tips for when you face this type of dependency and the project is already well advanced:

  • It is generally advised to split or clone parts of the workflow and migrate the data sources in a logical order (mainly for SQL databases).
  • Another option is to use abstract interfaces such as Hadoop datasets, which require a different implementation depending on the environment (HDFS on-premise, object storage with any cloud provider). With this approach, the overall project structure wouldn’t need to be modified.
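
Outside of DSS-managed datasets, the same idea can be sketched with the open source fsspec library: switching the URL scheme is enough to move from on-premise HDFS to cloud object storage without touching the rest of the code. The bucket and paths below are hypothetical, and the s3:// and hdfs:// schemes require the s3fs and pyarrow packages respectively.

```python
import fsspec
import pandas as pd

# On-premise flavor of the same logical dataset:
# path = "hdfs://namenode:8020/data/transactions/part-0.parquet"
# Cloud flavor on object storage:
path = "s3://example-bucket/data/transactions/part-0.parquet"

# The code that consumes the dataset is identical in both cases.
with fsspec.open(path, mode="rb") as f:
    df = pd.read_parquet(f)

print(df.head())
```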

The last requirement is a suitable framework for infrastructure management as code, since the provisioning of your application layer must be as flexible as that of your cloud infrastructure, with no manual intervention in the process. In Dataiku DSS, we make the necessary framework accessible:

  • By client or server API
  • By Command Line Interface

This facilitates integration with any existing open CI/CD technology or tool and leaves room to improve or evolve according to your future needs.
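
For instance, a CI/CD job could drive a DSS instance through the dataikuapi Python client along the following lines. The host, API key, project key, and scenario name are placeholders, and the available methods should be checked against the API documentation of your DSS version.

```python
import dataikuapi

# Connect to the DSS instance with an API key provisioned for automation.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# Sanity check: the instance is reachable and the key is valid.
print(client.list_project_keys())

# Trigger a scenario (e.g., deployment tests) as one step of the pipeline.
project = client.get_project("MY_PROJECT")
scenario = project.get_scenario("deploy_and_test")
scenario.run()  # or run_and_wait() to block until the run completes
```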

Scalability and Offload to the Cloud

Finally, when you operate in a cloud-native environment, you naturally come to rely on the possibility of offloading on-premise applications to the cloud. Offloading to the cloud is fairly easy for activities or projects that don’t require external metadata or complex transfers. For instance, a predictive model exposed as a web app can reach virtually unlimited horizontal scalability.

Nevertheless, putting in place an efficient and reliable offload mechanism for data ingestion or any ETL task is key, since there is a chance that the transformation layer is in the cloud while the extract or load layer is on-premise. Network traffic and its cost would then significantly lower the benefits brought by additional resources from the cloud.
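
One common mitigation is to push as much of the extract work as possible down to the on-premise source so that only a reduced result crosses the network boundary, then stage it on object storage for cloud-side transformation. Here is a minimal sketch with hypothetical connection strings, table, and bucket names; writing to s3:// paths requires the s3fs package.

```python
import pandas as pd
import sqlalchemy

# On-premise source: push filtering and aggregation down to the database
# so only the aggregate leaves the data center.
engine = sqlalchemy.create_engine("postgresql://user:pwd@onprem-db:5432/sales")
query = """
    SELECT store_id,
           DATE_TRUNC('day', sold_at) AS day,
           SUM(amount)                AS revenue
    FROM transactions
    WHERE sold_at >= NOW() - INTERVAL '30 days'
    GROUP BY store_id, DATE_TRUNC('day', sold_at)
"""
df = pd.read_sql(query, engine)

# Cloud side: land the reduced dataset on object storage, where elastic
# compute can take over the rest of the transformation.
df.to_parquet("s3://example-bucket/staging/daily_revenue.parquet", index=False)
```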

Dataiku DSS is now considered a cloud-native application because we have integrated compute execution with native hosted services from the major public cloud providers. Since these providers have established standards for many of these services, integration is also greatly facilitated in private clouds or with less widespread cloud providers.
