Let’s face it: Architecture frameworks start to decay as soon as someone puts them on a PowerPoint slide. If there’s one thing we’ve learned at Dataiku after talking to thousands of prospects and customers about their data architecture, it’s that they also tend to be more aspirational than realistic because, at the enterprise level, data architecture is both complex and constantly changing.
So when it comes to a modern data architecture strategy, the most important factor is not actually the what but the how:
- Is your architecture agile enough to adapt easily to changing needs and technology requirements?
- As data ambitions across the organization evolve over the next year, five years, and 10 years, will the data architecture be able to support them?
- Is the team thinking about the business strategy around data and architecture, not just for downstream consumption, but also upstream (i.e., how data will be made available to other services like applications, APIs, websites, etc.)?
Aside from a small detour into what’s referred to today as the modern data stack, this article won’t detail exactly what the ideal architecture is for scaling AI, because the bottom line is … there isn’t one (at least, there isn’t one that works for every enterprise). Instead, it will focus on three key recommendations that will help teams determine and build the data architecture that’s right for them — and, more importantly, for the organization.
The Modern Data Stack
With the rise of GenAI, the maturity of cloud data, and the constantly increasing scale of data availability, the pace of change is greater than ever before. Organizations can achieve tremendous value by using a modern data stack architecture that meets their needs today and into the future.
Any small or midsize business that’s serious about making the use of data, analytics, and AI (including GenAI) an everyday behavior is using a version of the modern data stack architecture. It can even make sense in the enterprise context for teams just getting started on their AI journey.
Some of the key buzzwords associated with the modern data stack are managed, serverless, and minimal technical expertise required. Because storage and compute are independent in the modern data stack (and because cloud data warehouses can store massive amounts of data cheaply), data transformation can be done more on demand, which places less of a burden on IT; a minimal sketch of this pushdown pattern follows the list below.
Ultimately, the modern data stack is about providing a seamless experience for all users, no matter what their data needs are. It:
- Allows coders to do advanced data science on top of cloud data warehouses (including pushing down data processing tasks, but also operationalizing data projects quickly so they can be leveraged by consumers on the business side) and
- Allows non-coders (like analysts) to do their own data transformation plus advanced data work (e.g., predictive use cases) and
- Automates and orchestrates the operationalization piece, including pushing the results of multi-tool analysis back to the SaaS tools business users are leveraging.
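To make the pushdown idea concrete, here is a minimal sketch, assuming a hypothetical cloud data warehouse reachable through SQLAlchemy (the connection string and table names are placeholders, not a prescribed setup). The transformation is expressed as SQL and executed inside the warehouse, so only the statement itself crosses the network:

```python
# Minimal pushdown sketch: the transformation runs inside the cloud data
# warehouse rather than on the client. The DSN and table names below are
# hypothetical placeholders.
import sqlalchemy

engine = sqlalchemy.create_engine("snowflake://<user>@<account>/<database>")

with engine.begin() as conn:
    conn.execute(sqlalchemy.text("""
        CREATE OR REPLACE TABLE analytics.daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw.orders
        GROUP BY order_date
    """))
```

Because the warehouse’s elastic compute does the heavy lifting, the same pattern works whether the statement comes from an analyst’s notebook or an orchestrated pipeline.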
Even for organizations that have a much more complex legacy setup and therefore can’t fully leverage the simplicity of the modern data stack, the goal of providing a seamless experience for all users to work with data is a valuable takeaway.
3 Keys to a Modern Data Architecture Strategy
1. Don’t Over-Centralize
When it comes to data architecture, for the past five to 10 years, centralization has been the name of the game — potentially to a fault. In fact, the overwhelming majority of IT teams and leaders today have probably tried to centralize too much and too many times.
Here at Dataiku, we talk to a lot of teams (generally within large multinational enterprises across a range of industries) about their data architecture. Usually, when the question “how many single source-of-truth data warehouses do you have?” comes up, the answer is not one. It’s two, three, four, and sometimes even more. Sound familiar?
When efforts to centralize have failed over and over (and over and over) again, the answer isn’t to double down on centralization, but that’s often the reality. Today’s efforts to centralize may sound something like:
- “We need to get governance sorted out before we can start getting value from data and AI initiatives,” or
- “We need to solve data quality before we can start investing in tools to start doing any serious data science, machine learning (ML), or AI projects,” or
- “We just need to finish our cloud migration, then we can start getting return on investment from data.”
Governance, data quality, and the move to the cloud are undoubtedly critical topics. But while such projects are irresistibly tempting from an IT perspective, more centralization without a larger, use case-based goal or purpose doesn’t actually generate any business value and often means that efforts in these areas ultimately fall flat.
It’s nearly impossible to have a conversation about centralized vs. decentralized data architecture without mentioning today’s trendiest term: the data mesh. The concept of the data mesh is less about architecture in the technical sense (while there is certainly something to be said about the tooling possibilities for data mesh architecture, that’s outside the scope of this article) and more about data architecture from an organizational point of view.
In a nutshell, the data mesh is about decentralization and business ownership of business data assets. That means instead of central teams like IT controlling the source of truth for data across business lines, that responsibility falls on the business itself.
The advantage of this approach is that it puts the onus on the business to maintain, use, and create value from their data. After all, if IT owns the source of truth, but no one agrees with it, what use is that centralization? Having every department, team, or even individual employees creating (and re-creating) their own “single view” of the customer is inefficient, on top of undermining the work IT puts into centralizing in the first place.
The disadvantage, of course, is that the data mesh approach is extraordinarily challenging to achieve in practice. AI platforms (like Dataiku) can lower the barrier to making this switch and put more ownership in the hands of lines of business. But as always, technology isn’t a magic bullet — the shift is also largely cultural and will take some serious change management.
There is a natural, underlying tension between IT and the business: the desire to centralize versus the desire to decentralize. However, technology alone can’t (and doesn’t) resolve this tension — what does resolve it is aligning business needs as closely as possible with the owners of the data. That means getting domain experts to decide what the data means, who should use it, how it should be used, and more.
2. Rethink the Role of IT as One of Creating & Delivering Value
For data practitioners, the best-case scenario is when all data is available and accessible on the same technology and it’s all joined together. Unfortunately, no major cloud vendor offers a product you can simply turn on that joins all the data you have and multiplies its business value.
What comes out of your operational or business systems is exhaust, wherever you put it (cloud or on-prem, data lake or warehouse). Turning that exhaust into a solid (i.e., valuable insights) requires human intelligence plus technology: gather the exhaust from the different systems, condense it into a liquid so that it can flow around more easily, then freeze the relevant liquids into a solid, a square table of numbers usable by business teams.
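As a toy illustration of that condensing work, here is a short pandas sketch that merges exhaust from two hypothetical systems into one square table of numbers (the file names and columns are invented for the example):

```python
# Gather exhaust from two hypothetical operational systems and condense it
# into a single tidy table a business team can use. All names are invented.
import pandas as pd

orders = pd.read_csv("crm_export.csv")        # exhaust from a CRM
payments = pd.read_csv("billing_export.csv")  # exhaust from a billing system

# Join on a shared key, then aggregate into a "square table of numbers."
summary = (
    orders.merge(payments, on="customer_id", how="left")
          .groupby("region", as_index=False)
          .agg(customers=("customer_id", "nunique"),
               revenue=("amount", "sum"))
)
print(summary)
```

The code itself is trivial; the human intelligence lies in knowing which systems to gather from, which keys actually line up, and which aggregates the business will trust.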
Providing different teams — whether technical or not — with the skills to do this is the root of IT’s role in the modern enterprise. According to a Dataiku survey of IT executives, most companies have orders of magnitude more analysts than data scientists, a finding that demonstrates the need for data and the ability to build insights from that data to be accessible to (and over time, adopted by) a wider population within the enterprise.
The good news is that AI platforms like Dataiku can help create this value, allowing people from around the organization to gather data themselves (even from multiple sources) and merge it without IT intervention. To keep IT teams from falling behind or becoming buried in data processing and integration jobs, AI platforms like Dataiku can further ease the burden by:
- Enabling a rapid understanding of who is using data, as its size and overall expense continue to ramp up across today’s enterprise
- Handling the proliferation of data and analytics, the demand for more data, and the ease with which data-driven projects can be reused
- Offloading data prep and ETL to the teams regularly using data, such as data scientists
- Supporting elasticity and resource optimization, allowing organizations to process massive amounts of data, handle high levels of concurrent usage, and deploy more services
The bottom line is that even if your organization has successfully put “all data” (remember, it’s hardly ever truly all data) in one place, the human work needed to transform that data, join it together, and get it into a state where it has the potential to provide business value is not entirely automatable — at least not for now. It’s critical to develop a data architecture strategy that considers this factor and facilitates the value part of the pipeline as well.
3. Let Business Objectives Inform Your Architecture (Not the Other Way Around)
Making technical choices before considering business goals (or without considering them at all) can manifest itself as:
🚩 Architecture diagrams and plans that are more well-defined and iterated on than the business case itself.
🚩 Conversations about how to leverage the latest architecture trend that come before finding a use case on the business side.
🚩 Initiatives around driving value from data that start with IT or architecture discussions.
Any way you slice it, designing architecture first and considering business needs after is problematic — and not just from the business perspective. It’s often the root cause of shadow IT: people will try to do things with data that the architecture just doesn’t support, and if the initiative has enough value, they’ll find a way to do it anyway.
But what does it mean in practice to let business objectives inform architecture? It means answering questions like:
- Given your business and the data you have, what kinds of things do you want to do with it? Sell it? Use it? If the latter, for what, specifically?
- What’s the business objective for collecting and storing all this data? How will it ultimately be used?
Bonus: If architecture choices are aligned with the business, the technical choices themselves become much easier. There’s a focal point and an end goal — it’s not just about subjective discussions around which architecture would be the coolest or most innovative, but about what will accomplish that goal.
Dataiku’s Role in a Modern Data Architecture Strategy for Scaling AI
Dataiku plays a central role in modern data architecture strategies by providing a unified, controlled environment that empowers both technical and non-technical users to collaborate on data and AI projects.
Designed for flexibility and scalability, Dataiku integrates seamlessly with leading cloud providers, supports on-premise deployment, and leverages a pushdown architecture to utilize existing compute resources like SQL, Spark, and Kubernetes. It offers a fully managed Kubernetes solution, GPU acceleration for model training, and robust reusability features to streamline development across teams. With extensibility via 100+ plugins and tools like the Dataiku LLM Mesh, IT teams can maintain control while scaling AI initiatives across the enterprise.
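As one illustrative example of what that looks like for a user, here is a minimal sketch of a Python recipe inside Dataiku, assuming project datasets named customers and customers_scored exist (the dataset and column names are placeholders). The recipe reads and writes managed datasets without the author needing to know where they physically live:

```python
# Minimal sketch of a Dataiku Python recipe. Dataset and column names are
# assumed for illustration; Dataiku resolves where "customers" physically
# lives (cloud warehouse, object storage, etc.) and handles the connection.
import dataiku

customers = dataiku.Dataset("customers").get_dataframe()

# Any pandas-level feature engineering or scoring goes here.
customers["is_high_value"] = customers["lifetime_spend"] > 1000

# Write back with a schema inferred from the dataframe; downstream recipes
# can push further processing down to SQL, Spark, or Kubernetes engines.
dataiku.Dataset("customers_scored").write_with_schema(customers)
```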