Not sure what Hadoop actually is? A little fuzzy on the difference between cloud and on-prem storage? We’ve set out to demystify the jargon surrounding data architecture so that every team can understand how it affects their objectives.
The Data Architecture Domain
It’s useful to divide data architecture concerns into two main goals: providing stability and continuity of service as a user base (and data stores) scale, and ensuring that data and processes are securely protected from bad actors or sloppy mistakes.
These umbrella priorities work in tandem to create, support, and govern modifications of the data pipeline, and they underpin the strength and business value of data initiatives. As with physical architecture, a well-structured data architecture can make data science and analysis a streamlined joy, while an incompatible architecture makes every effort a Herculean task.
See what data architecture terms you’re familiar with, and which ones you could use a refresher on:
Scalability Key Terms
—Distributed Systems segment the storage and compute resources of a system onto different machines that are able to run in parallel, thus speeding up work and minimizing the risk of single points of failure. The building blocks (storage and computation) of distributed systems are Nodes. A Cluster is a collection of multiple nodes.
—Data Partitioning: The act of splitting data into segments that are more easily maintained or accessed. For example, distributed systems partition data in order to improve scalability and optimize performance.
—Data Replication: The act of copying the same data to multiple systems. In synchronous replication, data is written to primary storage and to a replica or backup at the same time. In asynchronous replication, data is copied to a replica or backup after it has already been written to the primary storage location.
—Distributed systems are typically designed to be both Highly Available (able to keep operating with minimal downtime) and Fault Tolerant (able to keep working when an individual component fails).
—Hadoop is a popular open-source, cluster-based framework for distributed storage and processing. Data processing engines and frameworks that can be used on top of Hadoop include (but are not limited to) Hadoop MapReduce and Apache Spark.
—Storage Resources are the data storage capabilities of an architecture and a main requirement for all data science operations. Along with compute resources, these are a good indicator of an architecture’s ability to scale.
—Compute Resources are the processing capabilities of a system that are available to perform computational work (e.g. execute programs, carry out analysis, etc.).
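To make partitioning, replication, and fault tolerance more concrete, here is a minimal Python sketch of a toy cluster. It is an illustration only, not how Hadoop or any particular system works: the `Cluster` class, the `partition_for` and `replica_nodes` helpers, and the node names are all hypothetical, and each "node" is simply an in-memory dictionary.

```python
import hashlib
from typing import Dict, List, Set


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replica_nodes(partition: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the primary node for a partition, followed by its replicas."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]


class Cluster:
    """A toy cluster (hypothetical): each node is just an in-memory dictionary."""

    def __init__(self, nodes: List[str], replication_factor: int = 2) -> None:
        self.nodes = list(nodes)
        self.replication_factor = replication_factor
        self.storage: Dict[str, Dict[str, object]] = {name: {} for name in self.nodes}
        self.down: Set[str] = set()  # names of nodes we pretend have failed

    def _owners(self, key: str) -> List[str]:
        # Partitioning decides which nodes are responsible for this key.
        partition = partition_for(key, len(self.nodes))
        return replica_nodes(partition, self.nodes, self.replication_factor)

    def write(self, key: str, value: object) -> None:
        # Synchronous replication: every copy is written before write() returns.
        for name in self._owners(key):
            self.storage[name][key] = value

    def read(self, key: str) -> object:
        # Fault tolerance: fall back to a replica if the primary is down.
        for name in self._owners(key):
            if name not in self.down and key in self.storage[name]:
                return self.storage[name][key]
        raise KeyError(key)


cluster = Cluster(["node-a", "node-b", "node-c"], replication_factor=2)
cluster.write("user:42", {"name": "Ada"})
```

After the write, adding the primary node's name to `cluster.down` still lets `cluster.read("user:42")` succeed, because a second copy lives on another node. An asynchronous variant would return from `write` after updating the primary and copy to the replicas later.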
Security Key Terms
—AAA stands for authentication, authorization, and audit, which are the three pillars of data architecture security.
—Authentication is the security process by which a user or process confirms its identity. This can be handled in one central location (as is the case with single sign-on, or SSO) and can be strengthened by requiring more than one proof of identity (multi-factor authentication).
—Authorization is the security process by which the system gives a user or process the ability to read and write data or execute programs within certain parts of a system. Depending on the user’s clearance level, this may represent a small section of data relevant to their processes or could include the ability to act as an admin and authorize other users.
—Audit is the ability to trace and review everything that’s been done within the system.
—Data Lineage covers the entire path data takes from creation to storage to analysis. It also records who “owns” what data at a given time, which provides transparency that is critical for regulatory compliance.
—User Permissions enable users to work with a certain set of data. They come in three levels: Read, where a user can consume the data stored in the system; Write, where a user can modify data; and Execute, where a user can run commands and programs that can affect the database structure.
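The three AAA pillars and the Read, Write, and Execute permissions above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production security design: the `AccessController` class, its method names, and the sample user are hypothetical, and real systems usually delegate these tasks to dedicated identity and access management tools.

```python
import hashlib
import hmac
import os
from datetime import datetime, timezone
from enum import Flag, auto


class Permission(Flag):
    READ = auto()     # consume data
    WRITE = auto()    # modify data
    EXECUTE = auto()  # run commands and programs


class AccessController:
    """Hypothetical helper combining authentication, authorization, and audit."""

    def __init__(self) -> None:
        self._credentials = {}  # user -> (salt, password hash)
        self._grants = {}       # user -> Permission flags
        self.audit_log = []     # (timestamp, user, action, allowed)

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def _record(self, user: str, action: str, allowed: bool) -> None:
        # Audit: keep a trace of every attempt, successful or not.
        self.audit_log.append((datetime.now(timezone.utc), user, action, allowed))

    def add_user(self, user: str, password: str, permissions: Permission) -> None:
        salt = os.urandom(16)
        self._credentials[user] = (salt, self._hash(password, salt))
        self._grants[user] = permissions

    def authenticate(self, user: str, password: str) -> bool:
        # Authentication: does the caller prove the identity it claims?
        ok = False
        if user in self._credentials:
            salt, expected = self._credentials[user]
            ok = hmac.compare_digest(expected, self._hash(password, salt))
        self._record(user, "authenticate", ok)
        return ok

    def authorize(self, user: str, needed: Permission) -> bool:
        # Authorization: is this identity allowed to perform the action?
        granted = self._grants.get(user, Permission(0))
        ok = needed in granted
        self._record(user, "authorize:" + str(needed), ok)
        return ok


controller = AccessController()
controller.add_user("analyst", "s3cret", Permission.READ)
```

Here the hypothetical `analyst` user can authenticate and read, but a call such as `controller.authorize("analyst", Permission.WRITE)` is refused, and both the successful and the refused attempts end up in `controller.audit_log`.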
Take the Next Step
Data architecture is not just for architects themselves; users on any team can benefit from understanding how their data insights are generated and how those insights align with business priorities. If you could use a review of some of these terms (or of how they work together to support data initiatives), consider pursuing a broader data architecture education.