Thanks for the article on What is a Data Repository and What is it Used for?Looking forward for more articles like this.
Online businesses are rapidly overtaking the revenue of brick-and-mortar businesses in today’s internet age. The changes brought forth by internet-driven communication are driving businesses to become data-driven organizations. Organizations that master how to collect and manipulate data to their advantage will triumph over their competitors. The sheer volume of data being collected by businesses today goes beyond what traditional relational databases can handle, giving rise to a series of different data repositories – Relational Databases, Data Warehouses, Data Lakes, Data Marts, and Operational Data Stores. This can be confusing for a data-architect trying to decide where to store the data due to the nuances present in each datastore.
Before proceeding onwards, a little bit of historical context may be warranted. In 1970, E.F Codd, a man considered to be the father of modern relational database design came up with the Relational Database Management System, now referred to as RDBMS. Until the end of the 1990s, relational databases remained the only option for many businesses.
Thanks to advances in internet technology, mobile phones, social media, IoT, etc., the type of data being collected and stored also changed. The era of Not Only SQL (NoSQL) databases came into existence in the early 2000s. Doug Cutting and Mike Cafarella co-founded Apache Hadoop, which became available to the world through the Apache Hadoop Opensource project. Unlike a standard relational database, NoSQL databases lack a pre-defined schema. It makes use of what is essentially a key-value pair database. While relational databases are ideal for storing structured data, NoSQL databases work much better for storing unstructured data. Companies today collect data from various sources and use it for unique purposes. The type of data coming from each source is different; it is crucial for the business to know where the data should be stored in order to effectively make business decisions.
In this blog, I will help you understand the differences between the types of data repositories – RDBMS, Data Warehouse, Data Lake, Data Mart, and Operational Data Store (ODS). While all these data repositories at their core share one common theme – efficiently store both structured and unstructured data to support reporting and analysis – they differ in their purpose, the type of data they store, and how the data is accessed.
|Feature||RDBMS||Data Warehouse||Data Lake||Data Mart||Operational Data Store|
|Purpose||OLTP||BI/Reporting||Big Data Analytics. Data Discovery||Targeted Business Analytics||Consolidated OLTP|
|Type of Data||Structured Transactional Data||Structured Data for OLAP||Structured & Unstructured||Consolidated Structured Data from Internal & External Systems||Integrated & Cleansed Data from OLTP systems|
|Data Quality||Normalized & Consistent||De-normalized & Consistent||De-normalized or Normalized & Inconsistent||Normalized & Partial Subsets||Normalized & Cleansed with some inconsistency|
Relational database systems (RDMS) are traditionally used to store structured transactional data from applications, such as CRM, ERP, HR, manufacturing and financial applications. Since the inception of RDBMS in 1970, there have been many commercial and open-source variants created, such as, Oracle, Sybase, IBM DB2, Microsoft SQL Server, PostgreSQL, and MySQL to name a few.
The basic functionality of any RDBMS data repository system is the ability to create, read, update, and delete data collectively referred to as CRUD. Data is stored in row-based tables using normalization, primary keys, foreign keys, and constraints to ensure the reliability of the data. The relational structure used to store the data allows the access and identification of data in relation to another piece of data in the database. SQL (Structured Query Language) is the querying and programming language used to access and manipulate data stored in relational database systems.
The two key features of any RDBMS are data normalization and ACID (atomicity, consistency, isolation, and durability) compliancy. Normalizing data is the process of arranging related data in multiple tables in an intended and unambiguous manner eliminating data redundancy. ACID implementation in a RDBMS preserves and guarantees consistency of transaction at the database level. It enables the developers to develop enterprise software applications without having to consider the complexities of the data integrity in the backend RDBMS. Relational databases are well suited when security, accuracy, integrity, and consistency of data is a pre-requisite.
The term Data Warehouse originated in the late 80s largely based on the white paper “An architecture for a business and information system” by IBM researchers Barry Devlin and Paul Murphy. It is customary for organizations to use multiple applications (Financials, HR, CRM, Manufacturing, etc.,) having their own backend RDBMS. Eventually, the data from these different systems will have to be consolidated in a single database to support business intelligence (BI) activities. Business intelligence or business analytics is used to drive activities that include decision support, enterprise reporting, just-in-time marketing, and ad-hoc querying. Data Warehouse is essentially a single data repository for consolidating data from all the different operating systems either in normalized or de-normalized format. Usually, a data warehouse is a purpose-built relational database running on specialized hardware either on-premise or in the cloud. Some examples of on-premise data warehouse appliances include Teradata, Greenplum, IBM Netezza, Oracle DW Appliance, and Oracle Exadata Server. These appliances are a combination of specialized hardware and software optimized specifically to host data warehousing related workloads. In an era where everything revolves around the cloud, Snowflake, Google BigQuery, Microsoft Azure SQL Data Warehouse, and Amazon Redshift are some of the examples of cloud-based data warehouses. Cloud-based data warehouses offer organizations the benefit of scaling up on demand while not incurring capital expenditure on on-premise infrastructure and ongoing maintenance.
Ralph Kimball, the founder of The Kimball Group Data and a recognized leader in the DW/BI field once said, “…The data warehouse is nothing more than the union of all the data mart”. Data Mart is essentially a simplified and cheaper version of a data warehouse. While data marts are frequently confused with data warehouses, each of them serves markedly different purposes. Data marts typically contain targeted data for decision support based on the needs of a particular department, such as sales, marketing or finance. While data warehouses are used to make strategic decisions that might impact the entire organization, data marts are used to make tactical decisions that might only impact a specific business process or department. A data mart data repository can be built either using the data from a data warehouse or from other data sources. The data contained within a data mart is highly curated and might be either normalized or de-normalized based on the business needs. The most distinguishing characteristic of a data mart is the utilization of the star schema configuration. Star schemas consist of one or more fact tables referencing many dimension tables that resemble the shape of a star, hence the name.
The term data lake is relatively new and the interpretation of its function is that it is a garbage can for data. To clarify why I use the term garbage can, it’s necessary to understand that data lakes are essentially a data repository dumping ground for all types of data, both structured and unstructured in their native format. When an organization collects or receives a piece of data that cannot be properly cataloged and classified, it is normally dumped into a data lake either on a temporary or permanent basis. Data lakes do not have a set structure similar to a traditional database because the data stored in a data lake is in a raw or unrefined form, the purpose of which is yet to be determined. Data lakes have no pre-defined schema structure, they require little or no maintenance and are cheaper to setup. Data lakes are normally built on top of NoSQL (Bigdata) database, such as Apache Hadoop. NoSQL databases do not require a pre-defined schema and they do not have to adhere to ACID characteristics of a relational database. Data is stored as a schema-less key-value pair, schema, and data requirements are not defined until the data is actually queried. Data is split into shards across multiple nodes built using commodity hardware, which also provides fault-tolerance and redundancy. In a data lake that is built on the Hadoop platform, data is queried using map/reduce jobs. Open source or third-party addons such as Hive or HBase support SQL queries by converting SQL into map/reduce jobs. The data in a data lake is highly uncrated and might originate outside of a company’s operational systems such as in social media or mobile devices and is not suitable for business intelligence-related activities in its raw form. Amazon, Microsoft, Oracle, Teradata, MongoDB, and Cloudera are some of the vendors that market different variations of a data lake solution with proprietary data management add-ons.
Operational data stores are often confused with data warehouses as both of them are used for consolidating data from multiple enterprise operational systems. However, that is where the similarities end. While both the systems do store operational data, each of them stores the data differently and serves different purposes. An ODS is used to store detailed transactional data from different operational systems on a short-term basis. It can serve as some type of an operational system by itself or as an interim staging area before the data is cleansed, processed and loaded into a data warehouse. Operational data stores normally receive data on a continuous basis from other systems either through real-time data replication or via batch extract-transform-load processes. Data is normally stored in a denormalized format. Operational data stores are ideal for querying small datasets to satisfy real-time or near-real-time reporting or ad-hoc querying needs. Operational data stores are synonymously referred to as master data management systems (MDM) as they are used to create common sets of data that the company needs to conduct day-to-day business activities.
Now that I have explained the different types of data repositories and the differences between each of them, any one or all of these data stores could contain both structured and unstructured sensitive data. STEALTHbits Data Access Governance solutions can assist companies in discovering, classifying and securing sensitive data.
Sujith Kumar has over 25 years of professional experience in the IT industry. Sujith has been extensively involved in designing and delivering innovative solutions for the Fortune 500 companies in the United States and across the globe for disaster recovery and high availability preparedness initiatives. Recently after leaving Quest Software/Dell after 19 years of service he was working at Cirro, Inc. focusing on database management and security. His main focus and area of interest is anything data related.
Sujith has a Master of Science in engineering degree from Texas A&M University and a Bachelor of Science in engineering degree from Bangalore University and has published several articles in referred journals and delivered presentations at several events.
Proper data security begins with a strong foundation. Find out what you're standing on with a free deep-dive into the security of your Structured and Unstructured Data, Active Directory, and Windows infrastructure.Read more
Start a Free Stealthbits Trial!
No risk. No obligation.