
What is Data Classification? Justifying a Data Classification Policy


Data Classification Definition

Data Classification is the process of categorizing datasets (e.g., files, databases) into logical groupings that make the nature of their contents instantly understandable at a glance, without inspecting the data itself.

In other words, it’s labeling. As a practical example, instead of having to open a file and examine its content to determine there’s sensitive data inside, a classification could be applied to one of the file’s available metadata tags to denote that the contents are “Sensitive” or contain data subject to “HIPAA” compliance.


However, the concept of Data Classification can sometimes become confusing. This is generally due to the term not being standardized in the space.

In practice, data classification usually evokes one of two use-cases: determining what type of information is in a piece of data or marking/tagging a piece of data based on content determination. Both of these are important in the overall data governance plan within an organization for different reasons.

Data Classification as Identification

Frequently, data classification is associated with identifying what type of information a piece of data contains. This usually falls into two different areas:

  • Classifying based on sensitivity
  • Classifying based on content

Classifying Based on Sensitivity

This approach is used especially in highly regulated industries, or in organizations subject to a specific form of compliance. Sensitivity classifications generally rank data by audience, for instance determining whether something is confidential, top secret, etc. This is accomplished by identifying the pieces of information that signal a classification type, whether that’s intellectual property or the personal, private, or protected information of an individual or the organization.

This method can be complex, but it scales and automates well. When patterns, words, and phrases can be defined, regular expressions can find most matching data points in a data set, and validation methods layered on top add an extra level of confidence to the analysis.
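
As a minimal sketch of this pattern-plus-validation approach, the example below pairs a card-number-like regular expression with the Luhn checksum; the pattern and threshold are illustrative assumptions, not a production rule set.

```python
import re

# Hypothetical pattern: 13-16 digits, optionally separated by spaces or dashes.
# Regexes alone produce false positives; a validation step such as the Luhn
# checksum adds confidence that a match is a plausible card number.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return regex matches that also pass validation."""
    return [m.group(0) for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group(0))]
```

The same structure generalizes to other data types: a broad pattern to recall candidates, then a cheap validator to filter out noise.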

Classifying Based on Content

Another option is determining the theme of content in a data set and using that to mark a file. For instance, a mortgage application may fall into the category of “Finance”, while an offer letter may fall under “Human Resources”. This is a more complex way to classify data than simple regular expressions. It requires a heavy human component: classifying content as it is created, or evaluating it at a later point. Natural language processing and machine learning technologies can help with this, but overall it is likely to remain a largely manual process.
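
To make the idea concrete, here is an illustrative keyword-scoring sketch of theme-based classification. Real systems use NLP or machine learning; the theme names and keyword lists below are assumptions for the example only.

```python
# Hypothetical theme vocabulary; a real deployment would be trained or curated.
THEMES = {
    "Finance": {"mortgage", "loan", "interest", "escrow", "apr"},
    "Human Resources": {"offer", "salary", "benefits", "candidate", "onboarding"},
}

def classify_theme(text: str) -> str:
    """Pick the theme whose keywords best overlap the document's words."""
    words = set(text.lower().split())
    scores = {theme: len(words & keywords) for theme, keywords in THEMES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unclassified"
```

Even this toy version shows why theme classification is harder than pattern matching: the vocabulary has to be maintained per organization, and ambiguous documents still need a human decision.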

Data Classification as Marking

As part of a true data governance strategy within an organization, files that are deemed sensitive or categorized will benefit greatly from some level of marking of that content. This can take form in a few different ways:

  • Document Tagging
  • Hierarchical Taxonomy
  • Extended File Metadata

Document Tagging

In this situation, an actual tag is placed directly on the file. Microsoft Office files, for example, support this type of tagging natively: the tag is stored as metadata within the Office file’s own package (envelope). Tags added in this manner can be searched programmatically or consumed directly by a Data Loss Prevention (DLP) solution for the implementation of Rights Management. Tagging can be done manually in the file itself or in bulk, programmatically, with a product (such as StealthAUDIT) or from an interface like PowerShell.
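
The bulk-tagging workflow can be sketched as follows. For illustration only, this hypothetical version writes tags to a sidecar JSON file next to each document; real tooling (such as StealthAUDIT or a PowerShell script) writes the tag into the Office file’s own property envelope instead.

```python
import json
from pathlib import Path

def tag_files(directory: str, tag: str, extensions=(".docx", ".xlsx")) -> int:
    """Apply `tag` to every matching file under `directory`; return the count.

    Sidecar files are an assumption for this sketch, standing in for
    writing into the document's native metadata.
    """
    tagged = 0
    for path in Path(directory).rglob("*"):
        if path.suffix.lower() in extensions:
            sidecar = Path(str(path) + ".tags.json")
            tags = json.loads(sidecar.read_text()) if sidecar.exists() else []
            if tag not in tags:
                tags.append(tag)
                sidecar.write_text(json.dumps(tags))
            tagged += 1
    return tagged
```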

Hierarchical Taxonomy

One of the weakest, yet most common, methods of marking a file as a certain type is a hierarchical taxonomy, most often (but not always) structured as folders. In this instance, there is usually a high-level structure separating content (by department, by date, by sensitivity, etc.) with several subsections. This frequently leads to content overlapping across multiple locations, or failing to overlap where it should, which makes it harder to locate content when it becomes necessary to secure or remove it. This method is probably the most common because it was the first real option in file systems and is easy to implement, but it is also the least useful from a classification perspective.

Extended File Metadata

Many modern collaboration platforms offer the ability to add additional metadata to content within the platform without changing the file itself. SharePoint is an excellent example of this, and others exist such as Box, Dropbox, and Google Drive. These platforms provide options to attach an extra layer of metadata to a file, improving searchability and classification, as well as keying into other features of the platform.
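
Conceptually, platform-side metadata works like the sketch below (as with SharePoint columns): the tags live alongside the file in the platform rather than inside the file itself, which makes metadata-driven search straightforward. The document list and column names here are hypothetical.

```python
# Hypothetical document library with platform-managed metadata columns.
documents = [
    {"name": "q3-report.xlsx",
     "metadata": {"Classification": "Confidential", "Department": "Finance"}},
    {"name": "lunch-menu.docx",
     "metadata": {"Classification": "Public"}},
]

def search_by_metadata(docs, **criteria):
    """Return names of documents whose metadata matches every criterion."""
    return [d["name"] for d in docs
            if all(d["metadata"].get(k) == v for k, v in criteria.items())]
```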

Sounds Great! So, Why Hasn’t Everybody Implemented a Data Classification System?

Based upon this information, one could surely think of a slew of reasons why classifying data would be important and beneficial. However, real-world adoption of the practice has been underwhelming, at best. Among other valid reasons, failure to implement a successful Data Classification program has largely been attributed to a refusal to participate in the process by the data creator/owner. Data Classification solutions have traditionally focused on classifying data at the time of creation, thus requiring a fundamental change in business processes, which users typically resist.

Additionally, traditional Data Classification solutions have ignored the troves of files that already exist within the environment (many terabytes in most medium and large businesses). This has created a proverbial “line in the sand” between new and old data, of which new data represents only a very small minority with respect to the overall amount of files.

So to summarize…

Process Change + Minimal Coverage = Non-Starter

Why Should You Classify Your Data?

But that was then, and this is now. Solutions now exist to examine and classify all of that legacy data automatically. There have also been drastic improvements made to the “classification-at-creation” process that make it far less arduous for the user to do what’s right for the organization without changing their behavior significantly. And it’s just in the nick of time, because what organization doesn’t want to do the following?

  • Protect sensitive data from theft
  • Move to the cloud
  • Adopt Behavioral Analytics solutions that cut down the noise and produce meaningful alerts with context

How is any of this ever going to work if you don’t know what’s inside these files and don’t provide a means by which other technologies can understand them as well? How can you protect your most sensitive data if you don’t know where it is? Conversely, how can you move all the “non-critical” data to the cloud if you don’t know which files are indeed “non-critical”? How will you be able to take advantage of the advanced capabilities of a User and Entity Behavior Analytics (UEBA) solution to effectively risk-rank your threats when it has no idea which files put you at risk in the first place?

Everyone agrees that you can’t manage what you don’t know about, so denying the necessity to undertake data discovery and classification is really just denying the successful implementation and completion of any number of critical initiatives now and in the future.

5 Tips to Help Justify a Data Classification Policy


If you’re convinced of (or at least curious about) the concept, here are some tips to get others on board with the program.

Tip #1 – Use Sensitive Data Findings to Demonstrate Security Risk

There is perhaps no more effective way to justify the need for a data classification policy than to demonstrate how little visibility there is into what’s inside the files you already have. Pick a small handful of file shares and scan them with a sensitive data discovery tool. It’s likely to light up like a Christmas tree with sensitive data that could be much more easily tracked and secured if it was tagged properly. Also, be sure to highlight how many individuals have access to the data and what level of permission they have.

Tip #2 – Tie Business-critical Data Exposure to Ransomware

Ransomware poses a persistent and serious threat to file system data, and the costs associated with a successful ransomware attack extend far beyond the ransom itself (e.g. backup software and storage, manpower, lost productivity, etc.). Since not all data is created equal, it’s particularly important to know the whereabouts of your most critical files in order to make proper decisions about how to recover from a ransomware scenario. Is it possible you’d handle the situation differently if you knew there was (or was not) critical data at risk?

Tip #3 – Use Data Classification to Plug Gaps in Strategic Programs

Classifying data has immediate and substantial impact on a variety of strategic programs within any organization, from Data Loss Prevention (DLP) to cloud migration initiatives. Here are just a few examples of how data classification increases value in existing investments and accelerates high profile programs:

  • Data Classification and DLP – Classifying files with clear, consistent metadata tags makes DLP’s job much less complicated. Determining whether a file can leave or move around an organization becomes a binary decision when the file’s contents are already known. If you leave it all up to DLP, data will continue to slip through the cracks.
  • Data Classification and Cloud Migration – One of the biggest hindrances to cloud adoption, and all the efficiencies and cost savings that come with it, is the fear of losing control of sensitive data. If your files are classified, it becomes a simple decision as to what will stay and what can go.
  • Data Classification and UEBA – The whole promise of User and Entity Behavior Analytics (UEBA) revolves around machines being able to piece together large sums of activity data with context that makes that activity data meaningful. If the machine has no understanding of which files matter and which don’t (in addition to where they’re located, etc.), the usefulness of an otherwise groundbreaking technology is drastically diminished.
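
The DLP point above can be sketched in a few lines: when files already carry classification tags, an egress decision reduces to a set lookup instead of on-the-fly content inspection. The tag names below are assumptions for the example.

```python
# Hypothetical policy: any file carrying one of these tags may not leave.
BLOCKED_TAGS = {"Sensitive", "HIPAA", "PCI"}

def may_leave_org(file_tags):
    """Binary egress decision: True only if no blocked tag is present."""
    return not (set(file_tags) & BLOCKED_TAGS)
```

The same lookup answers the cloud-migration question ("which files can go?") and gives a UEBA engine the context it needs to weight alerts.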

Tip #4 – Automate Compliance Fulfillment

With a slew of compliance regulations like EU GDPR, CCPA, HIPAA, SOX, and PCI-DSS to adhere to, knowing which files contain data subject to any standard, where they are, who has access to them, and who is interacting with them could be made much simpler by incorporating proper data classification into the mix. Knowing where the data “should” be is ultimately inconsequential when the data is assuredly contained in places it should not be. Proactively identifying data contained in the wrong location with simple file attribute searches is exponentially easier and faster than continually scanning file contents to achieve the same result.
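
As a sketch of that attribute-search approach: with tags in place, finding regulated data outside its approved location is a cheap lookup rather than a content re-scan. The approved path and tag name below are hypothetical.

```python
# Hypothetical approved storage locations for GDPR-regulated data.
APPROVED_LOCATIONS = ("/secure/gdpr",)

def misplaced_gdpr_files(files):
    """files: [{"path": ..., "tags": set(...)}, ...] -> paths in the wrong place."""
    return [f["path"] for f in files
            if "GDPR" in f["tags"]
            and not f["path"].startswith(APPROVED_LOCATIONS)]
```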

Tip #5 – Reclaim Valuable Data Storage Space

The fact that every organization hoards data is only exacerbated by the myth that “storage is cheap”. In reality, storage is far from cheap and accounts for hundreds of thousands – and in many cases, millions – of dollars spent every year by an individual organization.

Part of the reason so many organizations opt to hold on to stale data that likely provides no meaningful business value is the fear of deleting data that needs to be retained for legal purposes. Data Classification makes it simple to identify stale files that contain no data subject to a retention hold, enabling organizations to confidently delete or retire troves of data that no longer need to consume valuable, costly storage space. The cost savings from storage reclamation alone are likely to justify the cost of a data classification program many times over.
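
The stale-data logic can be sketched as a simple filter: a file is a deletion candidate when it has not been modified in some window and carries no retention-related tag. The tag names and file-record shape below are assumptions for the example.

```python
import time

# Hypothetical tags that place a file under a retention hold.
RETENTION_TAGS = {"Legal Hold", "HIPAA"}

def stale_candidates(files, max_age_days=730, now=None):
    """files: [{"path": ..., "mtime": epoch_seconds, "tags": set(...)}, ...]

    Return paths of files old enough to retire and free of retention tags.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [f["path"] for f in files
            if f["mtime"] < cutoff and not (f["tags"] & RETENTION_TAGS)]
```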

How STEALTHbits Approaches Data Classification


Stealthbits’ Data Classification Software not only identifies where your most sensitive data lives, but who has access to it and how, who is accessing it, and what they’re doing with it across file systems, SharePoint, cloud repositories, Exchange, SQL and Oracle databases, and more. This is accomplished in four stages:

  • Classify sensitive data
  • Find sensitive data
  • Search sensitive data
  • Defend & monitor sensitive data

Classify Sensitive Data

Stealthbits’ sensitive data criteria editor is flexible and robust, providing direct access to hundreds of preconfigured criteria sets that can be copied or customized, as well as the ability to create new criteria from scratch to accommodate keywords and patterns specific to your organization.

After sensitive data has been identified, automated tagging and integration with alternative solution providers are available to keep your classifications persistent.


Find Sensitive Data

Sensitive data can exist virtually anywhere in an organization – in shared repositories like file shares and SharePoint sites, O365, Dropbox accounts, Exchange mailboxes, SQL and Oracle databases, and even desktop and server infrastructure.

With the ability to scan the contents of over 400 different file types, as well as databases, Stealthbits provides the broadest visibility into all the most common places sensitive data exists.


Search Sensitive Data

When responding to Data Subject Access Requests (DSARs) or needing to know where a particular type of sensitive data exists, Stealthbits makes searching your sensitive data results as easy as using your favorite internet search engine.

Find everywhere your match exists across all scanned repositories along with who has access, how, and who has accessed the data.


Defend & Monitor Sensitive Data

With an understanding of all files containing sensitive information, the ability to monitor all file activity in real time, and unsupervised machine learning algorithms that learn what normal and abnormal user behavior looks like, Stealthbits can easily spot unusual or nefarious sensitive data access and stop attacks in their tracks.


Is Data Classification Worth the Effort?

Is Data Classification worth the effort? You be the judge. Are you concerned about security and compliance? Would you like to increase the ROI of multiple technology investments you or your organization have made? Do you think there is untapped potential and intelligence to be derived from the terabytes of data withering away in your environment already?

Like anything else that’s worth doing, Data Classification requires some tough decision-making and planning. It’s important to keep in mind, however, that every compliance standard, every breach, and virtually every initiative comes down to the same thing: your data. Knowing your data is knowing what to do with it.

© 2022 Stealthbits Technologies, Inc.
