In the File Systems Data Collector for StealthAUDIT, we collect various types of information about files and folders, including permissions, file size, activity data, sensitive data, and more. One of the most important aspects of a file system resource (file, folder, or share) is “does that resource still exist?” While this might seem on the surface like one of the easiest things to collect, a range of complicating factors limited the accuracy with which we could report this information. With the release of StealthAUDIT 10.0, we vastly improved the accuracy of our reporting on deleted items in file systems collection.
The initial implementation of the File System Data Collector had a single scan job that could be run to collect permissions information. To optimize collection and keep data consistent, we utilized an Update Sequence Number, or USN. Each complete scan had a USN that incremented by one from the previous scan, and each resource record carried a USN value as well. When a resource was newly discovered or some aspect of its information changed, its USN would be set to the USN of the current scan. Roughly speaking, the USN value for a resource is the number of the scan in which the resource last changed.
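As a rough sketch of that bookkeeping (illustrative T-SQL with hypothetical table and column names, not the actual StealthAUDIT schema), each new scan takes a USN one higher than the previous scan and stamps it onto any resource it discovers or finds changed:

    -- Hypothetical tables for illustration only.
    -- Each complete scan gets a USN one higher than the previous scan.
    DECLARE @CurrentUSN bigint;
    SELECT @CurrentUSN = ISNULL(MAX(ScanUSN), 0) + 1 FROM ScanHistory;
    INSERT INTO ScanHistory (ScanUSN, ScanStart) VALUES (@CurrentUSN, SYSUTCDATETIME());

    -- @ResourceId stands in for a resource that was just discovered or changed.
    -- Stamping it with the current scan's USN means a resource's USN is effectively
    -- the number of the scan in which it last changed.
    UPDATE Resources
    SET    USN = @CurrentUSN
    WHERE  ResourceId = @ResourceId;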
To deal with deleted resources in permission scans, the solution was relatively simple: in addition to the USN column, add a DeletedUSN column. The DeletedUSN value is the USN of the scan in which a previously existing resource disappeared. This is easy to detect because when we scan a folder in a file system, we enumerate the folder’s contents completely, so we can compare the folder’s contents from previous scans with its current contents. Any resource that previously existed but is no longer present gets a DeletedUSN equal to the current scan’s USN; any resource that still exists keeps a DeletedUSN of NULL. All a query has to do to filter deleted resources out of a result set is add a simple “WHERE DeletedUSN IS NULL” clause.
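A sketch of that logic against the same illustrative schema: after a folder’s contents have been enumerated, anything previously tracked under that folder but not seen in the current scan gets stamped with the current scan’s USN as its DeletedUSN, and reporting queries simply filter on that column.

    -- Illustrative only: @FolderId is the folder just enumerated, and
    -- CurrentScanResults holds the resources seen in the current scan.
    UPDATE r
    SET    r.DeletedUSN = @CurrentUSN
    FROM   Resources AS r
    WHERE  r.ParentFolderId = @FolderId
      AND  r.DeletedUSN IS NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   CurrentScanResults AS c
                       WHERE  c.ResourceId = r.ResourceId);

    -- Reporting: filter out anything that has ever been marked deleted.
    SELECT ResourceId, ResourcePath
    FROM   Resources
    WHERE  DeletedUSN IS NULL;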
The growth of the data collector led to two additional scan types: activity scans and sensitive data scans. Since these scans served different use cases, it made sense to keep them separate from each other. This separation led to each scan type having its own database in which it keeps track of the resources pertinent to that scan type. Later, the information from this intermediate “Tier 2” database set is imported into the main SQL Server database so the user can generate reports on the collected data.
However, this “scan-type specific” accounting of resources posed a problem for the DeletedUSN model of tracking deleted resources. A scan marks a resource as deleted when it no longer sees a resource it has seen before, but what if one scan type sees the resource and another does not? How do you reconcile the case where one scan thinks a resource is deleted but another does not? For example, an activity scan may have an event indicating that a file was deleted on a given date, while a permissions scan found the resource alive and well during the last permissions scan. That framework gave us no way to reconcile the conflict and arrive at the truth.
The simple answer was to limit the scope of marking things as deleted to permissions scans. This allowed us to say that a resource is marked as deleted only when a permissions scan notices the resource disappearing; otherwise, we assume the resource still exists. This posed a problem of its own: what if the permissions scan never knew the resource existed in the first place? If a resource was picked up by an activity or sensitive data scan but never seen by a permissions scan, the permissions scan could never mark it as deleted because it has no knowledge of the resource in its dedicated dataset. Another solution we considered was to get rid of the independent datasets for each scan type and allow each one to see the others’ data. However, independent resource data was a core assumption of the scanning model; removing it would be a large undertaking and would eliminate concurrent scans of different types – an important use case for customers.
The solution we arrived at was to record more granular data about when each scan type either saw a resource or noticed that it was missing or deleted. Rather than relying on a single DeletedUSN value, we can use these timestamps to determine which observation is the most recent. The new relevant columns in the SQL database look like this:
AccessLastSeen | AccessLastDeleted | ActivityLastSeen | ActivityLastDeleted | DLPLastSeen | DLPLastDeleted
01-01-20 10:00 | NULL              | 01-01-20 9:00    | 01-02-20 10:00      | NULL        | NULL
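Concretely, under the new model each resource record carries a nullable pair of timestamps per scan type, along these lines (an illustrative T-SQL sketch, not the shipped schema):

    -- Illustrative schema only; actual StealthAUDIT tables and types may differ.
    CREATE TABLE Resources (
        ResourceId          bigint        NOT NULL PRIMARY KEY,
        ResourcePath        nvarchar(400) NOT NULL,
        AccessLastSeen      datetime2     NULL,  -- last time a permissions scan saw the resource
        AccessLastDeleted   datetime2     NULL,  -- last time a permissions scan noticed it missing
        ActivityLastSeen    datetime2     NULL,  -- last time an activity scan saw the resource
        ActivityLastDeleted datetime2     NULL,  -- last activity event reporting the resource deleted
        DLPLastSeen         datetime2     NULL,  -- last time a sensitive data scan saw the resource
        DLPLastDeleted      datetime2     NULL   -- last time a sensitive data scan noticed it missing
    );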
Using all this information, we can determine the most up-to-date status we have for this resource. An access (permissions) scan last saw it on January 1st at 10:00AM and has not noticed it missing. An activity scan last saw it on January 1st at 9:00AM but noticed that it was deleted as of January 2nd at 10:00AM. DLP (sensitive data) scans have never encountered this resource either way, so they cannot add to the picture. Combining all of this information, we can clearly convey that our most up-to-date observation is from January 2nd at 10:00AM, and as of then, the resource was deleted.
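One way to express that reconciliation in a reporting query (again a sketch against the illustrative columns above, not the shipped query) is to take the latest “seen” timestamp and the latest “deleted” timestamp across all three scan types and compare them:

    -- Illustrative T-SQL: a resource is treated as deleted only if the most recent
    -- "deleted" observation is newer than the most recent "seen" observation.
    SELECT r.ResourceId,
           CASE WHEN d.LastDeleted IS NOT NULL
                 AND (s.LastSeen IS NULL OR d.LastDeleted > s.LastSeen)
                THEN 1 ELSE 0
           END AS IsDeleted
    FROM   Resources AS r
    CROSS APPLY (SELECT MAX(v) AS LastSeen
                 FROM (VALUES (r.AccessLastSeen), (r.ActivityLastSeen), (r.DLPLastSeen)) AS seen(v)) AS s
    CROSS APPLY (SELECT MAX(v) AS LastDeleted
                 FROM (VALUES (r.AccessLastDeleted), (r.ActivityLastDeleted), (r.DLPLastDeleted)) AS del(v)) AS d;

For the example row above, the latest “seen” value is January 1st at 10:00AM and the latest “deleted” value is January 2nd at 10:00AM, so the query reports the resource as deleted.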
Michael Rubacky is a Software Developer on the File Systems for StealthAUDIT team at Stealthbits and has been with the company since 2015. He is a graduate of Fordham University and at Stealthbits has been focusing on Sensitive Data Discovery. In his free time, he enjoys rock climbing, backpacking, and the general outdoors.