Classification: that essential aspect of data protection

Data is the black gold of the 21st century: an observation that will hardly surprise you. Nor will the fact that data is ever more exposed (through the increasing use of APIs, SaaS applications such as Office 365 and Salesforce, Shadow IT, etc.) and therefore at greater risk.

The question is no longer whether data can leak (intentionally or not) and be misappropriated, but rather, how to secure it, and limit the impact when it does leak.

Against a backdrop like this, security models need to evolve. The castle model is now largely outdated, and is being replaced by that of the airport. Data-centric protection then becomes an imperative. And such protection also has to meet the daily needs of those same users who worry about being affected.

 

The two different types of data… and the different approaches they require

The large data-protection projects launched by major players all face the same problem: how to decide how sensitive a given piece of information actually is. The answer to this question is key: it’s this that determines the relevant level of protection needed to avoid data leakage.

Today, there are two broad types of data:

  • Structured data, which refers to all information that follows a particular format and is easily identifiable as such: a CRM field, a social security number, an official certificate, an email address, or any other data that can be expressed in a clearly defined format, typically one that a regular expression can describe (1). This information is usually found in application databases.
  • Unstructured data, which can exist in any format (such as MS Office documents, PDFs, images, videos, music, business application files, etc.). It should be noted that data which, at first glance, might be considered structured (for example, the telephone field of a CRM), may not be so if the format in which the data is entered is not followed strictly.
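
As an illustration of what "a clearly defined format" means in practice, here is a minimal Python sketch that checks a value against the three French phone-number formats mentioned in note (1). The pattern and the function name `is_french_phone` are assumptions made for this example, not taken from any particular tool.

```python
import re

# Assumed pattern: the national prefix 0, or the international prefix
# +33 or 0033, followed by nine digits, the first of which is not zero.
FRENCH_PHONE = re.compile(r"(?:0|\+33|0033)[1-9]\d{8}")


def is_french_phone(value: str) -> bool:
    """Return True only if the whole string matches one of the three formats."""
    return FRENCH_PHONE.fullmatch(value) is not None


for candidate in ["0123456789", "+33123456789", "0033123456789", "01 23 45 67 89"]:
    print(candidate, is_french_phone(candidate))
```

The last candidate is a perfectly valid phone number for a human reader, but it fails the check because of the spaces. This is the situation described above: a field that looks structured stops being so as soon as the expected format is not followed strictly.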

Structured data can be easily identified, and its sensitivity assessed according to predefined norms. Unstructured data presents a problem of a whole different magnitude, and it is mostly this type of data that employees generate day to day. In concrete terms, this means that the relevant security tools (such as Data Loss Prevention, or DLP) are unable to identify a leak or the misappropriation of vital information.

The classification of unstructured data, then, represents the cornerstone of any data protection strategy—and it’s something that has to be done manually by end users.

 

But what is classification?

“Data classification” means the entirety of the technical and organizational processes used to categorize information produced by the employees of an organization. Following the categories defined – according to levels of sensitivity (for example, internal, confidential, secret, etc.) or by relevant organizational functions (such as HR, R&D, Purchasing, etc.) – classification allows data to be placed within the appropriate regulatory, legislative, or security framework.

Historically, classification was very basic (for example, a checkbox in a header or on the first page of a document, or the manual addition of metadata). Today, it consolidates data and makes users accountable by placing them at the center of the process while offering them an improved experience (a simple interface and clear advice).

In practice, classification tools offer a diverse range of functionality:

  • For new files, either manual or automatically determined classification according to predefined rules (for example, the presence of a certain number of social security numbers);
  • For existing files, the manual scanning of files stored in local directories or on premises, according to predefined rules;
  • The addition of metadata (or tagging) to the file: this metadata, which can be interpreted by third-party tools, unlocks visibility for supervisory tools such as Data Loss Prevention;
  • The addition of visual marking elements (such as headers, footers, and watermarks) to raise awareness among end users.
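
To make the idea of automatic classification by predefined rules more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the simplified social-security-number pattern (15 consecutive digits starting with 1 or 2), the threshold of three matches, the level names, and the shape of the returned metadata. Real tools use their own rules and tag formats.

```python
import re

# Hypothetical, simplified pattern: 15 consecutive digits starting with 1 or 2.
# Real social security number formats vary by country and include check rules.
SSN_PATTERN = re.compile(r"\b[12]\d{14}\b")

# Hypothetical rule: three or more matches make the content "confidential".
SSN_THRESHOLD = 3


def classify(text: str) -> dict:
    """Return illustrative metadata: a classification level and the match count."""
    hits = len(SSN_PATTERN.findall(text))
    level = "confidential" if hits >= SSN_THRESHOLD else "internal"
    return {"classification": level, "ssn_matches": hits}


sample = "Payroll export: 180017512345678, 290027512345678, 175039912345678"
print(classify(sample))
```

The returned dictionary stands in for the metadata a classification tool would attach to the file, which a supervisory tool such as DLP could then read instead of rescanning the content.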

 

The results of classification projects have been inconclusive so far

CISO policies generally address data classification, and the issue is core to most major corporations' security policies. This imperative is reinforced by recent regulations such as the GDPR or the French Military Programming Act (LPM), which require the mapping of data and its uses. However, few organizations, other than banks, have successfully implemented effective classification strategies.

There are several reasons for this gap:

  • End users are generally not aware of the nature of sensitive data or of its impact. The highest classification levels ("C4", "Secret", "Confidential", etc.) are used for documents likely to put companies, or even entire Groups, at risk, yet these usually represent about 1% of all information, although the proportion is close to 10% in some companies. Conversely, it is not uncommon for a user to share files containing sensitive personal data, or passwords, without any classification or protection.
    Thus, any data-classification project requires strong change-management support for end users. This should use clear messages and concrete examples that allow users to classify information easily. Periodic recaps will also be needed to remind users what constitutes good practice: a user who handles sensitive data day in, day out may no longer be aware of the impact of this data being compromised.
  • If they fail to provide users with sufficiently ergonomic approaches, companies cannot expect solid results. Experience shows that checkboxes for classification levels on cover pages, headers, or footers are only rarely selected.
  • The classification of the entirety of a company’s data is a transformation project in its own right and requires strong commitment from functional and corporate teams if it is to be widely delivered. This commitment must be even greater if the classification strategy that has been defined impacts users (through obligations to classify documents, use encryption, etc.).

 

Classification takes center stage again

The topic is back in force. Large corporates are driving it through digital transformation programs that require data protection to be rethought, and the large players in the market are shaping their offerings around the subject. Some analysts, like Gartner, even foresee the consolidation of data-protection solutions into a single, classification-centric solution.

For such approaches to succeed, and for end users to buy into the process, awareness and ergonomics will need to work hand in glove.

 

In a future article, we’ll be looking at how the market is evolving for historical security players, and how the implementation of an effective classification strategy can provide a springboard for new impetus in data protection.  

 

(1) A regular expression is a pattern that describes the set of character strings matching a specific syntax. For example, a French phone number can have one of three formats: 0123456789, +33123456789 or 0033123456789.
