document sanitization

What is document sanitization?

Document sanitization is the process of cleaning a document to ensure that only the intended information can be accessed from it. In addition to making sure the document text doesn't openly divulge anything it shouldn't, sanitization includes removing metadata that could pose a privacy or security risk.

Document sanitization, sometimes called file sanitization, also covers hidden content other than metadata. The content that is removed may include document properties, hazardous code such as malicious scripts or backdoors, and malware that hasn't previously been detected.

Figure: The properties of metadata. Document sanitization removes all hidden and sensitive content, such as metadata and code, from documents.

Sanitization is not the same as redaction. Where sanitization is about removing hidden data and metadata, redaction is about removing private or sensitive information that should only be available to a specific group of people. With sanitization, all hidden information is permanently removed so that the file can be safely passed on. With redaction, certain text gets permanently removed so it is no longer viewable or editable.

Despite these differences, the two activities work well together. Combining sanitization with redaction helps to better protect documents and their information from leaks and breaches.

The importance of document sanitization

Documents often include hidden content that may not have been detected previously. If hackers or cybercriminals are able to access this information, they may be able to use it to steal other types of sensitive data, such as passwords, personally identifiable information or financial information. They might also use the information to embarrass a firm or damage its reputation.

To minimize the risk of such incidents, it's important to find and remove all hidden content in documents, meaning any information that is not intended for distribution. Thorough sanitization helps ensure that potentially sensitive information is not exposed, whether inadvertently or maliciously, when the document is published or shared. It thus helps protect the organization from data breaches.

Document sanitization and metadata removal

Metadata is often described as "data about data." Different types of metadata provide additional information about a document, such as its author, history or version. Metadata can also contain the names of a document's modifiers, the dates of creation and changes, file size, digital signatures, revision histories, tracked changes, watermarks, headers, footers and comment exchanges among various authors and editors.

Metadata, which is usually not readily visible to the document's authors, viewers and editors, is important for document tracking, classification, analysis and management. It also improves collaboration among users sharing a file or collection of files. But it can also contain sensitive information that might embarrass or damage an organization, so it's important to safeguard metadata from unauthorized access. In practice, this means removing it from the document before the document is published, shared or circulated.
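
To make this concrete, the following is a minimal Python sketch of how such metadata can be inspected before a file is shared. It assumes an Office Open XML file (for example, .docx), which is a ZIP package that conventionally stores its core properties in docProps/core.xml. The function name, the sample package and the sample values are hypothetical and only illustrate the idea; this is not a complete metadata audit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Conventional location of core properties in Office Open XML packages.
CORE_PROPS_PATH = "docProps/core.xml"

# Hypothetical sample data, used only so the sketch runs on its own.
SAMPLE_CORE_XML = (
    "<cp:coreProperties"
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>Alice Example</dc:creator>"
    "<cp:lastModifiedBy>Bob Example</cp:lastModifiedBy>"
    "<cp:revision>7</cp:revision>"
    "</cp:coreProperties>"
)


def build_sample_package() -> bytes:
    """Build a tiny in-memory package that holds only core properties."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(CORE_PROPS_PATH, SAMPLE_CORE_XML)
    return buffer.getvalue()


def read_core_properties(package_bytes: bytes) -> dict:
    """Return the non-empty core properties found in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        if CORE_PROPS_PATH not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PROPS_PATH))
    properties = {}
    for element in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = element.tag.rsplit("}", 1)[-1]
        if element.text and element.text.strip():
            properties[name] = element.text.strip()
    return properties


if __name__ == "__main__":
    for name, value in read_core_properties(build_sample_package()).items():
        print(f"{name}: {value}")
```

Author names and revision counts of this kind are typical of what a recipient can read from a shared file even when nothing in the visible text reveals them.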

Figure: Common metadata standards. Metadata standards promote uniformity in the shared language, format, spelling and other aspects used to describe data. Each standard is based on a specific schema that provides an overarching structure for all its metadata.

How to sanitize a document

A common way to remove metadata and other hidden information from a document is to convert it to PDF format and then follow the sanitization procedure provided by the specific PDF application used.

For example, the procedure for sanitizing PDF documents in Adobe Acrobat is as follows:

  • Access the Redact top menu and select Sanitize Document.
  • To remove all hidden information, select OK.
  • To remove specific pieces of hidden information, select Click Here.

Non-PDF documents can also be sanitized. Microsoft Excel and Word, for example, have a built-in Document Inspector feature that discovers and removes hidden content such as metadata.
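
Metadata can also be removed programmatically. The sketch below is one possible approach, not the method used by any Office feature: it assumes an Office Open XML file whose core properties live in docProps/core.xml, copies every part of the package unchanged and empties only the core properties part. The function name is hypothetical, and other hidden content (comments, tracked changes, extended and custom properties) is deliberately left untouched here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PROPS_PATH = "docProps/core.xml"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"

# Keep the usual "cp" prefix when the XML is written back out.
ET.register_namespace("cp", CP_NS)


def strip_core_properties(package_bytes: bytes) -> bytes:
    """Return a copy of the package with all core properties removed."""
    source = zipfile.ZipFile(io.BytesIO(package_bytes))
    output = io.BytesIO()
    with source, zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == CORE_PROPS_PATH:
                root = ET.fromstring(data)
                # Drop author, dates, revision number and similar entries,
                # leaving an empty but well-formed properties element.
                for child in list(root):
                    root.remove(child)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other part is copied byte for byte.
            target.writestr(item.filename, data)
    return output.getvalue()
```

Working on a copy and writing a new package, rather than editing the file in place, keeps the original available as a backup if the result needs to be checked or redone.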

The National Security Agency provides recommendations for sanitizing Word documents. In 2005, the Agency published Redacting with Confidence: How to Safely Publish Sanitized Reports Converted from Word to PDF, highlighting a seven-step process to safely sanitize Word documents:

  1. Create a copy of the original document. All edits should be made to the copied version only. The original should be retained as-is as a backup.
  2. Turn off track changes, comments and other visible markups on the copy. Review and remove all sensitive content.
  3. Rename the document.
  4. Review the document to ensure that all material to be redacted has been deleted and, wherever necessary, replaced with innocuous filler (e.g., empty shapes to replace sensitive images).
  5. Open a new blank document and copy data from the document copy into the new document.
  6. Convert the new Word document to PDF format.
  7. Review the PDF for any missed redactions.
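
The review steps above can be partly supported by a script. The following Python sketch is an illustration only and is not part of the NSA guidance: it assumes the working copy is a .docx file and counts leftover tracked changes, comment references and hidden text in word/document.xml before the file is converted to PDF. The function name and labels are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by the main document part.
W_PREFIX = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Element names that indicate markup a reviewer would want removed.
MARKUP_LABELS = {
    "ins": "tracked insertion",
    "del": "tracked deletion",
    "commentReference": "comment",
    "vanish": "hidden text",
}


def find_hidden_markup(docx_bytes: bytes) -> dict:
    """Count leftover revision, comment and hidden-text markup."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    findings = Counter()
    for element in root.iter():
        if not element.tag.startswith(W_PREFIX):
            continue
        local_name = element.tag[len(W_PREFIX):]
        label = MARKUP_LABELS.get(local_name)
        if label is None:
            continue
        # <w:vanish w:val="0"/> switches hidden text off, so skip it.
        if local_name == "vanish" and element.get(W_PREFIX + "val") in ("0", "false"):
            continue
        findings[label] += 1
    return dict(findings)
```

An empty result does not prove that a document is clean. It only means that these particular kinds of markup were not found, so the final manual review of the PDF is still needed.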

Automated document sanitization

Manual document sanitization can be a somewhat complex process, with the possibility of misses and errors. Automation can reduce errors and help ensure more thorough sanitization and better protection for the document and the organization.

Automated sanitization software uses algorithms to detect terms and term combinations in a document that might disclose sensitive or confidential information. Users of the sanitization application define which topics are deemed sensitive. The matching terms are then redacted from the document. If the user predefines privacy requirements, the software can also generalize the risky terms in line with those requirements, replacing them with less specific wording instead of removing them.
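
The difference between redacting and generalizing a term can be shown with a short Python sketch. It is a deliberately simple stand-in that uses exact, case-insensitive matching against a user-supplied list; commercial tools rely on more sophisticated detection. The function name, the mask text and the sample terms are all hypothetical.

```python
import re


def sanitize_terms(text, sensitive_terms, generalizations=None, mask="[REDACTED]"):
    """Redact sensitive terms, or generalize them where a substitute is given.

    sensitive_terms: terms the user has declared sensitive.
    generalizations: optional mapping from a term to a less specific phrase.
    """
    if not sensitive_terms:
        return text
    lookup = {term.lower(): sub for term, sub in (generalizations or {}).items()}
    # Try longer terms first so a long phrase wins over a shorter one
    # that it contains.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    # \b keeps matches on word boundaries; this assumes each term starts
    # and ends with a letter or digit.
    pattern = re.compile(
        "|".join(r"\b" + re.escape(term) + r"\b" for term in ordered),
        re.IGNORECASE,
    )

    def replace(match):
        # Generalize when a substitute exists, otherwise redact.
        return lookup.get(match.group(0).lower(), mask)

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(
        sanitize_terms(
            "Project Falcon ships to Acme Corp in March.",
            sensitive_terms=["Project Falcon", "Acme Corp"],
            generalizations={"Acme Corp": "a customer"},
        )
    )
```

In this sketch, a term with a generalization is replaced by the broader phrase, which keeps the sentence readable, while a term without one is masked entirely.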

Effective sanitization applications can sanitize documents of different formats, including Word, PDF, Excel and PowerPoint. By removing hidden data, these products help to safeguard information from leaking outside the organization. In this sense, they can be considered important for data loss prevention.

An effective data sanitization process lessens the chance that an organization's valuable data could be stolen or compromised, and it supports compliance.

This was last updated in May 2024
