Cato Networks Knowledge Base

Configuring User Defined Data Types for DLP


This article explains how to create custom data types to identify sensitive data in your organization for the DLP policy.

Overview of User Defined Data Types

Cato provides hundreds of predefined, out-of-the-box data types and categories for typical DLP policy scenarios. However, some organizations need custom data types to inspect for specific data that the predefined types don't cover.

You can define custom data types that use keywords and dictionaries to customize content inspection for your DLP policies. A keyword is a single word or phrase that the DLP engine searches for. A dictionary is a container that holds up to 50 words or phrases, and the DLP engine matches on any single item in the dictionary.

Regex data types let you enter regular expressions that define the content that the DLP engine searches for.

After creating User Defined data types, you can add them to existing DLP Content Profiles or to new ones.

Creating New Keyword and Dictionary Data Types

Create a custom keyword or dictionary for the sensitive content that you want the DLP engine to search for. For dictionaries, you can maintain the entries in a CSV file, and then paste them as the values for that dictionary.

  • DLP engine searches for an exact match of each keyword or dictionary entry

  • No limit for the number of words or characters in a keyword

  • Keywords and dictionaries are NOT case sensitive

  • Entries in a dictionary have an OR relationship between them

  • Phrases must be an exact match on each word. For example, the phrase health care doesn't match healthcare.

    To match all of these forms with a dictionary, create the following three values: health, care, healthcare

  • Words and phrases are identified according to standard word boundaries, for example a space after a word. For a complete list of supported word boundaries, see Word Boundaries for Keyword and Dictionary Data Types below.
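
The matching rules above can be approximated locally when you plan dictionary entries. The following is a minimal sketch in Python, not Cato's DLP engine: the function names are hypothetical, and it treats any non-word character as a word boundary, which is an assumption that only roughly mirrors the boundary list in this article.

```python
import re


def contains_entry(text, entry):
    """Return True if `entry` appears in `text` as a whole word or phrase.

    Assumption: a boundary is any position not adjacent to a word character.
    The real engine uses its own boundary list.
    """
    # re.escape keeps punctuation in the entry literal; IGNORECASE mirrors
    # the rule that keywords and dictionaries are not case sensitive.
    pattern = r"(?<!\w)" + re.escape(entry.strip()) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def dictionary_matches(text, entries):
    """Dictionary entries have an OR relationship: any single hit is a match."""
    return any(contains_entry(text, entry) for entry in entries)


if __name__ == "__main__":
    sample = "Our healthcare plan"
    print(contains_entry(sample, "health care"))  # phrase vs. single word
    print(dictionary_matches(sample, ["health", "care", "healthcare"]))
```

Running the sketch against sample text shows why the phrase health care and the word healthcare need separate dictionary values.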

Working with Thresholds

You can define the Threshold for each User Defined data type. The Threshold is the number of times that the keyword or dictionary must match in a file. When the number of matches equals or exceeds the Threshold, the file matches the Data Control rule (in the Security > Application Control screen).

  • Keywords - The Threshold for keywords looks for repeat occurrences that are an exact match of that word or phrase.

    • For example, the keyword apple has a Threshold of 3. If a file contains 3 or more instances of the word apple, then that file is blocked.

  • Dictionary - The Threshold for dictionaries looks for repeat occurrences of ANY value in that dictionary.

    • For example, the dictionary contains the entries apple and orange with a Threshold of 3. If a file contains 2 instances of the word apple and 1 instance of the word orange, the file is blocked.

      Also, if a file contains 3 instances of the word apple and 0 instances of the word orange, the file is blocked.
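
The Threshold arithmetic in these examples can be sketched in a few lines of Python. This is an illustration only, under the assumptions that occurrences are counted as whole-word, case-insensitive matches and that the counts for all dictionary entries are summed. The helper names are hypothetical and are not part of any Cato API.

```python
import re


def count_occurrences(text, entries):
    """Sum whole-word, case-insensitive occurrences of every entry.

    Caveat: overlapping entries (for example "apple" and "apple pie")
    would be counted separately in this sketch.
    """
    total = 0
    for entry in entries:
        pattern = r"(?<!\w)" + re.escape(entry) + r"(?!\w)"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def meets_threshold(text, entries, threshold):
    """A file matches when the total count equals or exceeds the Threshold."""
    return count_occurrences(text, entries) >= threshold


if __name__ == "__main__":
    # A keyword is handled as a one-entry list.
    print(meets_threshold("apple, Apple; apple.", ["apple"], 3))
    # A dictionary counts any of its entries toward the same Threshold.
    print(meets_threshold("apple apple orange", ["apple", "orange"], 3))
```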

User_Defined_Data_Types.png

To create a User Defined data type:

  1. From the navigation menu, select Security > DLP Configuration, and expand User Defined Data Types.

  2. Click New and then select New Keyword or New Dictionary.

  3. To create a New Keyword:

    1. Enter the Name and Description for the keyword.

    2. Select the Threshold, the minimum number of times that the keyword appears in the file.

    3. Enter the Keyword/Phrase.

    4. Click Apply, and then click Save.

  4. To create a New Dictionary:

    1. Enter the Name and Description for the dictionary.

    2. Select the Threshold, the minimum number of times that one of the dictionary entries appears in the file.

    3. Add (or paste) one or more values for the dictionary. Multiple values must be separated by commas.

    4. Click Apply, and then click Save.
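
If you maintain dictionary entries in a CSV file, a small script can prepare the comma-separated string to paste into the dictionary values. The sketch below is one possible approach, not a Cato tool: the function name is hypothetical, it assumes that no entry itself contains a comma, and it enforces the 50-entry limit mentioned in this article.

```python
import csv
import io

MAX_ENTRIES = 50  # dictionary size limit stated in this article


def dictionary_values_from_csv(csv_text):
    """Flatten CSV cells into one comma-separated string for pasting."""
    entries = []
    seen = set()
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            value = cell.strip()
            key = value.lower()  # matching is not case sensitive
            if value and key not in seen:
                seen.add(key)
                entries.append(value)
    if len(entries) > MAX_ENTRIES:
        raise ValueError(
            f"{len(entries)} entries exceed the limit of {MAX_ENTRIES}"
        )
    return ",".join(entries)


if __name__ == "__main__":
    print(dictionary_values_from_csv("apple\norange\nApple\n"))
```

Duplicates that differ only in case are dropped because they would match the same content.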

New_DLP_Dictionary.png

Word Boundaries for Keyword and Dictionary Data Types

To match a keyword or phrase, the DLP engine uses standard word boundaries to identify the end of each word. These are the characters that the engine recognizes as word boundaries:

  • ([\s,.:;“‘]|^)

Creating New Regex Data Types

Use regular expressions to define the type of content that matches the Data Type. For example, regex formulas let you easily match a customized corporate ID with a specific number of digits. Each Regex Data Type supports a single regular expression, so if you need to use multiple regular expressions, create a separate data type for each expression.

Use word boundaries in the expression to correctly define the content that matches the Data Type.

The regex engine is based on UTF-8 and supports characters for non-English content.

Regex Thresholds

You can define the Threshold for the expression. The Threshold is the number of times that matching content must appear in a file. When the number of matches equals or exceeds the Threshold, the file matches the Data Control rule.

For example, if you create an expression for an ID with a Threshold of 5, then only files that contain the ID five or more times are blocked.

Validating Regular Expressions

You can use the Validate Expression field to test the expression and make sure that it matches the content correctly. When you click Test, the DLP service checks if the content matches the regular expression. This is the same service that runs in the Cato Cloud, so the test results reflect the behavior you will see in your account.

Validating the expression also includes the Threshold for the Data Type. So when the Threshold is greater than 1, the value must appear at least that many times for the test to succeed.
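
Before you paste an expression into the Validate Expression field, you can draft it locally. The sketch below uses Python's re module, which is not the Cato DLP engine and may behave differently, so treat the result as a first pass only. The ID format EMP- followed by six digits and the function name are hypothetical examples.

```python
import re


def validate_expression(expression, sample_text, threshold=1):
    """Return (passed, matches) for a draft expression.

    `passed` is True when the number of non-overlapping matches equals
    or exceeds the threshold, mirroring how the Threshold is described
    in this article.
    """
    matches = [m.group(0) for m in re.finditer(expression, sample_text)]
    return len(matches) >= threshold, matches


if __name__ == "__main__":
    expression = r"EMP-[0-9]{6}"  # hypothetical corporate ID format
    sample = "Badges EMP-123456 and EMP-654321 were issued."
    print(validate_expression(expression, sample, threshold=2))
    print(validate_expression(expression, sample, threshold=3))
```

The final check should still be done with Validate Expression, because that is the service that evaluates the rule in production.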

Regex_User_Data_Type.png

To create a User Defined Regex Data Type:

  1. From the navigation menu, select Security > DLP Configuration, and expand User Defined Data Types.

  2. Click New and then select New Regex.

  3. Enter the Name and Description for the Regex Data Type.

  4. Select the Threshold, the minimum number of times that the text that matches the Expression appears in the file.

  5. In Expression, enter the regular expression for this Data Type.

  6. (Optional) Expand Validate Expression, enter the text and click Test.

  7. Click Apply, and then click Save.

Supported Operators and Quantifiers

These are the regular expression operators and quantifiers that are supported for the User Defined Regex Data Types:

Operators   Matched Pattern
\           Quote the next meta-character
^           Match the beginning of a line
$           Match the end of a line
.           Matches any single character
|           Alternation
()          Capture groups are not supported. Parentheses can be used for bounding sub-expressions.
[xy]        Matches a single character from those given between the brackets
[x-z]       The range of characters between x and z
[^z]        Any character except z

Quantifiers   Matched Pattern
*             Match 0 or more times (see note below)
+             Match 1 or more times (see note below)
?             Match 0 or 1 time
{n}           Match exactly n times
{n,}          Match at least n times
{n,m}         Match at least n times, but not more than m

Note: Unrestricted greedy quantifiers on arbitrary characters, such as .* or .+, are not allowed. If you need to include these characters in a character class or set, reverse their order, for example *. instead of .*

Instead of using these greedy quantifiers, you can use .{1,50}, which supports up to 50 characters for each keyword or pattern in the regex data type.
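
A simple pre-check can catch the restricted quantifiers before you enter an expression. The following Python sketch is a deliberately naive textual check with hypothetical names. It does not reproduce Cato's validation, and it also applies the 256-character maximum for regular expressions that this article lists as a limitation.

```python
import re

MAX_EXPRESSION_LENGTH = 256  # maximum expression length stated in this article

# An unescaped "." directly followed by "*" or "+". This is textual only:
# it also flags ".*" inside a character class, and it misses an escaped
# backslash followed by ".*".
UNBOUNDED_WILDCARD = re.compile(r"(?<!\\)\.[*+]")


def precheck_expression(expression):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(expression) > MAX_EXPRESSION_LENGTH:
        problems.append(
            f"expression has {len(expression)} characters, "
            f"more than {MAX_EXPRESSION_LENGTH}"
        )
    if UNBOUNDED_WILDCARD.search(expression):
        problems.append("contains .* or .+, use a bounded form such as .{1,50}")
    return problems


if __name__ == "__main__":
    for candidate in ["ID-.*", "ID-.{1,50}", "[.*]", "[*.]"]:
        print(candidate, precheck_expression(candidate))
```

Because the check is textual, a class written as [.*] is flagged while [*.] is not, which is consistent with the advice to reverse the characters inside a class or set.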

Best Practices for User Defined Data Types

  • When you implement the policy, or add a new application with the Block action:

    • First use the Monitor action for the rule.

    • Review the events that the rule generates and make sure that there are no events for traffic that you want to allow (false positive traffic).

    • If there is false positive traffic, you can make these changes:

      • Refine the scope of the rule to exclude the false positive traffic

      • Create a new allow rule before the block rule, and limit the scope of the new rule to the false positive traffic

      • Refine the regular expression and make sure that you validate it with an accurate example of the content you are scanning for

  • Remember that the Application Control policy is an ordered policy, and the final implicit rule is ANY ANY Accept. Add rules to the policy to block the relevant application traffic, activities and criteria.
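
The effect of rule order can be illustrated with a small model. The Python sketch below assumes first-match semantics, where the first rule whose scope matches decides the action, and it falls through to the implicit ANY ANY Accept rule. The Rule class, the traffic fields, and the rule names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # scope of the rule
    action: str  # "Allow" or "Block" in this sketch


def evaluate(rules, traffic):
    """Return (rule name, action) of the first matching rule, top to bottom."""
    for rule in rules:
        if rule.matches(traffic):
            return rule.name, rule.action
    # Final implicit rule of the ordered policy.
    return "implicit ANY ANY", "Accept"


if __name__ == "__main__":
    rules = [
        # Allow rule placed before the block rule, scoped only to the
        # false positive traffic.
        Rule("allow-hr-uploads",
             lambda t: t["dlp_match"] and t["user_group"] == "hr",
             "Allow"),
        Rule("block-dlp-match", lambda t: t["dlp_match"], "Block"),
    ]
    print(evaluate(rules, {"dlp_match": True, "user_group": "hr"}))
    print(evaluate(rules, {"dlp_match": True, "user_group": "sales"}))
    print(evaluate(rules, {"dlp_match": False, "user_group": "sales"}))
```

If the two rules were swapped, the block rule would match first and the allow rule would never take effect.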

Known Limitations

  • The file size limit for content inspection is between 1KB and 20MB. Events for files outside this range show the verdict bypassed due to size.

  • There is a maximum limit of 256 characters for a regular expression.

  • Base64 encoded files are not supported, and the DLP engine can't inspect the content in these files.
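
To estimate which files fall inside the inspection range, you can compare file sizes against the limits above. This Python sketch assumes binary units (1KB = 1024 bytes, 20MB = 20 x 1024 x 1024 bytes) and inclusive bounds. The article doesn't specify either detail, so treat the exact cutoffs as approximate.

```python
MIN_INSPECTED_BYTES = 1 * 1024          # assumed meaning of 1KB
MAX_INSPECTED_BYTES = 20 * 1024 * 1024  # assumed meaning of 20MB


def is_within_inspection_range(size_bytes):
    """Return True if a file of this size would be inspected under the
    assumptions above; otherwise the event would be expected to show the
    verdict bypassed due to size."""
    return MIN_INSPECTED_BYTES <= size_bytes <= MAX_INSPECTED_BYTES


if __name__ == "__main__":
    for size in (512, 4096, 25 * 1024 * 1024):
        print(size, is_within_inspection_range(size))
```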
