Validating and escaping data

5 February 2018 | Documentation v1

Developing in a hostile environment

Every IP address and every website on the internet is subjected to continuous probing by bots looking for weaknesses that can be exploited. This happens 24/7. There are no days off. If you connect a new server to the internet it will be found - probably within minutes - and the scanning will start. Sounds like science fiction? It's not. It's quite educational to set up a honeypot and see how long it takes to get attacked. Give it a try!

So web applications absolutely have to have strong defences against casual abuse. It is critical that any data accepted from an untrusted source be validated before it is put to use.

The TfValidator class provides methods to handle common data validation operations. Stick to the methods in this class, rather than inventing your own, to ensure that your data is validated in a rigorous and consistent way. That way, if an error is found in the validator class, fixing it there will fix your entire site.

You do need to know what it does and how to use it appropriately, though. The main principles are:

TfValidator tries to validate data but does not try to sanitise it.
Tuskfish policy is to escape data at the point of use, to suit the context it is being used in. Note that validation and escaping are separate operations and should not be mixed together!

Validate, don't "sanitise"

Validation means checking that a piece of data matches your expectations - for example that a piece of input is both the type of data that you are expecting and falls within the range of values you were expecting. Tuskfish follows the principle that if data fails validation it should be discarded (there are a few minor exceptions). If the input is bad, reject it and display an error if appropriate.
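As a minimal sketch of "check the type and the range, and reject on failure", the snippet below uses PHP's built-in filter_var(). The helper name validateIntInRange() is hypothetical and is not part of Tuskfish; it only illustrates the principle.

```php
<?php
// Hypothetical helper (not a Tuskfish method): returns the validated integer,
// or null if the input is not an integer within [$min, $max].
function validateIntInRange($input, int $min, int $max): ?int
{
    $result = filter_var($input, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
    ]);

    // filter_var() returns false when validation fails.
    return $result === false ? null : $result;
}

$limit = validateIntInRange('25', 1, 100);          // Valid: integer in range.
$tooBig = validateIntInRange('5000', 1, 100);       // Rejected: out of range.
$garbage = validateIntInRange('25; DROP', 1, 100);  // Rejected: not an integer.

if ($garbage === null) {
    // Input failed validation: stop here rather than trying to repair it.
}
```

The caller gets either a usable value or nothing at all, so there is no half-valid state to handle further down the line.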

You will often encounter another (bad) idea amongst web developers that data should be "sanitised". This school of thought tends to try and combine data validation and escaping. So if a piece of input does not meet expectations (fails validation) they will try to interpret and use it anyway, applying various forms of escaping to try and make the malformed input "safe" for use.

Sanitising input usually makes no sense. If a user has made a genuine mistake and entered bad data, is there any point in proceeding to process their request? Probably not. If the bad data is actually an attempt to exploit the system then the situation is even worse, because continuing to process the request will expose the system to unnecessary risk. If data fails validation, stop.

Escape data at the point of use

Data needs to be escaped in different ways, depending on how it is being used. For example, if you want to send form input to the database for storage you should be concerned about SQL injection attacks. When you retrieve the same data for display on screen you should be concerned about XSS attacks.

Trying to apply every conceivable type of escaping to data at the point of input doesn't work and is a mistake. Firstly, XSS attacks pose no threat to the database, so there is no value in XSS escaping data for storage. Secondly, if you do store XSS escaped data it will break database queries involving escaped characters, eg. a search for "O'Neil" will return no results, because the stored value has been escaped to "O&#039;Neil".

The type of escaping required depends on the context and that's why it is best to escape data at the point of actual use. Escaping input to make it "database safe" and "XSS safe" are different issues with different solutions. Escape data for database safety when you are sending it to the database; escape data for XSS when you are sending it to display.
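The following sketch shows the same raw value escaped for two different contexts, and why a copy escaped at the point of input no longer matches a plain search term. It uses only PHP built-ins; the variable names are illustrative and nothing here is Tuskfish code.

```php
<?php
// The raw value is what should be stored and searched.
$name = "O'Neil & Sons";

// HTML context: encode the characters that are special in markup.
$forHtml = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// URL context: percent-encode for use in a query string.
$forUrl = rawurlencode($name);

// A search term typed by a user matches the raw value...
$searchTerm = "O'Neil";
$foundInRaw = strpos($name, $searchTerm) !== false;

// ...but not a copy that was HTML-escaped before storage.
$foundInEscaped = strpos($forHtml, $searchTerm) !== false;
```

Each context has its own encoding, and neither encoded form is a substitute for the other or for the raw value.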

Exceptions

There are a couple of exceptions to this rule, which I will mention briefly:

All $_POST and $_GET data (eg. from forms) arrive as strings. If you need to use a parameter as some other data type you have no choice but to explicitly cast it to the new type. Some type conversions (for example from string to int or bool) effectively sanitise data as a side effect and that's ok.
HTML needs to be encoded for entities before sending to the database, so that browsers can distinguish between markup and content. In Tuskfish the only HTML properties are the teaser, description and icon properties of content objects. The TinyMCE editor handles the escaping for you (it will encode < > & within text nodes, and ' " inside attribute values). The search function is aware of entity encoding and will take it into account when preparing database queries.
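To illustrate the first exception (type casting), here is a short sketch of what explicit casts do to string input in PHP. The values are made up for the example. Note that a cast to bool can surprise you, so it is worth knowing the rules before relying on it.

```php
<?php
// Casting to int keeps any leading digits and discards the rest.
$id = (int) '42; DROP TABLE content';

// A string with no leading digits becomes zero.
$junk = (int) 'abc';

// Casting to bool: only the empty string and '0' are false.
$flagZero = (bool) '0';

// Any other non-empty string is true, including the word 'false'.
$flagWord = (bool) 'false';
```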

Validating input parameters

Typically you will need to validate a few parameters on each page load. Mainly these will be integers related to content IDs, tag IDs and limits and offsets related to pagination controls.

TfValidator provides methods you can use to validate the type and, in some cases, the range of parameters. The validation methods return true if the data matches expectations and false otherwise. They do not provide any form of escaping, and valid data can be dangerous in some circumstances; for example the apostrophe character " ' " is a legitimate part of the specification for email addresses, but it is not the sort of thing that you want to go unescaped into a database query.
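A quick sketch of the "valid but still dangerous" point, using PHP's built-in email filter as a stand-in for isEmail() (this is an assumption for illustration; it is not a statement about how TfValidator is implemented):

```php
<?php
// A legitimate address that contains an apostrophe.
$email = "o'neil@example.com";

// The address passes validation...
$isValid = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;

// ...yet it still contains a character that must never be
// concatenated directly into an SQL query.
$containsQuote = strpos($email, "'") !== false;
```

Validation tells you the data is what you expected; it says nothing about whether the data is safe in a particular output context.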

The available validation methods in the TfValidator class, typically accessed as static methods, are:

escapeForXss() // Escape string for XSS by passing it through htmlspecialchars() with UTF-8 encoding.
encodeEscapeUrl() // URL-encode and escape a string for use in a URL (eg. as a query string).
filterHtml() // Validate (and to some extent, "sanitise") HTML input to conform with whitelisted tags, using HTMLPurifier.
hasTraversalorNullByte() // Check if a file path contains directory traversals or null bytes.
isAlpha() // Alphabetical string.
isAlnum() // Alphanumeric string.
isAlnumUnderscore() // Alphanumeric and underscores, only.
isBool() // Boolean values. Non-boolean values and null return false.
isDigit() // Numerical string.
isEmail() // Valid email address. Note that email addresses can legitimately contain database-unsafe characters.
isFloat() // Floating point values.
isInt() // Can optionally range check minimum and/or maximum value.
isIp() // Must specify if IP4 or IP6.
isUrl() // Valid URL.
isArray() // Is an array.
isObject() // Is an object.
isNull() // Is null.
isResource() // For example a file handle, etc.
isUtf8() // Check that a string encoding is consistent with UTF-8.
trimString() // Cast to string, check UTF-8 encoding and strip trailing whitespace and control characters.

A couple of methods worth a special mention:

filterHtml() can be used to screen untrusted HTML input through the HTMLPurifier library. Tuskfish uses the default configuration of HTMLPurifier, with the exception that it mandates UTF-8 character encoding and permits use of ID attributes. It is possible to customise its behaviour to suit your own preferences; refer to the HTMLPurifier documentation for more information.

trimString() is another helper method that you will use a lot. Anytime you are expecting string-type data you should pass it through this method first. trimString() strips trailing white space and control characters, checks that the character set is UTF-8 and casts the data to string (so do not use it on non-string data types such as integers!).
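To make the behaviour described above concrete, here is a rough sketch of a function that does the same three things: cast to string, check UTF-8, and strip whitespace and control characters. The name sketchTrimString() is hypothetical, and two details are assumptions rather than documented Tuskfish behaviour: it returns an empty string for invalid UTF-8, and it trims both ends of the string.

```php
<?php
// Hypothetical stand-in for illustration only; not the TfValidator source.
function sketchTrimString($value): string
{
    // Cast to string first (so integers become numeric strings).
    $text = (string) $value;

    // preg_match() with the /u modifier only succeeds on valid UTF-8.
    if (preg_match('//u', $text) !== 1) {
        return ''; // Assumption: reject invalid encodings outright.
    }

    // Strip bytes 0x00 to 0x20 (control characters and space) from both ends.
    return trim($text, "\x00..\x20");
}

$clean = sketchTrimString("  hello\n\0");
$rejected = sketchTrimString("\xC3\x28"); // Invalid UTF-8 byte sequence.
$number = sketchTrimString(42);           // Now a string, no longer an integer.
```

The last line shows why the method should not be used on values you intend to keep as integers: the return type is always string.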

An example of data validation

This is a typical example of validating input parameters. Tuskfish actually doesn't allow a lot of input parameters, mainly just integers, which you can handle like this:

// Validate input parameters. Note use of new null coalescing operator.
$cleanId = (int) ($_GET['id'] ?? 0);
$cleanStart = (int) ($_GET['start'] ?? 0);
$cleanTag = (int) ($_GET['tagId'] ?? 0);

This is what using TfValidator to validate input parameters looks like. Note that most of these methods do not sanitise the data, they just tell you if it is the expected type or not:

/** An instance of TfValidator ($tfValidator) is globally available via instantiation in tfHeader.php. */

// Validate that $params is an array and not empty.
// Note: $cleanFilename is assumed to have been initialised earlier (it is appended to below).
if ($tfValidator->isArray($params) && !empty($params)) {

    foreach ($params as $key => $value) {
        if ($value) {

            // Trim null bytes and spaces, cast to string with UTF-8 encoding.
            $cleanKey = $tfValidator->trimString($key);
            $cleanValue = $tfValidator->trimString($value);

            // Validate that string consists of alphanumeric and underscore characters, only.
            if ($tfValidator->isAlnumUnderscore($cleanKey) && $tfValidator->isAlnumUnderscore($cleanValue)) {
                $cleanFilename .= '&' . $cleanKey . '=' . $cleanValue;
            }
        }

        unset($key, $value, $cleanKey, $cleanValue);
    }
}

Character encoding

Tuskfish requires you to use UTF-8 100% of the time. There is literally no sane reason to use any other character set and it will just cause you problems, so don't do it. You can test if a string is UTF-8 with the isUtf8() method:

// Test for UTF-8 compliance. Returns true or false.
if ($tfValidator->isUtf8($some_string)) {
    // ...
}

However, calling trimString() will also test for UTF-8 internally (using the above method) while also trimming white space and control characters and casting to string, so in practice you will more often do this:

// Cast to string, check for UTF-8 compliance and remove trailing spaces, null bytes and control characters.
$cleanString = $tfValidator->trimString($dirtyString);
// You may wish to conduct further validation tests, depending on what you are expecting to receive.

Character restrictions

As an additional measure there are some character restrictions on identifiers. The following are restricted to alphanumeric and underscore characters, only:

Database names.
Column names.
File names, including for templates.

The following are restricted to alphanumeric characters, only:

Table names.
Object property names.
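The two character restrictions above can be sketched with simple regular expressions. The function names below are hypothetical (they are not the TfValidator methods), and the patterns assume ASCII letters and digits only.

```php
<?php
// Hypothetical helpers for illustration; not Tuskfish source code.

// Letters, digits and underscores only (eg. column and file names).
function sketchIsAlnumUnderscore(string $text): bool
{
    // \A and \z anchor to the true start and end, so a trailing newline fails.
    return preg_match('/\A[a-zA-Z0-9_]+\z/', $text) === 1;
}

// Letters and digits only (eg. table and property names).
function sketchIsAlnum(string $text): bool
{
    return preg_match('/\A[a-zA-Z0-9]+\z/', $text) === 1;
}

$columnOk = sketchIsAlnumUnderscore('content_type');
$columnBad = sketchIsAlnumUnderscore('title; DROP TABLE content');
$tableBad = sketchIsAlnum('content_type'); // Underscore is not permitted here.
```

Restricting identifiers to such a small alphabet leaves no room for quotes, spaces or punctuation, which removes most of the raw material for injection through identifiers.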

Mitigating SQL injection

Tuskfish guards against SQL injection through exclusive use of PDO and prepared statements with bound values. Queries are constructed using placeholders for parameters. The placeholders are then explicitly bound to the actual data, which ensures that it gets properly escaped. Data is never directly inserted into a query. Column and table names ("identifiers") are also internally validated and escaped.

You should not have to worry about SQL injection if you stick to the TfDatabase methods, but if you want to write your own queries then you need to be careful (see the database section for more information and resources).
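If you do write your own queries, the general pattern of placeholders and bound values looks something like the sketch below. It uses plain PDO with an in-memory SQLite database (this assumes the pdo_sqlite extension is available); the table, column and function names are invented for the example and this is not the TfDatabase API.

```php
<?php
// Throwaway in-memory database for demonstration purposes.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT)');

// The query text contains a placeholder, never the data itself.
$insert = $pdo->prepare('INSERT INTO person (surname) VALUES (:surname)');
$insert->bindValue(':surname', "O'Neil", PDO::PARAM_STR);
$insert->execute();

// Hypothetical helper: count rows matching a surname via a bound value.
function countBySurname(PDO $pdo, string $surname): int
{
    $statement = $pdo->prepare('SELECT COUNT(*) FROM person WHERE surname = :surname');
    $statement->bindValue(':surname', $surname, PDO::PARAM_STR);
    $statement->execute();

    return (int) $statement->fetchColumn();
}

// The apostrophe is handled as data, so the genuine search works.
$genuineMatches = countBySurname($pdo, "O'Neil");

// A classic injection string is also treated as data and matches nothing.
$hostileMatches = countBySurname($pdo, "x' OR '1'='1");
```

Because the value travels separately from the query text, the apostrophe never gets a chance to terminate a string literal in the SQL.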

Mitigating XSS attacks

XSS attacks are only a threat to end users viewing your site in a browser. As Tuskfish only supports a single administrative user you (presumably) should not have to worry about people inserting XSS attacks into your site, because you are the only person submitting content. But anyway, it's good practice, it's easy and it will future proof your site, so let's do it.

Generally speaking, output escaping should be conducted in the HTML template files. In the case of nested templates, child templates should escape their own data before they are inserted into a parent (a parent template cannot escape a child that contains HTML markup without destroying it). In some cases, mainly select boxes, pagination and other controls generated by functions rather than templates, the escaping should be handled by the function before the control is inserted into a parent template, for the same reason as above.
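As a rough illustration of a control escaping its own data before being handed to a parent, here is a sketch of a function that builds the options for a select box. The function name and the sample data are hypothetical and do not come from Tuskfish.

```php
<?php
// Hypothetical child renderer: escapes each value and label itself,
// then returns finished markup that the parent must NOT escape again.
function renderOptions(array $options): string
{
    $html = '';

    foreach ($options as $value => $label) {
        $html .= '<option value="'
            . htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8')
            . '">'
            . htmlspecialchars((string) $label, ENT_QUOTES, 'UTF-8')
            . '</option>';
    }

    return $html;
}

// The parent simply inserts the already-escaped child markup.
$select = '<select name="tagId">'
    . renderOptions([1 => 'News', 2 => 'R&D <beta>'])
    . '</select>';
```

If the parent ran htmlspecialchars() over the whole of $select, the option tags themselves would be encoded and the control would be displayed as text rather than rendered.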

Escaping data for display is basically a case of passing it through htmlspecialchars() specifying UTF-8 encoding. There are two wrapper methods for this in Tuskfish.

Content, preference and metadata objects internally escape data for output when you access a property via a getter method (see below) and also convert it to human readable form, where necessary (eg. converting timestamps to dates). Simply use the appropriate getter method when you want to output a property for display:

echo $contentObject->getTitle();

Note that the teaser and description fields of content objects are not escaped by this method, as they were validated on input by HTMLPurifier (you can't use htmlspecialchars() on HTML markup). However, some processing still occurs, so it is still important to pass these fields through the internal getter method. At present this is basically replacing the TFISH_LINK constant used to make URLs portable with the domain name, but more things may conceivably be added in future.

You can escape data from any other source (eg. non-object data) for display using the escapeForXss() method of TfValidator. For example, in a template file you might output a variable like this:

echo $tfValidator->escapeForXss($someVariable);

An important difference with this generic method is that data will not be converted to human readable form (eg. timestamps will not be converted to dates).
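The difference can be sketched with a plain wrapper around htmlspecialchars(). The function name sketchEscape() is hypothetical, and the ENT_QUOTES flag is an assumption for the example rather than a statement of what escapeForXss() uses internally.

```php
<?php
// Hypothetical generic escaper: encodes special characters, nothing more.
function sketchEscape($value): string
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

// Markup in ordinary strings is neutralised.
$safeText = sketchEscape('<b>Tom & Jerry</b>');

// A timestamp passes through unchanged, as a string of digits.
$timestamp = 1517788800;
$escapedTimestamp = sketchEscape($timestamp);

// Converting it to a human readable date is a separate step
// that you would need to do yourself (UTC used here).
$readableDate = sketchEscape(gmdate('j F Y', $timestamp));
```

So when you use the generic method on raw values such as timestamps, do any formatting first and escape the formatted result.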

Tuskfish CMS Developer Guide

This guide will give you an overview of the architecture of Tuskfish CMS, how to write code to perform common operations and how to extend the system to suit yourself. The guide accompanies the Tuskfish API documentation. Keep a copy handy as you read this guide. It is best to review links to the API where provided, as not every detail will be discussed in the text. This is the first version of the guide, so it is still a work in progress.