Validating and escaping data (v2)

11 April 2022 | Documentation v2

Developing in a hostile environment

Every IP address and every website on the internet is subjected to continuous probing by bots looking for weaknesses that can be exploited. This happens 24/7. There are no days off. If you connect a new server to the internet it will be found - probably within minutes - and the scanning will start. Sounds like science fiction? It's not. It's quite educational to set up a honeypot and see how long it takes to get attacked. Give it a try and you'll see!

So web applications absolutely have to have strong defences against casual abuse. It is critical that any data accepted from an untrusted source be validated before it is put to use.

Tuskfish provides a standardised set of methods for data validation, which are available as a series of traits that you can include in any of your own classes without concern for inheritance. Stick to the methods they provide, rather than inventing your own, to ensure that your data is validated in a rigorous and consistent way. That way, if an error is found in a trait, fixing it there will fix it across your entire site.

The available traits and methods are:

EmailCheck

isEmail(string $email): bool // Validate email address meets specification.

HtmlPurifier

Include and configure an instance of HTMLPurifier for validating HTML form fields and properties as follows:

// Instantiate a copy of HTMLPurifier and assign as a property of this object.
$this->htmlPurifier = $this->getHtmlPurifier();

// Use it to filter HTML input.
$clean['teaser'] = $this->htmlPurifier->purify($teaser);

IntegerCheck

isInt(int $int, int $min = null, int $max = null): bool // Validate integer, optionally include range check.

TraversalCheck

hasTraversalorNullByte(string $path): bool // Check if a file path contains traversals (including encoded traversals) or null bytes.

UrlCheck

isUrl(string $url): bool // Validate URLs.

ValidateString

encodeEscapeUrl(string $url): string // URL-encode and escape a query string for use in a URL.
isAlnum(string $alnum): bool // Check that a string is comprised solely of alphanumeric characters.
isAlnumUnderscore(string $alnumUnderscore): bool // Check that a string is comprised solely of alphanumeric characters and underscores.
isAlpha(string $alpha): bool // Check that a string is comprised solely of alphabetical characters.
isUtf8(string $text): bool // Check if the character encoding of text is UTF-8.
trimString($text): string // Cast to string, check UTF-8 encoding and strip trailing whitespace and control characters.

ValidateToken

validateToken(string $token) // Validate a cross-site request forgery token from a form submission.

Several other traits provide whitelists of permitted values that you can use to populate forms and range check data:

Language

listLanguages(): array // Returns a list of languages in use by the system.

Mimetypes

listMimetypes(): array // Returns a list of common (permitted) mimetypes for file uploads.

Rights

listRights(): array // Returns a list of intellectual property rights licenses for the content submission form.

States

stateList(): array // Returns a list of states ("countries") and dependent territories as recognised by the UN (ISO 3166-1 alpha-2).

Timezones

listTimezones(): array // Returns a list of recognised time zones.
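As an illustration of range checking against a whitelist, here is a minimal sketch. listLanguagesSketch() is a hypothetical stand-in for a list method such as listLanguages(), and the code => name array shape is an assumption made for this example, not taken from the Tuskfish source.

```php
<?php
// Hypothetical whitelist; the keys are the only values a form may submit.
function listLanguagesSketch(): array
{
    return ['en' => 'English', 'th' => 'Thai'];
}

// Accept a submitted value only if it is one of the whitelisted keys.
function isPermittedLanguage(string $code): bool
{
    return array_key_exists($code, listLanguagesSketch());
}
```

The same array can populate the select box in the form and check the value that comes back, so the two can never drift apart.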

Traits are located in trust_path/libraries/tuskfish/class/Tfish/Traits. They are incorporated into classes with the 'use' keyword, for example:

class Metadata
{
    use \Tfish\Traits\ValidateString;
    ...
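To make the pattern concrete, here is a minimal, self-contained sketch. AlnumCheckSketch and MetadataSketch are hypothetical stand-ins written for this example (they are not the Tuskfish source); the point is only that a class gains the validation method through 'use', with no parent class involved.

```php
<?php
// Hypothetical stand-in for a validation trait; not the Tuskfish implementation.
trait AlnumCheckSketch
{
    // True only if the string is non-empty and purely ASCII alphanumeric.
    public function isAlnum(string $alnum): bool
    {
        return preg_match('/\A[A-Za-z0-9]+\z/', $alnum) === 1;
    }
}

// Hypothetical class that pulls the method in with 'use'; no inheritance needed.
class MetadataSketch
{
    use AlnumCheckSketch;

    private string $label = '';

    // Reject the value outright if it fails validation.
    public function setLabel(string $label): void
    {
        if (!$this->isAlnum($label)) {
            throw new InvalidArgumentException('Label must be alphanumeric.');
        }
        $this->label = $label;
    }

    public function label(): string
    {
        return $this->label;
    }
}
```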

Validate data, don't sanitise

Validation means checking that a piece of data matches your expectations - for example that a piece of input is both the type of data that you are expecting and falls within the range of values you were expecting. Tuskfish follows the principle that if data fails validation it should be discarded (there are a few minor exceptions). If the input is bad, reject it and display an error if appropriate.

You will often encounter another (terrible) idea amongst web developers: that data should be "sanitised". This school of thought tends to combine data validation and escaping operations. So if a piece of input does not meet expectations (fails validation) they will try to interpret and use it anyway, applying various forms of escaping to try to make the malformed input "safe" for use.

Sanitising input usually makes no sense. If a user has made a genuine mistake and entered bad data, is there any point in proceeding to process their request? Probably not. If the bad data is actually an attempt to exploit the system then the situation is even worse, because continuing to process the request will expose the system to unnecessary risk.

If data fails validation, stop.
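A minimal sketch of "validate, then stop" for an integer parameter, using PHP's built-in filter_var(). requireIntInRange() is a hypothetical helper named for this example; it is not part of Tuskfish, and throwing an exception is just one possible way to halt processing.

```php
<?php
// Hypothetical helper: returns the integer if it is valid, otherwise throws.
function requireIntInRange(string $raw, int $min, int $max): int
{
    $value = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => $min, 'max_range' => $max],
    ]);

    if ($value === false) {
        // Do not try to salvage the input; reject it.
        throw new InvalidArgumentException('Value is not an integer in the permitted range.');
    }

    return $value;
}
```

Note that input such as '25abc' is refused rather than trimmed down to 25, which is the difference between validating and sanitising.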

Escape data at the point of use

Tuskfish policy is to escape data at the point of use, to suit the context it is being used in. Note that validation and escaping are separate operations and should not be mixed together!

Data needs to be escaped in different ways, depending on how it is being used. For example, if you want to send form input to the database for storage you should be concerned about SQL injection attacks. When you retrieve the same data for display on screen you should be concerned about XSS attacks.

Trying to apply every conceivable type of escaping to data at the point of input doesn't work and is a mistake. First, XSS attacks pose no threat to the database, so there is no value in XSS-escaping data for storage. Second, storing XSS-escaped data breaks database queries involving the escaped characters, e.g. a search for "O'Neil" will return no results, because the stored value has been encoded to "O&#039;Neil".

The type of escaping required depends on the context and that's why it is best to escape data at the point of actual use. Escaping input to make it "database safe" and "XSS safe" are different issues with different solutions. Escape data for database safety when you are sending it to the database; escape data for XSS when you are sending it to display. And so on.
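The storage problem is easy to reproduce. The sketch below uses PHP's htmlspecialchars() as a stand-in for XSS escaping (an assumption for this example) and shows that the encoded value no longer matches the text the user typed.

```php
<?php
$raw = "O'Neil";

// Encoding for HTML is right for display...
$forDisplay = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// ...but the encoded form is a different string, so storing it would break
// a later search for the text the user actually typed.
$searchTerm = "O'Neil";
$matchesEncoded = ($forDisplay === $searchTerm);
$matchesRaw = ($raw === $searchTerm);
```

Keeping the raw value in storage and encoding only on the way out to the browser avoids the mismatch.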

Exceptions

There are a couple of exceptions to this rule, which I will mention briefly:

All $_POST and $_GET data (e.g. from forms) arrive as strings. If you need to use a parameter as some other data type you have no choice but to explicitly cast it to the new type. Some type conversions (for example from string to int or bool) effectively sanitise data as a side effect and that's ok.
HTML needs to be encoded for entities before sending to the database, so that browsers can distinguish between markup and content. In Tuskfish the only HTML properties are the teaser, description and icon properties of content objects. The TinyMCE editor handles the escaping for you (it will encode < > & within text nodes, and ' " inside attribute values). The search function is aware of entity encoding and will take it into account when preparing database queries.
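The first exception can be seen with plain PHP casts. In this sketch the string literals stand in for request parameters; the variable names are illustrative only.

```php
<?php
// Request parameters arrive as strings; these literals stand in for $_GET values.
$id = (int) '42';          // 42
$junk = (int) 'abc';       // 0: non-numeric text collapses to zero
$mixed = (int) '7 apples'; // 7: only the leading digits survive
$flag = (bool) '0';        // false
$other = (bool) 'false';   // true: any non-empty string other than '0' is true
```

Because a cast never fails, the result is always a harmless value of the target type, but it may not be the value you expected, so a range check afterwards is still worthwhile.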

Helper methods and functions

A couple of methods worth a special mention:

xss(string $output) encodes entities (do not use on HTML markup, which should be filtered on input with HTMLPurifier instead). This is a function rather than a method and is available in all templates automatically.
trimString() from the ValidateString trait is a method that you will use a lot. Anytime you are expecting string-type data you should pass it through this method first. trimString() strips trailing whitespace and control characters, checks that the character set is UTF-8 and casts the data to string (so do not use it on non-string data types such as integers!).

Character encoding

Tuskfish requires you to use UTF-8 100% of the time. There is literally no sane reason to use any other character set and it will just cause you problems, so don't do it. You can explicitly test if a string is UTF-8 with the isUtf8() method, but passing it through trimString() will also call isUtf8() internally.
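For readers who want to see the idea in code, here is a rough sketch of a UTF-8 check and a trimString()-style helper built on it. Both functions are hypothetical and are not the Tuskfish implementations; in particular, returning an empty string for invalid input is an assumption of this sketch, and PHP's trim() removes leading as well as trailing whitespace.

```php
<?php
// Hypothetical UTF-8 check: PCRE's /u modifier refuses invalid UTF-8 subjects.
function sketchIsUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Hypothetical trimString()-style helper: cast, check encoding, then trim.
function sketchTrimString($text): string
{
    $text = (string) $text;

    if (!sketchIsUtf8($text)) {
        // Assumption for this sketch: discard input that is not valid UTF-8.
        return '';
    }

    // trim() strips spaces, tabs, newlines and NUL bytes from both ends.
    return trim($text);
}
```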

Character restrictions

As an additional measure there are some character restrictions on identifiers. The following are restricted to alphanumeric and underscore characters, only:

Database names.
Column names.
File names, including for templates.

The following are restricted to alphanumeric characters, only:

Table names.
Object property names.
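Restrictions like these can be expressed as short anchored patterns. The functions below are hypothetical sketches, not the ValidateString methods themselves, and they assume ASCII letters and digits only.

```php
<?php
// Hypothetical check: letters, digits and underscores only, at least one character.
function sketchIsAlnumUnderscore(string $value): bool
{
    // \A and \z anchor the whole string, so a trailing newline is rejected too.
    return preg_match('/\A[A-Za-z0-9_]+\z/', $value) === 1;
}

// Hypothetical check: letters and digits only, at least one character.
function sketchIsAlnum(string $value): bool
{
    return preg_match('/\A[A-Za-z0-9]+\z/', $value) === 1;
}
```

Using \z rather than $ matters here, because $ would also accept a name followed by a newline.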

Mitigating SQL injection

Tuskfish guards against SQL injection through exclusive use of PDO and prepared statements with bound values. Queries are constructed using placeholders for parameters. The placeholders are then explicitly bound to the actual data, which ensures that it gets properly escaped. Data is never directly inserted into a query. Column and table names ("identifiers") are also internally validated and escaped.

You should not have to worry about SQL injection if you stick to the methods in the Database class, but if you want to write your own queries then you need to be careful (see the database section for more information and resources).
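If you do write your own queries, the placeholder-and-bind pattern looks roughly like this. The sketch uses plain PDO with an in-memory SQLite database (it needs the pdo_sqlite extension) purely to stay self-contained; the author table and findAuthorsByName() are hypothetical and this is not the Tuskfish Database class.

```php
<?php
// In-memory SQLite database, used only to keep the sketch self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)');

// Hypothetical helper: the query text contains only a placeholder, never data.
function findAuthorsByName(PDO $pdo, string $name): array
{
    $statement = $pdo->prepare('SELECT name FROM author WHERE name = :name');
    $statement->bindValue(':name', $name, PDO::PARAM_STR);
    $statement->execute();

    return $statement->fetchAll(PDO::FETCH_COLUMN);
}

// Insert a value containing a quote; binding handles it without manual escaping.
$insert = $pdo->prepare('INSERT INTO author (name) VALUES (:name)');
$insert->bindValue(':name', "O'Neil", PDO::PARAM_STR);
$insert->execute();

// A normal search and a classic injection attempt.
$found = findAuthorsByName($pdo, "O'Neil");
$attack = findAuthorsByName($pdo, "x' OR '1'='1");
```

The injection attempt is treated as a literal name to look for, so it matches nothing instead of altering the query.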

Mitigating XSS attacks

XSS attacks are only a threat to end users viewing your site in a browser. As Tuskfish only supports a single administrative user you (presumably) should not have to worry about people inserting XSS attacks into your site, because you are the only person submitting content. But anyway, it's good practice, it's easy and it will future-proof your site. Anytime you are outputting untrusted data to display in a template, pass it through xss(string $data), which is available in every template:

// Pass data through xss() to escape it for display. Do not use on HTML markup as it encodes entities.
<?php echo xss( $content->title() ); ?>
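For a sense of what an entity-encoding helper of this kind does, here is a rough sketch built on PHP's htmlspecialchars(). xssSketch() is a hypothetical name and the flags are an assumption for this example; the real xss() function may differ in its details.

```php
<?php
// Hypothetical entity-encoding helper; not the Tuskfish xss() function.
function xssSketch(string $output): string
{
    // ENT_QUOTES encodes both single and double quotes as well as < > &.
    return htmlspecialchars($output, ENT_QUOTES, 'UTF-8');
}

$title = '<script>alert("x")</script>';
$safeTitle = xssSketch($title);
```

The encoded string is displayed by the browser as literal text rather than being run as a script, which is also why it must not be applied to fields that are meant to contain markup.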