ChipmunkNinja
Ninjas are deadly. Chipmunk Ninjas are just weird.

Marc Wandschneider is a professional software developer with well over fifteen years of industry experience (yes, he really is that old). He travels the globe working on interesting projects and gives talks at conferences and trade shows whenever possible.



Jun 16, 2005 | 14:46:59
Helping Prevent XSS Attacks in PHP5
By marcwan

Download version 0.9 of StripTags for PHP5

One of the greater dangers facing web application authors today is the Cross Site Scripting attack (given the initialism XSS, so as not to be confused with cascading style sheets). In such an attack, people filling in forms on your web site (such as a comment on a blog entry) include malicious input that, when others go to view it, can cause effects that range from the annoying (popping up advertisements) to the dangerous (redirecting you to a site that “spoofs” the current site and spies on your input).

A simple example of this would be if you implement a bulletin board-like system via which users can enter small messages of their own. A user could choose to enter in the comment body:

<script>
document.location = "http://maliciousspoofsite.com";
</script>

When they submit this page and somebody else goes to view it, they are redirected, possibly without even knowing it, to another site with all sorts of potential consequences.

Good news arrives with a very basic solution to this problem in the form of the strip_tags function in PHP. This function simply looks for any markup elements in a given string and removes them:

<?php

  $str = "This is a <strong>string</strong> with
<script>document.location = 'http://moo.cow';</script>";

  $str = strip_tags($str);

  echo $str;
?>

This script prints out:

  This is a string with
document.location = 'http://moo.cow';

While it may render output less attractive, it has effectively neutralised the danger.

Another option is the htmlspecialchars function (and its close cousin, htmlentities), which leaves the markup in place but converts characters such as < and > into the HTML entities &lt; and &gt; respectively, so the browser displays them as text instead of interpreting them.
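As a minimal sketch of that escaping approach (the variable names and the input string are illustrative only):

```php
<?php
// Illustrative input only; the URL is a placeholder.
$comment = "Nice post! <script>document.location = 'http://moo.cow';</script>";

// ENT_QUOTES also escapes single and double quotes, which matters when the
// value ends up inside an HTML attribute. 'UTF-8' names the input encoding.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

// The markup is still visible to the reader, but the browser no longer
// treats it as a script element.
echo $escaped, "\n";
```

Nothing is removed from the input here; the angle brackets simply arrive at the browser as &lt; and &gt;.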

Unfortunately, these can be extremely restrictive when we are writing web applications where we want to allow some degree of user input. If we want to let users include hyperlinks, images, or other harmless types of markup, we have a problem.

The strip_tags function does have a solution to this, but only a very crude one (which the authors admit freely and warn about well in advance). You can pass a second parameter to this function which is a string of permitted tags, such as the following:

<?php
  $str = "This <em>is</em> a <strong>string</strong> with
<script>document.location = 'http://moo.cow';</script>";

  $str = strip_tags($str, '<em><strong>');

  echo $str;
?>

The output is now:

  This <em>is</em> a <strong>string</strong> with
document.location = 'http://moo.cow';

While this is a nice improvement, it opens up huge security holes for us, depending on which tags we permit:

<?php

$malicious = <<<EOSTR

This is a malicious string with a picture in it:

<img src="http://url/abc.jpg"
     onMouseOver="document.location = 'http://badurl';"/>

<script>
  document.location = "http://badurl";
</script>
EOSTR;

$str = strip_tags($malicious, '<img>');

echo $str;
?>

While the above code will correctly filter the <script> markup element out, it will still produce the following output:

This is a malicious string with a picture in it:

<img src="http://url/abc.jpg"
     onMouseOver="document.location = 'http://badurl';"/>

  document.location = "http://badurl";

Effectively, the strip_tags function says: If a tag is permitted, then all possible attributes on it are also permitted.

What we would ideally like is a system that protects us not just against malicious tags, but also against malicious attributes within those tags. Even on harmless-seeming div or span elements, you can include style attributes that can cause all sorts of mischief.

So, we need to write our own version of the strip_tags function that lets us specify not only which tags are permitted, but also which attributes. I have seen a number of these floating around on the Intarwebs and unfortunately, more often than not, they do not work properly.

As they parse through the string, they look for opening tags, <, and then begin processing assuming a tag has the following basic structure:

<tagName attribute="value"> </tagName>

Thus, the common approach is to:
  • Get the opening <
  • Extract the tagName that comes right after and verify that it is permitted.
  • Skip the space character after the tag name.
  • Get the attribute name, which is the text up until the = sign
  • Get the value of the attribute, which is enclosed in double quotes
  • Get the closing >
Unfortunately, for most of the code I have seen, once you stray outside of the most basic of definitions of markup, the algorithm breaks. Consider the following markup:


<tag      attribute     = "value"> </tag>

<tag[tab][tab]attribute = 'value  '     attribute2 />

<tag
   attribute
= ' value' /   >
</tag>

<tag attribute =' <<<<Some Attribute >>>>>' >
       blah blah blah </      tag>

Changing spaces to tabs or newlines, including multiple spaces, or placing < and > characters within attribute values all break many of the algorithms based on simple string searching or regular expressions (and these regular expressions are already quite horrific).
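To make the failure concrete, here is a deliberately naive pattern of the kind described. It is a hypothetical example, not taken from any particular library, and the tag and attribute names are stand-ins for the variants listed above:

```php
<?php
// A deliberately naive pattern: "<", a name, ONE space, attribute="value", ">".
// Hypothetical example of the fragile approach, not code from a real library.
$naive = '/^<(\w+) (\w+)="([^"]*)">$/';

$variants = array(
    '<img src="abc.jpg">',                      // the textbook form
    "<img\t\tsrc = 'abc.jpg'>",                 // tabs, spaces around =, single quotes
    "<img\n   src\n= 'abc.jpg' /   >",          // newlines and a self-closing slash
    '<img alt=" <<<<Some Attribute >>>>> " >',  // < and > inside a value
);

// Record which variants the pattern recognises as a tag at all.
$recognised = array();
foreach ($variants as $markup) {
    $recognised[] = (preg_match($naive, $markup) === 1);
}

var_dump($recognised);
```

Only the first variant is recognised. A filter built on such a pattern either lets the other three through untouched or mangles them, even though a browser will happily render them.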

Even worse, not a single solution I have seen thus far is UTF-8 aware, so each of them will very likely damage or destroy any multi-byte input.

While some may retort right away that not all of these markup variants are “allowed” by various specifications, the reality is that all of these work in every web browser I have tried (well, if I replace “tag” and “attribute” with something meaningful!). Therefore we, as application authors, have to worry about them and process them correctly.

In the end, we have no choice but to write a parser or “state machine” which keeps track of “where” we currently are, whether it is parsing an element, parsing an attribute, or speeding through the value of an attribute. We need to be able to handle all of the variations above and more.
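The idea can be sketched as follows. This is a minimal illustration of a state machine for a single opening tag, not the code from the StripTags archive; the function name, state names, and return format are all assumptions made for the example:

```php
<?php
// Hypothetical helper, not taken from the StripTags archive: parses ONE opening
// tag at the start of $s into its name and attributes using explicit states.
// It works byte by byte, which is safe for UTF-8 because every delimiter it
// looks for is ASCII, and ASCII bytes never occur inside a multi-byte sequence.
function parse_open_tag($s)
{
    $ws  = " \t\r\n";
    $len = strlen($s);
    if ($len < 2 || $s[0] !== '<') {
        return null;
    }

    // Read the tag name: everything up to whitespace, "/" or ">".
    $i = 1;
    $name = '';
    while ($i < $len && strpos($ws . '/>', $s[$i]) === false) {
        $name .= $s[$i++];
    }
    if ($name === '') {
        return null;
    }

    $attrs = array();
    $state = 'before_attr';
    $attr = $value = $quote = '';

    for (; $i < $len; $i++) {
        $c    = $s[$i];
        $isWs = (strpos($ws, $c) !== false);

        switch ($state) {
            case 'before_attr':            // between attributes
                if ($c === '>') {
                    return array('name' => $name, 'attributes' => $attrs);
                }
                if (!$isWs && $c !== '/') {
                    $attr  = $c;
                    $state = 'attr_name';
                }
                break;

            case 'attr_name':              // inside an attribute name
            case 'after_name':             // name finished, "=" not yet seen
                if ($c === '=') {
                    $state = 'before_value';
                } elseif ($c === '>') {
                    $attrs[$attr] = '';
                    return array('name' => $name, 'attributes' => $attrs);
                } elseif ($c === '/') {
                    $attrs[$attr] = '';
                    $state = 'before_attr';
                } elseif ($isWs) {
                    $state = 'after_name';
                } elseif ($state === 'attr_name') {
                    $attr .= $c;
                } else {
                    // A new attribute starts; the previous one had no value.
                    $attrs[$attr] = '';
                    $attr  = $c;
                    $state = 'attr_name';
                }
                break;

            case 'before_value':           // after "=", skipping whitespace
                if ($c === '"' || $c === "'") {
                    $quote = $c;
                    $value = '';
                    $state = 'quoted';
                } elseif ($c === '>') {
                    $attrs[$attr] = '';
                    return array('name' => $name, 'attributes' => $attrs);
                } elseif (!$isWs) {
                    $value = $c;
                    $state = 'unquoted';
                }
                break;

            case 'quoted':                 // only the matching quote ends the value
                if ($c === $quote) {
                    $attrs[$attr] = $value;
                    $state = 'before_attr';
                } else {
                    $value .= $c;
                }
                break;

            case 'unquoted':               // whitespace or ">" ends the value
                if ($c === '>') {
                    $attrs[$attr] = $value;
                    return array('name' => $name, 'attributes' => $attrs);
                } elseif ($isWs) {
                    $attrs[$attr] = $value;
                    $state = 'before_attr';
                } else {
                    $value .= $c;
                }
                break;
        }
    }
    return null; // input ended before the closing ">"
}

$tag = parse_open_tag("<img\t\tsrc = 'abc.jpg'     ismap />");
print_r($tag);
```

Because the "quoted" state only looks for the matching quote character, any < or > inside an attribute value is carried along as ordinary data. A real filter would go on to compare the parsed name and attributes against the list of permitted ones, and would also need to handle closing tags, comments, and text between tags.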

Thus, I have written the StripTags class, attached at the bottom of this article. Included within the archive is a test script which demonstrates some of the input on which I have tested it (it is actively being used in a couple of web applications) and shows some example usage.

The class is fully UTF-8 aware. All of the files in the archive are UTF-8 encoded, so please be careful when loading and saving them; if your editor misbehaves, it might mess things up.

To use the StripTags class, you pass to it an array. The keys are the names of the markup elements you would like to permit while the values are arrays of attributes you would like to permit on each of these. For example:

<?php
  $filter = array(
    'a' => array('href'),
    'img' => array('src', 'border', 'alt', 'title'),
    'strong' => array(),
    'em' => array(),
    'p' => array('align')
  );

  $st = new StripTags($filter);

  $safer = $st->strip($some_unsafe_string);

?>

One type of XSS that we have not yet discussed is a bit more annoying:

<img src="javascript:alert('oh noes!!!')"/>

The ability to embed script in attribute values makes life very difficult for us. One might think that we can just search for and get rid of javascript: in attribute value strings, but we still would have problems with:

<img src="vbscript:alert('oh noes!!!1!!11!')"/>
<img src=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>

There are scripting languages other than JavaScript, and Unicode escape sequences (numeric character references such as &#58; for a colon) can be used to disguise the javascript: prefix.
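A short sketch of why a plain string search misses the encoded form. The variable names are illustrative, and the encoded string spells out javascript:alert('XSS') one character reference at a time:

```php
<?php
// The attribute value, written entirely as numeric character references.
$encoded = '&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;'
         . '&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;';

// A naive search on the raw value finds nothing suspicious.
$foundRaw = (stripos($encoded, 'javascript:') !== false);

// Decoding the references first reveals what the browser will actually see.
// ENT_QUOTES makes sure &#39; (the single quote) is decoded as well.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$foundDecoded = (stripos($decoded, 'javascript:') !== false);

echo $decoded, "\n";
var_dump($foundRaw, $foundDecoded);
```

Decoding once is not a complete defence on its own, since there are other ways to obscure a value, but it shows why any check has to run on the decoded text and not on the raw input.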

The StripTags class currently takes a rather basic approach to this:

If the RemoveColons property is set to TRUE (which is the default), then the StripTags class will remove any colon characters, or Unicode escape sequences representing colons, from attribute value strings. It will, however, let strings start with:

http:
https:
ftp:

This is a bit restrictive, but until I implement a better solution, that is the way I will leave it. You can, again, turn this off completely by setting RemoveColons = FALSE, but then I’d probably tell your users to be careful (well, I might tell them that anyway … !)
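The behaviour described above could look roughly like this. It is a hypothetical stand-in written for illustration: the function name is invented, and the actual class internals may well differ:

```php
<?php
// Hypothetical stand-in for the colon-removal behaviour described above.
// The name remove_colons is invented for this example.
function remove_colons($value)
{
    // Decode character references first so that an encoded colon such as
    // &#58; is treated exactly like a literal one.
    $value = html_entity_decode($value, ENT_QUOTES, 'UTF-8');

    // Keep an allowed scheme prefix if the value starts with one.
    $prefix = '';
    if (preg_match('/^(https?|ftp):/i', $value, $m)) {
        $prefix = $m[0];
        $value  = substr($value, strlen($prefix));
    }

    // Every remaining colon is dropped.
    return $prefix . str_replace(':', '', $value);
}

echo remove_colons("javascript:alert('oh noes!!!')"), "\n";
echo remove_colons('http://url/abc.jpg'), "\n";
echo remove_colons('http://host:8080/abc.jpg'), "\n";
```

The last call shows the restrictive side of the approach: the port separator is a colon too, so it is removed along with everything else. Since the sketch returns the decoded value, it would need to go through htmlspecialchars again before being written back into an attribute.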

Here is version 0.9 of the StripTags class (I won’t consider it 1.0 until I come up with a robust solution to the inline attribute script attacks).

Download version 0.9 of StripTags for PHP5

Please do feel free to mail me at marcwan@chipmunkninja.com. This code will only work with PHP 5, as it uses class syntax and semantics not available in prior versions. I have tested it with each version starting with PHP 5.0.2.

A new function
Posted By: Bede Constantinides Dec 17, 2005 03:15:16
You would have thought that someone would have produced a fool proof php function for this purpose. Does anyone know of one?
Yes Yes ...
Posted By: marcwan Feb 04, 2006 11:10:20

nice try to find the irony in the article and prove that my site itself is vulnerable to XSS :-). I tried that first :-)

Marc.
Foolproof Function
Posted By: marcwan Feb 04, 2006 11:12:19

In Reply to Bede, BTW, the striptags I include in this article is getting increasingly close. I'm going to spend some time trying to truly stamp out the last bits of the problem. Give it a try!

Marc.
Problem with some attributes
Posted By: CleverShark Apr 08, 2006 15:59:40
I've tried using the script, but I'm having problem with doubled double-quotes on attributes (for example the src and alt attributes on an img tag). Am I the only one having this sort of problem?
My five cents
Posted By: Jasper May 10, 2007 05:26:30
Just my "I did only think 5 seconds on this problem"-suggestion for the Unicode escape sequences problem: why not checking the text with html_entity_decode($suspicious_text) for malicious code, and only use the restrictive mode if malicious code is found?
argh!
Posted By: marcwan Jun 17, 2007 03:05:35

i'll have to investigate that last bug report. that sucks!!!
dsdfsdf
Posted By: sdfsdsd Aug 15, 2007 06:32:07

<script>
document.location = "http://badurl";
</script>
hahahah
Posted By: marcwan Aug 15, 2007 20:06:14
nice try. it certainly would be ironic if my site were vulnerable to that :-)

let me try
ok let me try
Posted By: http://www.test.com Feb 18, 2009 08:16:58
\";alert('XSS');//
test
Posted By: test May 04, 2009 04:21:05
¼script¾alert(¢XSS¢)¼/script¾
Excellent, couple of noob probs though
Posted By: Darrell May 11, 2009 01:22:24
Thanks for the good work here Marc. I'll give it a go and see how it fairs.

A couple of notes for the php noobs out there as I hit a couple of problems. First the permissions on the striptagstest.php file had no world read, so the php engine did not play when trying to access it failing with the unhelpful error message 'Warning: Unknown(striptagstest.php): failed to open stream: Permission denied in Unknown on line 0.' ... give it the chmod 644 treatment and that was fixed.

There was also a couple of missing commas in the array at lines 63 and 64 giving error 'Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING, expecting ')' in /var/www/striptagstest.php on line 64'. Once added, all works like a dream. :-)
Test
Posted By: http://www.yahoo.com Nov 23, 2009 08:50:30
got this on your site
Fatal error: Uncaught exception 'PJUnknownRequestException' in /mnt/home/cn/cn-web/payjacks/src/WebApplication.php:134 Stack trace: #0 /mnt/home/cn/cn-web/www/ChipmunkNinja.php(43): WebApplication->run() #1 {main} thrown in /mnt/home/cn/cn-web/payjacks/src/WebApplication.php on line 134
Lol test
Posted By: asdfjklajsdfdfsd Jan 17, 2010 15:33:51
<?php
die;
?>
jjjjjjjjjjjh
<?php
die;
?>
&#60;&#115;&#99;&#114;&#105;&#112;&#116;&#62;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#119;&#111;&#119;&#39;&#41;&#59;&#60;&#47;&#115;&#99;&#114;&#105;&#112;&#116;&#62;
&#60;&#115;&#99;&#114;&#105;&#112;&#116;&#62;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#119;&#111;&#119;&#39;&#41;&#59;&#60;&#47;&#115;&#99;&#114;&#105;&#112;&#116;&#62;

Copyright © 2005-2008 Marc Wandschneider All Rights Reserved.