May 05, 2004

Proof of concept to throw off the bots.

In one of my previous entries I made reference to the concept of "harvesting e-mails". If you are not familiar with the concept, the general idea is that a "bot" is set in motion against a series of pages to grab e-mail addresses. An automated agent that roams the web - sounds sexy! All it really means is that there is a script running that loads a web page, reads the HTML looking for certain attributes, and then reacts based on those attributes. In this case it looks for things that look like e-mail addresses - most obviously things that are in the form of "name@server.extension" - although they are far more sophisticated than just that these days.

In addition to that type of bot, there are also those that automate services originally intended only for human use. An example we have all run into is on sites like Yahoo, from which you can send e-mail. It would be great for spammers if they could automate the process of signing up for an account and then sending e-mail... and they did this for a while. Yahoo eventually got wise to this and added a step in which users have to type the text shown in an image into a form - for instance the user might be shown an image that said "beans" and would then have to enter that into a text box and submit it as a way of showing that they weren't a bot.
Granted, it isn't as exciting or sexy as the retinal scans for bot detection in Blade Runner, but it gets the job done against simple bots.

The bot writers started making them more intelligent and sophisticated. They started making the bots look for these images and then do OCR (optical character recognition) on the images (usually done via trained neural nets that are familiar with the type of font that you are using). They could then have the bot determine what the text was and enter it into the form - fooling the system and then allowing them to spam from a free account.

In the ongoing cat and mouse game that has evolved from this, the bot thwarting programmers then started making the images harder for the neural net code to "read" - the images are now blurred, the text is broken-up, the fonts are different, there is noise added and lines crisscross the image, etc. But this isn't foolproof - the bots can still get the images and as long as they can train on those images - they can beat them.
Even if they only beat them some of the time, those times that they get through are still enough for them to send mail out - which is all that they care about. You can have many bots running at the same time, trying to get in - eventually some will get through the defenses (sounds stupidly like Matrix Revolutions).

In a recent meandering of thought, I stumbled upon an idea that would help confuse the bots for a little bit longer in this cat and mouse game, and so what follows is my proof of concept - or at least the very rough outlines of it.
And if I feel daring (stupid?), then I will also post the exploit to it as well - because while it does make it harder for the bots, it doesn't make it impossible. Or perhaps I will leave that up to the bot makers to figure out for themselves.

First off, I should note that I don't know that I am the first person to do this. I briefly looked around and largely concluded that if there is not a Perl Module that already does this (prior to me writing one), then it is unlikely to be common practice anywhere on the net (or even relatively uncommon practice).
I should also add that everyone in "Real Life" that I have talked to about this has shown overwhelming ambivalence about it, which can either be chalked up to not knowing enough about it to care, or... well, just not caring in general.
Perhaps that is a sign of the value of the idea (*cough*worthless*cough*).

That said, let's roll.
(some of this will be a rehash on concepts touched on in my original thoughts on this, but hopefully in more detail)

Backhistory

A few years back I looked at a Perl Module called "ThousandWords" that allowed one to feed it an image and some text and it would then dump out things that look like this image/page I created using Radiohead's lyrics up until the time I made it as well as a popular image of theirs.
At the time, I thought it would be interesting to modify that code to produce the same images, but instead of in text, do it in DIVs. I changed the code around and in general didn't have too much success since browsers tend to really get unhappy if there are too many elements on a page to render. Also didn't get the positioning of the DIVs worked out very well (didn't try particularly hard either).
I wasn't trying to make the DIVs as small as a pixel, but instead more on the size of 10px on a side or larger. I was trying for a Chuck Close or photomosaic type of look/feel.

After losing interest since the browser clearly didn't handle it well, I didn't give it much thought again until recently. A few days ago I ran across the CSS Pencils page in my RSS feed. This person (Chris Hester) did something so brilliantly obvious that it really made me want to smack myself right in the forehead with a big DUH!
Chris was also converting images into DIVs, and he probably noticed that as the DIV count increased on screen, the browsers would crap out. Where I ignorantly gave up and walked away, Chris went on to treat the DIV like 3 pixels instead of just a single pixel. This obviously means that you can then cut the number of DIVs that you are using by as much as a factor of 3.

Ever So Slightly Technical

Technically a DIV could represent 5 pixels: one for each border, and then the center background color of the DIV. But if you take some grid paper and block off a rectangle and then try to optimally fill the space (treating each pixel of the DIV as one block in the grid), you will find that the 5 point DIV (think the shape of the "+" symbol) leaves many pixels as singles which means you then need to go in and fill them individually.
Now, if you take the same rectangle and fill it with 3 pixel shapes (it is easier later programmatically if you treat the shapes as 3 horizontal pixels instead of vertical), you can more easily and efficiently cover the space. This obviously works best if the shape that you wish to tile has a width that is evenly divisible by 3 (Width % 3 == 0) - the height (number of rows of pixels) doesn't matter since the height of each DIV is just a single pixel (or row if you look at it that way).
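
As a rough sketch of that tiling arithmetic (in Python rather than the Perl used for the scripts here, and with made-up function names), the DIV count for a given image size could be worked out like this. It assumes each row is tiled on its own with horizontal strips, and that a row whose width isn't divisible by 3 just gets one extra, partly used DIV.

```python
def div_count(width, height, pixels_per_div=3):
    """DIVs needed to cover a width x height image with 1-pixel-tall strips."""
    # Round up so a leftover pixel or two at the end of a row still gets a DIV.
    divs_per_row = (width + pixels_per_div - 1) // pixels_per_div
    return divs_per_row * height


def leftover_pixels(width, pixels_per_div=3):
    """Pixels per row that do not fill a whole strip (0 when width % 3 == 0)."""
    return width % pixels_per_div


print(div_count(30, 30))    # 300
print(div_count(31, 30))    # 330
print(leftover_pixels(31))  # 1
```

One extra pixel of width costs a whole extra DIV on every row, which is why a width divisible by 3 is the comfortable case.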

So now that we can reduce our DIV count by a factor of 3, that means that it is now much more feasible to create some images on screen - especially if they are quite small.

Chris did much more with his CSS Pencils and even went so far as to show various color tricks - while quite cool, they didn't particularly interest me since they were slow and didn't strike me as terribly useful.
My main reaction was to wonder whether this could be useful for something beyond just being cool.

What Can I Use It For?

My immediate thought was of sites that try to prevent people from downloading their images. They usually disable right-clicking ability via javascript in the hopes that will deter the bulk of people from being able to save the images. Of course, it is still easy to look at the source, get the image URL, and then load that directly. The other alternative is to take a screen capture and then crop out the image that you want.
This HTML/CSS image variant (divpixel rendering?) doesn't at all prevent the screen shot solution, so I'm not sure it is all that useful for this scenario (plus as we show later, the size wouldn't be all that helpful either).

Then it occurred to me that it might be useful to throw off bots that are looking for text. There are places that show e-mails in image form to avoid plain-text searching bots, and there are places that do what I was speaking of before to deter bot usage of programs intended only for human use (show an image, human reads text/numbers off of the image and then inputs it into the form).


Gotta Ask - How Big?

From here, I wanted to get a grasp on how much HTML/CSS code is created just to replicate what is in an image so that I could determine at what point the divpixel image is too inefficient to ever use.
For example, if it is going to be used on Yahoo every time that they want to make sure someone is a human and not a bot, then they are going to want it to be as small as possible to reduce the amount of bandwidth that they use up on this process.

So using a theoretical image that is 30 X 30 pixels in size, we can do some quick paper napkin calculations and get some figures. Using Chris Hester's technique of 3 pixels per DIV, we could then see... 30*30 is 900 total pixels. Divided by 3 means 300 DIVs that will be on screen.
Now each DIV, going by his design, would look like this:


<div id="p999">&nbsp;</div>

Chris puts an "id" attribute on each DIV tag, which allows him to have specific positioning for each tag, in addition to the colors for it. The "p" has to be there in the id because you can't have an id that starts with a number (HTML 4 requires id values to begin with a letter). My guess is that this stems from JavaScript's propensity to see "array[16]" and "array['16']" as the same thing (haven't tested that on all browsers, but I know I have seen issues with this in the past), so you need the "p" to differentiate between an index of a DOM array and an associative array key.
Since we are using id attributes, then that means that there will be CSS code for every single DIV that we create.
Each one would look something like:

#p999{border-left:1px solid #FFFFFF;background:#FFF000;border-right:1px solid #000FFF;}

That is roughly 100 bytes of CSS code per DIV, and then roughly 25 bytes per DIV itself. So 125 bytes to represent 3 pixels - not terribly efficient at all.

**note**
that is a rough example of what he is doing and doesn't account for DIV positioning at all at this point

So that means we have 300 * 125, giving us 37,500 bytes, or roughly 37KB.

That in itself isn't too terrible, although it would allow quite a large and complicated JPEG image in the same disk space (or think bandwidth in this case).

What if we had a 300 x 300 divpixel image?
Quickly going through all of that again we have:
90,000 pixels
Divided by 3 is 30,000 DIVs
times 125 is 3,750,000 bytes or 3.6MB give or take a bit... which is HUGE and you get very little real image out of it.
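
For anyone who wants to rerun the napkin math with other sizes, here is a small Python sketch. The 100 and 25 byte figures are the rough per-DIV estimates from above, not measured values, and the function name is just for illustration.

```python
CSS_BYTES_PER_DIV = 100   # rough size of one "#p999{...}" rule
HTML_BYTES_PER_DIV = 25   # rough size of one <div id="p999">&nbsp;</div>


def estimate_bytes(width, height, pixels_per_div=3):
    """Rough HTML + CSS size of a divpixel image using one id rule per DIV."""
    divs = (width * height) // pixels_per_div
    return divs * (CSS_BYTES_PER_DIV + HTML_BYTES_PER_DIV)


for side in (30, 300):
    total = estimate_bytes(side, side)
    print(f"{side} x {side}: {total:,} bytes")
# 30 x 30: 37,500 bytes
# 300 x 300: 3,750,000 bytes
```

The size grows with the pixel area, so making each side ten times longer makes the page a hundred times bigger.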

So from this we learned that we want to keep our divpixel image small - that is still okay for our bot confusion code since these images only need to be small.

Now if we saved an image in the PNG format, it gets DEFLATE compression (not the LZW that GIF uses; as far as we need to know, think of it as something like turning "aaaaabbbcccccc" into "5a3b6c"), so for images that have large areas of the same color it can greatly reduce the space used by the image.
Can we do that with a divpixel image?
Nope. Especially not with the method Chris uses (which is perhaps tied in to his need to do special effects like channels and greyscale and whatnot).
But on the flipside, that method doesn't take up any additional space if it is a complicated image (lots of colors and items, busy like a photo of a forest with a clothesline full of laundry in front of it).
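
The "aaaaabbbcccccc" to "5a3b6c" shorthand is really run-length encoding, which is a simplification of what an image format actually does, but it shows the idea. A minimal Python sketch (the function name is made up for this example):

```python
from itertools import groupby


def run_length_encode(text):
    """Collapse each run of a repeated character into '<count><char>'."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))


print(run_length_encode("aaaaabbbcccccc"))  # 5a3b6c
print(run_length_encode("abc"))             # 1a1b1c
```

The second call shows the flipside: with no repeats at all, the "compressed" form is longer than the input.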

Show Some Class

But there is a way that we can simplify it so that we can cut out a lot of that CSS code. Instead of using "id" attributes on each DIV, we can just use "class" attributes and for each color triad, we create a class. That means if we have a DIV that has the color triad "red white blue" then that would be one class. If we then had another that was "red red white", that again would be a new class. But then if another DIV came up and it needed to be "red white blue" we wouldn't have to create a new class since one already has that color scheme - so we would just assign it that class.
DIV tags (any tags really) can share the same class attribute, but only one tag can have any given id attribute. (this is so that when referencing the id in the DOM you will only refer to a single element)

What Does That Mean For Our Size?

Well, initially it means a little bit more size since instead of "id" we are going to be writing out "class" for each one - that is 3 bytes longer.
BUT - we are getting rid of entire CSS sections - so unless every single DIV has a unique color combination, we are going to use less CSS code than the other version, which saves us about 100 bytes each time.

We can achieve this through two hashes in Perl (I'm not going to go into the various inner workings of a hash or what it is called in other languages).
One maps the class names to the color triad. We will name the classes like "c9999", where the number ranges from 0 up to however many classes we have (at most the number of DIVs minus one), and remember that the name starts with a letter so that it will validate properly in the browser. The other hash maps the color triad to a class name.

That way, when we figure out what the next DIV's colors should be, we can look those up and if they don't exist in the hash, we put them in and use that class right away.
If they do exist, then we can look up the color triad to see what the class name is.
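
Here is a minimal sketch of that two-hash bookkeeping, using Python dicts in place of Perl hashes. The class and method names are invented for this example, and the CSS rule follows the border/background/border layout shown earlier.

```python
class ClassRegistry:
    """Hands out one CSS class name per distinct color triad."""

    def __init__(self):
        self.class_to_triad = {}  # "c0" -> ("#FF0000", "#FFFFFF", "#0000FF")
        self.triad_to_class = {}  # the same mapping in the other direction

    def class_for(self, triad):
        """Return the class for a triad, creating it on first sight."""
        name = self.triad_to_class.get(triad)
        if name is None:
            name = f"c{len(self.triad_to_class)}"  # starts with a letter
            self.triad_to_class[triad] = name
            self.class_to_triad[name] = triad
        return name

    def css_rule(self, name):
        """Build the CSS rule for one class name."""
        left, middle, right = self.class_to_triad[name]
        return (f".{name}{{border-left:1px solid {left};"
                f"background:{middle};border-right:1px solid {right};}}")


registry = ClassRegistry()
red_white_blue = ("#FF0000", "#FFFFFF", "#0000FF")
red_red_white = ("#FF0000", "#FF0000", "#FFFFFF")
print(registry.class_for(red_white_blue))  # c0
print(registry.class_for(red_red_white))   # c1
print(registry.class_for(red_white_blue))  # c0 again, no new CSS needed
print(registry.css_rule("c0"))
```

Only two rules end up in the stylesheet no matter how many DIVs reuse those two triads.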

Gotta Put It Somewhere

The nice part about the "id" attribute version is that you could set the position of each one in the CSS for it. But since we have gotten away from that approach and moved to a "class" variation that allows DIVs to share CSS data, we have to do something else.
DIVs (all tags really) will allow you to add a "style" attribute to them, and in there you can put CSS data as well. So we are going to position each DIV (programmatically) so that the image shape fills out properly.
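
A rough Python sketch of what that inline positioning could look like. It assumes absolute positioning in pixels and a DIV that is 3 pixels wide in total (1px left border, 1px of background, 1px right border). The function name and the exact style string are illustrative, not the output of the actual script.

```python
def positioned_divs(class_rows, left=0, top=0):
    """Build DIV markup from rows of class names, one class per 3-pixel DIV."""
    lines = []
    for y, row in enumerate(class_rows):
        for x, name in enumerate(row):
            # Each DIV covers 3 horizontal pixels and a single row.
            style = f"position:absolute;left:{left + 3 * x}px;top:{top + y}px;"
            lines.append(f'<div class="{name}" style="{style}">&nbsp;</div>')
    return "\n".join(lines)


print(positioned_divs([["c0", "c1"], ["c1", "c0"]]))
```

The style attribute adds a fair number of bytes to every DIV, so it eats into the savings from sharing classes.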

I think we could probably get away from this if we put a DIV around them all and set the dimensions on it and then fill the inside with relative position DIVs and set their float to left - but I haven't really tested it on this set of HTML and I'm feeling kind of lazy right now.

Yeah, This is Boring - Show Me Examples

Okay, so here is something visual.
First off - here is an image in PNG format. That could be the sort of image that we might find on any number of sites trying to avoid bots.
That image is less than 1K in size.

Here is the same image as a divpixel HTML/CSS page.
That beast is around 10K in size. 10 times more stuff to show the same thing.
(We could try to use gzip to compress this down if the server and browser both were okay with it - the downside of that is that browsers will usually cache that and then you run into issues since this is going to be dynamically generated)

Here is the code that took a Photoshop created PNG image (that I purposely made divisible evenly by 3 for its width... and height too although that doesn't matter as much) and turned it into the divpixel image.

The next step would be to have a script that would do the same sort of thing, but it would randomly generate a string and then create an image object in memory and then dump out the divpixel image for that. If you were using this on a site, you would want that code to also write to a database of some sort (or a flat file I guess) that you could later check against the user's submission.
You would not want to take in a parameter via the URL since that would then be easy for the bots to exploit and defeat the whole purpose of this exercise.
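
To make the "store it on the server, don't pass it in the URL" point concrete, here is a hedged Python sketch. An in-memory dict stands in for the database or flat file, and the class name, the token scheme, and the 5-character length are all assumptions for illustration. The browser would only ever receive the opaque token (say, in a hidden form field) plus the divpixel rendering of the answer.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


class ChallengeStore:
    """Keeps challenge answers on the server, keyed by an opaque token."""

    def __init__(self):
        self._answers = {}

    def issue(self, length=5):
        """Create a random answer and a token that reveals nothing about it."""
        answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
        token = secrets.token_hex(16)
        self._answers[token] = answer
        return token, answer  # the answer is only used to draw the image

    def check(self, token, submitted):
        """Validate a submission; each token works once, pass or fail."""
        expected = self._answers.pop(token, None)
        return expected is not None and expected == submitted.strip().upper()


store = ChallengeStore()
token, answer = store.issue()
print(store.check(token, answer))  # True
print(store.check(token, answer))  # False, the token is already used up
```

Making each token single use means a bot cannot keep guessing against the same challenge.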

Here is an example of that sort of script in action, and here is the code of that script.

Instead of randomly generated text, you could instead feed it e-mail addresses (although you would likely want to use a longer and evenly divisible by 3 divpixel image) or anything else.

(if you want a better explanation of any part of the code, post up and I can go into more detail - it should be fairly self explanatory, well, assuming you know Perl)

Can the Bots Get Around It?

Now, in order to get the code, they need access to it. If we had a static page (and here a page that is dynamic on the server side is effectively static once it is sent to the browser, since nothing happens on the client side and it is just the final HTML/CSS code), then they could parse the source easily with a bot - it downloads the source code and moves through it.
They would then have the pain in the ass task of having to recreate the image - but it wouldn't be impossible.
I have come up with a way (a few ways technically, combining them is what works best) that makes it way harder and/or more complicated for them to get the data, but am still working out all of the details of it. If there is no demand for it, then it seems a bit daft to spend the time any further on it.

Is It Worth It?

In closing, I'm not sure that this is any better than doing it in Flash (it seems everyone has the plugin these days, so you don't have that argument).
It has been over a year since I have looked at that, so I don't know how easily scripts can get the text and/or URLs out of compiled Flash code.

And Done (for now)

There are still a few areas where I could clean it up a bit more. I could create a Perl Module out of it. I could show examples of it in PHP, ASP (ugh), Cold Fusion, etc. I could show how to better obfuscate it in the output so that the bots can't get through.
But unless someone says "yeah, that isn't a bad idea" then I'm going to feel better about having done the brain dump and go back to the various projects I was doing before.

This has been tested to work under IE 6 on XP, Mozilla under XP, Firebird under OS X 10.3, and Safari 1.2.1.
If it isn't on that list, then it hasn't been tested - but doesn't mean that it won't work.

Anyway, that about covers it, so feel free to debate, discuss, mock, question, or just write out "PECKER PECKER PECKER".

Posted by Eric at May 5, 2004 12:26 AM | TrackBack

Comments

Eric:

You stated:

---
Now if we saved an image with the PNG format, it does something with LWZ compression (which as far as we need to know turns "aaaaabbbcccccc" into "5a3b6c") and therefore images that have a lot of space in them that are the same color, then it can greatly reduce the space used by the image.
Can we do that with a divpixel image?
Nope.
---

But, couldn't you? you could make a div that has a 5px width left border of color a, a 3px box width with background b, and a 6px right border with color c. You really shouldn't have to be restricted to a 3 pixel width div, only a 3 color div.

While this might not buy you much if you're displaying a true-color image, it'd buy you a ton on a two color "text on background" image.

Or, am I missing something?

Posted by: Mark Clements at May 5, 2004 02:43 PM
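
A rough Python sketch of Mark's suggestion, for illustration only: it groups a row of pixels into runs of the same color and then packs up to three runs into one DIV as left border, background, and right border. The function names and the exact style string are assumptions, not code from the post.

```python
from itertools import groupby


def row_to_runs(row):
    """Turn a row of pixel colors into (color, run_length) pairs."""
    return [(color, len(list(group))) for color, group in groupby(row)]


def runs_to_styles(runs):
    """Pack up to three runs per DIV: left border, background, right border."""
    styles = []
    for start in range(0, len(runs), 3):
        chunk = runs[start:start + 3]
        if len(chunk) == 1:
            left, body, right = None, chunk[0], None
        elif len(chunk) == 2:
            left, body, right = chunk[0], chunk[1], None
        else:
            left, body, right = chunk
        parts = []
        if left:
            parts.append(f"border-left:{left[1]}px solid {left[0]};")
        parts.append(f"width:{body[1]}px;background:{body[0]};")
        if right:
            parts.append(f"border-right:{right[1]}px solid {right[0]};")
        styles.append("".join(parts))
    return styles


row = list("aaaaabbbcccccc")  # 14 pixels in three colors
print(runs_to_styles(row_to_runs(row)))
# ['border-left:5px solid a;width:3px;background:b;border-right:6px solid c;']
```

Here 14 pixels fit in a single DIV, where the fixed 3-pixel approach would need five.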

Sounds like a great idea to me - a forehead smacker on my part (maybe I just need more sleep these days).

Then, just like PNGs and GIFs, if you have long stretches (bands) of color, then it works well in terms of compressing.

In the case of our examples here, it would work very well. In the case of something with many different colors, it wouldn't work as well at all.

When I get home this evening I will make some changes to the code and put it up - thanks for the suggestion.

Posted by: Eric at May 5, 2004 02:55 PM

Alright! Superb idea. First the bad news: it don't work in Opera 7. There are gaps in the outputted image (you cannot read the numbers). I know when I did the CSS Pencils demo I kept testing in Mozilla, Opera and IE. I had a similar problem back then, but can't remember how I solved it. I think your CSS needs to be in just the right order, or something.

Now here are a few ways to improve on your idea: firstly, look at this guy's example, which uses a solid background made without single divs, by running identical colours together:

http://www.stunicholls.myby.co.uk/menu/cssart.html

Next, why duplicate a proper font? All it needs to be is legible - so why not use solid black pixels only, say 5 high? That will give you enough dots to make lettering and numbers. That way the code can be *much* smaller. You can of course also enlarge the div-pixels, so each might be 4x4 or larger.

This would also mean just *two* colours are needed. If one is the background colour of the page, then only 1 colour is needed! In other words, no need for classes.

You could also mess up the letters with random shifts or lines to confuse bots.

You already mentioned using relative positioning and floats. Of course! I wish I had thought of that, except my image needed to be zoomed in. So you could just use *one* div, and fill it with floats! So no need for positioning in your CSS either!

That should leave you with some pretty minimal code. Try it.

Posted by: Chris Hester at May 5, 2004 04:10 PM

Chris - thanks for stopping by! Great ideas - I will give those a shot later tonight... or at least later if not tonight.

Someone over at PerlMonks pointed out that this would not help blind issues (or other disabilities I suppose). That is a tricky area since in order to have a blind person "read" it, they usually have a computer reading the data to them. If the computer can see it, then any bot will see it as well.
Apparently there are a series of lawsuits out right now that people with disabilities are filing against sites using this sort of anti-bot obfuscation.

I think too many people are misreading my idea as a "proof of perfection" instead of a "proof of concept" - this is just an idea and I'm fleshing out the concepts. I am not saying it is perfect or even the way to go - just saying it is another way of doing things.

I'm glad to hear any ideas to make it better - thanks again!

Posted by: Eric at May 5, 2004 04:31 PM

What a brilliant concept.
I have a thought that may cut down on the number of divs per character...why not model each character on a seven segment display? Not quite sure of the details yet but I may investigate.
Something along the lines of an outer div with just seven inner divs, maybe floated left, for each character. The inner divs would be just two types :
1) a horizontal line.
2) a vertical line (half width).
These divs could be black or white.


Posted by: Stu at May 6, 2004 09:23 AM

Stu - that is similar to what Chris suggested, but even more pared down - sounds great!

Again, all of these ideas break the disabilities issue where those that can't see the screen well enough to make this out are blocked from the process (normally they would use a program that reads the screen text).
It is an unintentional side effect that will always come about if you are trying to avoid things that can be parsed by the computer (then they can be parsed by any bot really).

And Stu, your idea would technically perhaps lend itself to easier "cracking" by bots since various characters would have the same DIV signature (or rather CSS signature) and it could just look for those and then order them to get the string. (which isn't too entirely different than the same issue that all of these divpixel sorts of things fall to)

Posted by: Eric at May 6, 2004 10:29 AM

Just cobbled together a demo of numbers 0-9 using a 5x5 segment display each segment being 2px x 2px.

The css is very small and each number is displayed using just 6 divs including an outer containing div.

I think that bots would take a while to crack this.
You could always duplicate segment styles and randomly use each style so that there would be no logical order to any particular number.

The demo is here
http://www.stunicholls.myby.co.uk/cssforum/segment.html

Posted by: Stu at May 6, 2004 05:40 PM

Stu, that is excellent!
This is the sort of thing I was hoping from this discussion - people adding ideas and trying new things.

Good stuff.

Posted by: Eric at May 6, 2004 05:45 PM

That's the kind of thing I was imagining! Well done Stu!

Posted by: Chris Hester at May 6, 2004 06:13 PM

Font has been refined just a little to give a better look. Problem in IE5.5 though, that needs sorting.

Posted by: Stu at May 8, 2004 08:15 AM

Stu, that is fantastic.

Posted by: Eric at May 8, 2004 01:42 PM

I have added a post about Stu's example on my site:

http://www.designdetector.com/archives/04/05/CSSFont.php

Posted by: Chris Hester at May 8, 2004 05:00 PM

Great idea!

One point: replacing 1px DIVS with 3px DIVS can cut the number by a factor of 9, not 3: 1 square yard = 9 square feet.

Posted by: Dave Richards at May 14, 2004 01:58 AM

"One point: replacing 1px DIVS with 3px DIVS can cut the number by a factor of 9, not 3: 1 square yard = 9 square feet."

Good point Dave - except in this case (or at least in the case of the CSS pencils and in my original proof - Stu's update is different) it is only the horizontal that we are using, not the vertical at all. So instead of squared area, we are only dealing with the single pixel in height and then 3 in width - so it is still 3.

If you went into the horizontal and vertical directions with the 3 pixels, then the borders would be lines (technically the center background color would still be a single pixel, but the borders would then all be filled in - and the corners would depend on what various sides are colored).
So that is why it is just the horizontal (you could do just the vertical, but it gets harder in terms of scanning the image - instead of simply scanning over the image by each line, you have to scan ahead in a slightly awkward way and that would add more code).

Looking at the other variations - like what Stu did - then you get into larger blocks of color which come from the textual pieces and therefore really help reduce the size.

Posted by: Eric at May 14, 2004 07:27 AM

If your objective is to obfuscate images then I can suggest another approach. Cut the image into smaller chunks and render them alongside each other. Some image chunks will be identical allowing compression.

BTW any such scheme may be broken if a harvester can screenscrape.

Interesting discussion though. I wonder if SVG might be translated to CSS so that it can be displayed without ASV or CSV?

Posted by: Pete Forman at May 18, 2004 11:05 AM

Another way to bust the bots further (especially the "screen scrapers") would be to use this method to post some simple equation or other idea where some processing of the presented information has to take place in order to produce a correct answer.

Some examples:
2+3=
One plus 3 equals?
7 less six =
What is the plural of goose?
Next # in pattern: 1,2,3?

And so on.

If you wanted smaller dynamic output, you could put some static label with instructions (e.g. "What is the plural of:" or "Solve for the following equation:").

Another idea would be to include a static list of items and then describe one of them in your dynamic output and have the person enter the number (or letter or whatever) corresponding to that item.

Combining some of these ideas with what you have here could put you in the upper echelon of bot-busting (at least for a little while).

Anyway, great job. This is fun stuff. I am excited to implement some of this type of thing on my site. Keep up the good work.

Posted by: NateL at May 20, 2004 04:15 PM

Hey, will you please remove my e-mail address from there? I didn't realize you would post it in a way that bots can read, especially on a post like this. =)

Thanks.

Posted by: NateL at May 20, 2004 04:16 PM

1) I removed your e-mail from your comment

2) the e-mails show up on the mouseover, but aren't as plain text as they look - that said, they are still not impossible to screen scrape (the bot has to be specifically aware of MoveableType's way of doing it - which I would say there are likely plenty that are).

Posted by: Eric at May 20, 2004 05:01 PM

In terms of the issues with blind people using screen readers. Would there be a way to give the screen readers your code? I mean this is sort of getting ahead of things but I assume there are preferred screen readers out there, could the makers of those products not be contacted and given a way to read the code directly or to bypass it somehow?

To me this seems like a really good idea, being a server admin I hate spammers with a passion they cause so many headaches that I'm sure everyone is already aware of. Anyway, I would love to implement this on some of the sites I administer (bots that sign up for phpbb just to spam a web page in their profile........ really pointless). Good luck!

Posted by: Justin Mullis at May 23, 2004 04:48 PM

If the css font was enclosed in a 'hidden' span that only became visible on :hover, would this stop the 'screen scrapers'?

Posted by: Stu at May 24, 2004 10:37 AM

Stu, the "screen scrapers" technically don't care what is on the screen itself.
What they are actually doing is putting in an HTTP request and then getting the source code returned to them - exactly the same way the browser does. But instead of rendering it out to be visible to humans (the way a browser does), it just goes through the source of the page as a big string and does regular expression matching on it.

So regardless of the CSS attribute on it, it will still see the entity there.

Posted by: Eric at May 24, 2004 11:49 AM
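
A tiny Python sketch of that point. The pattern below is a deliberately simple stand-in for whatever a real harvester uses, and the page snippets are made up. The scraper only sees the source as a string, so CSS that hides an element from a human does nothing to hide it from the match.

```python
import re

# A simple "looks like an e-mail address" pattern (illustrative only).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

hidden_text = '<p>Contact: <span style="display:none">someone@example.com</span></p>'
divpixel = '<div class="c0" style="position:absolute;left:0px;top:0px;">&nbsp;</div>'

print(EMAIL_PATTERN.findall(hidden_text))  # ['someone@example.com']
print(EMAIL_PATTERN.findall(divpixel))     # []
```

The hidden span is found immediately, while the divpixel markup contains no address text to match.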

Cool stuff ;)
Hopefully I'll have something to add soon?
I just know there's some K.I.S.S. concept to pull some of these together & make it work.

Keep up the great work!

Posted by: zig at May 29, 2004 03:49 PM

i am caught on the crawlers that use ocr to interpret the images into numbers.

if i was yahoo, i would start asking questions to verify if they are a bot or a human.

"what color is the sun?"
"where do birds go in the winter?"
"why is the sky blue?"
"what is the meaning of life?"

you know. easy questions that someone could answer. if the person or bot can answer then they get the account. if not then they don't get the account.

Posted by: thedude at August 19, 2004 02:09 AM

Agreed - plus that has the nice side of working for the visually impaired since their software can read that off to them.

That said, the bot writer could easily do a series of cases and if it matches (partial match, whatever), then feed in this answer, or this one for this match, etc.
The site could keep adding more, but so could the bot writer.
In the end, the site would have the advantage to always be able to add more questions - and the bot writer would have the advantage that it only takes a few correct answers to get a few accounts, from which they can send a lot of messages.

Posted by: Eric at August 19, 2004 07:57 AM




