Here at Spamblogging, I occasionally like to do things way nerdier than usual and actually talk about programming or at least the idea of programming. In the past I have mentioned ways of getting SpamAssassin to work under Exchange. I have also shown a proposal of one way to try to thwart automated bots on registration forms.
Recently we talked about how VoIP looks like it might be the next frontier for spammers. In that post I theorized on a way to potentially block those spams - but I am not going to go into technical detail on that for today - but hopefully soon (especially if I can work up some example code).
But today's thoughts towards code are slightly different than usual. Different in that they don't have a spam specific role and instead are more of a general thought process I have had for some time now for general image categorization using the same tools that some spam filters in e-mail use.
Don't get me wrong, this still could be used for spam detection - or for ad detection in web browsers - but there are plenty of other uses as well.
If you don't feel like reading something fairly nerdy, feel free to not continue on with this post and go back to whatever it was you were doing prior to your arrival here.
This idea has been bouncing around in my head for some time. It seems obvious enough that it has probably been done before - but I haven't seen it done, so this is all just from my head. The concept is to train some software to look at images and then automatically categorize them based on what it has previously learned.
One clear use of this would be over at Google's image search. Their technique for searching looks at the image name and the content on the page and then assumes that the images there must also likely be related - or at least that is how it appears to me. So if you put in a search term, it looks for pages that come up for that term, and then looks for images on that page, and even better if the name of the image happens to relate as well.
But if you could get the program to actually look at the image content instead of the data all around it, then you have an additional data point which it can use select and theoretically make better decisions as to which photos to show.
My idea was to use what is essentially Bayesian analysis (arguably a dumbed down version of it) to look at the image and classify it out based on what it is "taught" by a human user.
The most obvious problem with this is that in order to teach these sorts of programs, you need many data points... and that is really boring to sit there and go through images and train it saying "this image has a cat", "this image is of birds", "this image is red", etc.
But then along came Flickr.com. They are some sort of online photo repository - I don't use it and haven't given it much attention. But then one day I looked and saw that they had "tags". These are cases where their many users with their many pictures have tagged images with terms... meaning that they have already done the work of sorting images out into categories! This is a big deal and a large time saver towards my theoretical program.
Now I could tell my program to look at any given tag over at Flickr, and that happens to be as easy as just adding the tag name as a directory to the url - so all of the images that have the tag "spam" could be seen at http://www.flickr.com/photos/tags/spam. Or change "spam" to any other keyword to see the images associated/categorized with that as their keyword (tag).
Or here is a list of their most popular tags.
This is probably a good time to mention that a lot of this might sound like I am a shill for either Google or Flickr, trying to get you to go use their (free) services. While I would love to actually be getting paid to talk about such things, I am in fact not getting paid. They have no idea I am writing this and very likely don't care.
I am merely using them as examples because they are on the web and big enough so as to be fairly reliable - at least enough for me to assume that tomorrow they will still be there and my program could touch their site and not crash.
So here goes with more meat and potatoes of how the code would work, in a general pseudo-pseudo-code sort of way (that's right, not even real pseudo-code, I'm just that lazy).
0) Programming Language.
The language in which this is written shouldn't matter. That said, I would venture to guess that JavaScript would not be ideal (too slow in execution) and C is probably going to be slower to write (getting all the middling bits right and whatnot) - so my first reaction is to use something like Perl (mod_perl?) that is fast enough to get the job done, and yet has enough built into it (or CPAN ready) that the middling bits are easy to do without any additional programming to get them working properly. Examples of such would be the "my $page = get($url);" in order to go out on the web and download a webpage as a string - vastly easier to have that already done than to have to fiddle with it in C (IMHO that is).
That said, as I will relate later, this is all an exercise in thought - so in the end it doesn't really matter what the intended language may have been on my part.
1) Database setup.
Again, this could be done without a database - just dump it into flat files for all I care. But I like the ability to do easy SQL queries to get the data and not have to muck about building something of my own. My immediate reaction was to use a MySQL database due to how easy it is and my experience with it (and now of course this is the part where groans arise from PostgreSQL lovers around the world).
Three tables laid out like so:
Categories (this will keep track of our categories that we have learned)
pkID - BIGINT
catName - VARCHAR
Images (this will keep track of images that have been learned)
pkID - BIGINT
catID - BIGINT (references Categories.pkID)
imgName - VARCHAR (this is the name in the Flickr url to get a specific image - this way we won't analyze the same image twice for a given category)
Colors
pkID - BIGINT
catID - BIGINT (this references Categories.pkID)
color - VARCHAR(6) (this is 000000 to FFFFFF, the hexidecimal representation of colors where the first two represent 256 possible ways, as do the next two, and the next - so 256^3 possible colors - aka 24 bit - 16.7M colors)
instances - BIGINT (should probably come up with a better name - this is going to count how many times this color comes up for a given category)
2) Training the beast
In order for this to work, it needs to be told categories, and then shown as many images as possible in each category. The more the better. In fact, I would raise the issue that most of the categories on Flickr right now currently don't have enough images to make this truly snazzy yet - but assuming they keep growing and people use the "tags" feature, this will get better over time.
It can start by scanning this page (http://www.flickr.com/photos/tags/) and getting all of those tags. For each of those, it just needs to go to the given tag name in the url (for example http://www.flickr.com/photos/tags/WORD - where WORD would be replaced with any potential category).
Once on a given tag page, it just needs to grab the images in that by iterating over the available page links and collecting the image IDs. For example, under the tag "2004" there is the photo with the ID 327120 that is found at the link http://www.flickr.com/photo.gne?id=327120. Now that page is of no help to us in an automated program because the image is viewed via a Flash interface. But, if we got to this link (http://www.flickr.com/photo_zoom.gne?id=327120&size=m) we then get different sizes and they are actual images that can be accessed to download. If we change over to the thumbnail view, then it has a max size of 100px on a side which is about right for us in terms of limiting size. We then have a link there to download the image.
We write to the database and make sure that we have the category in the Category table. Then we make sure that in the Image table we have the image ID (327120 in this case) in there mapped to the current category - if it is already there, then we can bail out and not bother downloading the image and skip over to the next image.
Once the image is downloaded, we need to scan over the image pixel by pixel (from the top left hand corner, back over to the right and then down over the lines that way). For each pixel, we get its color in the Hex format (000000 - FFFFFF) and write that to the Colors table. If it doesn't exist in the table, then put it in (associating it with the given category), adding its color, and incrementing up the Colors.instances value. If it does already exist, then we only need to increment it up (making sure that we are only incrementing up the instance that is both the current category and the current color).
We can also act on additional category words/names that we can pull from a flat file that has one word per line. As we use the word, we kill it from the list. That way we can create an easy web interface to append new word/tag/category suggestions to add/learn.
3) Testing new images.
So now you have this database full of useful information. Now you can create a webpage that allows users to upload images (file formats JPG, PNG, and GIF). Once uploaded, resize the image so that the largest dimension is 100px and then scan it in pixel by pixel the way we scanned during the learning process.
For each pixel, look it up in the Colors table - you will get several categories out of that, but you will want to take the value that has the largest Colors.instances value.
In the code, you will want a hash so that the keys are the categories and the values then are going to start at 0 and get incremented up. For example "red"->24, "cat->67, etc.
So for each list of colors that you look up, whatever category has the largest Colors.instances value, take that value and get the natural log of it (ln) (technically I think you could safely just take the straight value - the natural log thing just adds computation, but provides a normalization and squashing on the potentially larger numbers - the natural log matters more in genetic algorithms when you will be coming back over the code many times and the normalization/squashing matters more).
Then for the category that has the largest value, add that to the value in the hash for that category (or the natural log of it if you go that way).
Do that for each pixel in the image (which has been made smaller). Once you have made your way through the whole image, pixel by pixel, then you sort your hash on the values so that you can grab the key that has the largest value.
That key is then the category that your image is most likely to fit in.
Or you could grab the top N keys/categories and see a range of where you are likely to fit. Then just plug those back into the Flickr.com url and you can show a link to the category that it would likely fit into.
Theoretically you could open it up so that if the category was correct, then you could add this image to the learning process so that it will get better at future tests.
But you run the risk (especially with an public facing site) of data corruption by someone making it learn images that do not fit into a given category (if the category is "red" and all images within it have red in them, they could then weaken the process by making it start to learn images that have no red, but instead mostly blue in them for that category - which eventually would ruin its ability to accurately match the category).
Why I haven't programmed this myself:
This is an easy programming task. I am pretty sure I could bang this out in a day, maybe two if I got heavily distracted. So if it is easy, then why didn't I do it? I could show that it works (or fails miserably) and then be happy about it all.
But looking at the potential load on the server, here are some numbers with that.
Say we have 1000 categories, and in each category there are 100 images, and each image is 100px by 75px. That means we are looking at a maximum of 1000 * 100 * 7,500 rows that we add to the database in the "Colors" table (the other tables are small enough compared to that, we don't need to immediately consider their size).
That gives us 750,000,000 rows. Ignoring the fact that I don't even know if MySQL will crap out on that many rows (it *should* be related to space and not rows), we can then assume that our data in each row is roughly 30 bytes.
That means we have 22,500,000,000 bytes. Divide by 1024 and we get KB, divide again by 1024 and we get MB, and divide one last time by 1024 and we get GB. Roughly 21 GB of space that would take up.
My server has around 100GB of space available to me and I don't think I am even using 1GB of it at this point, so that isn't a massive issue to me. Fortunately the size only grows with the categories that it scans off of Flickr and not with the usage of the site by visitors - so this wouldn't get out of hand all that fast. But the visitors using it would put a load on the server, and the more data that it has to look through, the higher the load it will go over time.
I host a few different sites on this server and for something that would not give me any financial return at all and is nothing more than an intellectual curiosity that came into my head in the shower one day - I decided it wasn't worth the load on my server.
So I leave it open to others to play with. Here are the cookbook instructions that are all but spelled out for you, leaving only the options of what languages and databases that you want to use as the issue to resolve - and then translating the process out to code. It really isn't hard at all.
If you do have a go at it, please let me know.
What this won't do well for.
While this should do a pretty good job at quickly sorting images into categories that it has previously learned - or at least showing something is more likely to belong in category A than in category B in a visual sense... this isn't going to do well at all with language constructs since it doesn't have any code at all dedicated to such a thing.
Meaning that if you had a category called "red" and it contained images that had red in them, this program would do quite well at detecting that based on what it has learned.
But if you had a category that was all headshots of people that vote Republican and that category was called "Republican", this program would likely do a very lousy job at putting appropriate images in there (that said, it likely would be entertaining to see what comes up - which incidentally is essentially entirely the reason my Donkey Whispers site exists - the randomness is amusing).
This also won't be good for anything specific. For example, say you had a category called "birds" and in it were photos of birds. This may do okay on that if you had a sufficient learning period on that category. But if you wanted it to find a specific bird that you owned, it is not going to be able to do that at all. That is more in the realm of neural nets and would also require a more specific learning process in which it would need to find key points of the image (and it would have to account for 3D rotation which is far more computational work than I am interested in for this little fun project).
Anyway, that should about do it. I welcome anyone's comments as to how it could be better (I don't claim that this is perfect, it is just a concept that I thought up in the shower one morning after a run), and I also welcome someone else that is braver with their server(s) to actually implement it.
Flickr has an API in place, but after looking over it, I didn't see any reason that it would make this particular task any easier since there isn't anything being done other than public browsing.
Posted by Eric at September 3, 2004 08:29 AM
| TrackBack
Another option, which would take more database space, would be to keep track of partial positional data.
That way if a color were next to another color, that would give it a higher score than just the fact that the color existed at all by itself.
That should improve the accuracy of it, but still not overcome any of the language construct issues that I mentioned before.
Posted by: Eric at September 3, 2004 02:21 PM