faq
flatforty
contribute
subscribe
configure
search
rdf
main
parent
thread
|
Howdy! :)
by Mosfet on Monday 31/Dec/2001, @21:24
|
Hiya, nice to see you here! I found the algorithm in findimagedupes more accurate. It's really rather clever. The algorithm I used is virtually identical to yours. As a matter of fact, the reason I started porting effects from ImageMagick was to get the necessary blur, normalize, and equalize effects as fast as possible ;-) You'd think it would be slower than averaging blocks of pixels like GQView, but actually it's not. In PixiePlus it's way fast :)
Findimagedupes performs quite well with a small amount of images, but really suffers from a small hash table size. For 47 images PixiePlus takes 11 seconds while findimagedupes takes 20. That's really not bad since it's compiled C++ vs. interpreted code. But when you increase the number to 5,049 images PixiePlus only takes 23 minutes while findimagedupes takes 1hr 16min.
PixiePlus also uses a persitant database, but it's not compatible with findimagedupe's. It's a binary database that also includes timestamps so you know if an image is modified and needs to get a new fingerprint. I could write an import and export to findimagedupes if you think it would be useful. I certainly can see value in a commandline utitlity for this as well as Pixie :)
Here's a comparison of the different methods I wrote for the Pixie docs. I also did a quicky screenshot of the comparison results which uses a nifty tree view (double click on the image to view, right click for a menu). It's at http://www.mosfet.org/compare.png. Of course there is also a 2 step progress dialog for generating/loading fingerprints and comparing ;-)
--------------------------
Notes on simularity comparison algoritm and performance
Much attention was spent on making finding similiar images perform as
fast as possible while still being accurate. Compared with other Unix
offerings PixiePlus performs quite well in this aspect.
There are two main algorithms for finding similiar images. One is used
by GQView and ShowImg. It is based on averaging blocks of color
channels and comparing the result. This is the obvious algorithm and
the one I was originally going to adopt. In GQView it performs with
acceptable speed, in ShowImg it does not. I'm not sure why this is
since the code is pretty much directly copied from GQView. Either way,
it tends to miss some matches found in the other available algorithm.
The other algorithm isn't based on colors at all but identifying the
general patterns in the image. This is the method used by the Perl
utility findimagedupes and is the method I adopted for Pixie. What it
does is sample an image to a standard size, apply a couple effects to
get rid of abnormalities, scale it down again, then convert it into a
string of bits suitable for use as a thumbprint. While it sounds like
it would take much longer than the above method, in reality it
doesn't, and it finds similiar images the other algorithm misses.
Like findimagedupes and unlike GQView and ShowImg a persistent
database of fingerprints is used so you don't have to regenerate
fingerprints or calculate image blocks each time you compare images. A
binary database is used in order to avoid parsing ASCII
representations of the binary thumbprint. Unlike findimagedupes the
database also contains timestamps so it can tell if an image has been
modified. A large hash table is used and should perform well for
folders up to 6,000 images. After that expect performance degragation.
Findimagedupes especially suffers from this. On my test machine,
(AMDK6 450MHZ/128M), initially comparing 47 large images takes 11 sec
on Pixie and 20 sec using findimagedupes. This is really good speed
for findimagedupes considering it's in Perl. But when the number of
images is increased to 5,049 findimagedupes slows down to 1hr 16min
while PixiePlus only takes 23min 14 sec (the fastest of the bunch). |
|
|