Hashing for Similarity Detection


A few months back I was researching hashing for similarity detection; this blog post is the outcome of that.

Hashing, or a hash function as described on Wiki, is any function which maps data of arbitrary size to data of fixed size. Hashing is used almost everywhere, for different purposes: network data transfer, authentication, protecting data, de-duplication, data integrity, text/document similarity, and so on.
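A minimal sketch of that "arbitrary size in, fixed size out" property, using Python's standard library `hashlib` (the inputs below are just illustrative):

```python
import hashlib

# SHA-256 always produces a 32-byte (256-bit) digest,
# no matter how large the input is.
short = hashlib.sha256(b"hello").hexdigest()
long_ = hashlib.sha256(b"hello" * 10_000).hexdigest()

print(len(short))  # 64 hex characters
print(len(long_))  # 64 hex characters, same fixed size
```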

Hashes are mostly divided into 2 groups:

>> Cryptographic Hashes

>> Non-Cryptographic Hashes

Cryptographic Hashes are one-way, non-reversible hashes mainly used in security applications. Once a message is hashed, it cannot be recovered from the digest except by some brute-force method, which is time-consuming and difficult.

Non-Cryptographic Hashes, on the other hand, are mainly used for data integrity (CRCs, checksums) or data similarity. They are used to check whether 2 files (one on the server and the other downloaded) are exactly the same, and in antivirus applications, spam detection, etc. Non-cryptographic hashes are also used in computer forensics for detecting similar information; we'll speak more about this later in this blog post.
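Here is a small sketch of the exact-match integrity check described above, again with stdlib `hashlib`. It also shows why ordinary hashes cannot measure similarity: a single changed byte produces a completely different digest (the avalanche effect).

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
downloaded = b"The quick brown fox jumps over the lazy dog"
tampered = b"The quick brown fox jumps over the lazy cog"  # one byte changed

# Equal content gives equal digests, so download integrity checks pass...
assert hashlib.sha256(original).digest() == hashlib.sha256(downloaded).digest()

# ...but a one-byte change yields an entirely different digest, so a plain
# hash can only say "identical or not", never "how similar".
assert hashlib.sha256(original).digest() != hashlib.sha256(tampered).digest()
print("integrity check passed")
```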

Here is a Wiki list of all the cryptographic/non-cryptographic hash functions/algorithms. What doesn't seem to be mentioned in that list are hash functions/algorithms concentrating on detecting data similarity. In digital forensic investigations there is often a need to detect similar files. The first prominent algorithm, ssdeep, came out in 2006, which then led to many other whitepapers and algorithms, and terms like "context-triggered piecewise hashing (CTPH)", "fuzzy hashing" and "similarity hashing" were coined. Following are a few which are quite well known in the computer forensics world.
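To get a feel for the CTPH idea behind ssdeep, here is a toy sketch. This is NOT the real ssdeep algorithm (which uses a rolling hash and an adaptive block size); it only illustrates the core trick: cut the input at content-defined trigger points, hash each piece, and concatenate. A small edit then only disturbs the chunks it touches, instead of the whole digest. All function names and parameters here are made up for illustration.

```python
import hashlib

def toy_ctph(data: bytes, window: int = 4, modulus: int = 8) -> str:
    """Toy context-triggered piecewise hash: cut `data` wherever a hash
    of the last `window` bytes hits a trigger value, then keep one hex
    character of the hash of each chunk."""
    digest, start = [], 0
    for i in range(window, len(data)):
        ctx = data[i - window:i]
        if hashlib.md5(ctx).digest()[0] % modulus == 0:  # context trigger
            digest.append(hashlib.md5(data[start:i]).hexdigest()[0])
            start = i
    digest.append(hashlib.md5(data[start:]).hexdigest()[0])  # final chunk
    return "".join(digest)

a = toy_ctph(b"Hashing is used almost everywhere for many purposes.")
b = toy_ctph(b"Hashing is used almost everywhere for most purposes.")
# Because chunk boundaries depend only on local context, the chunks
# before the edit are identical in both inputs, so the two digests
# share a common prefix; comparing digests then measures similarity.
print(a)
print(b)
```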

The most prominent ones used for Data Similarity Detection are:

>> ssdeep (http://ssdeep.sourceforge.net/), based on SpamSum Algorithm

>> sdhash (http://roussev.net/sdhash/sdhash.html)

sdhash has the disadvantage that its hash size keeps increasing with the input file size; hence ssdeep looks to be the better one. Following are some other algorithms to consider:

>> mvHash-B (https://dasec.h-da.de/staff/breitinger-frank/)

>> mrsh-v2 (https://dasec.h-da.de/staff/breitinger-frank/)

>> And many others exist: TLSH, MinHash, SimHash, etc.
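As a taste of one of these other families, here is a minimal MinHash sketch in pure Python. MinHash estimates the Jaccard similarity of two sets — here, the sets of word 3-grams of two strings. The seeding scheme and parameters below are my own illustrative choices, not taken from any particular implementation.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Break a string into its set of word n-grams."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(items: set, num_hashes: int = 64) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over all items; similar sets share many minima."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items))
    return sig

def similarity(a: list, b: list) -> float:
    """Fraction of matching signature slots ~ Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = minhash(shingles("the quick brown fox jumps over the lazy dog"))
s2 = minhash(shingles("the quick brown fox leaps over the lazy dog"))
print(similarity(s1, s2))  # an estimate of the true Jaccard similarity
```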

 

Further Reading:

https://www.dfrws.org/sites/default/files/session-files/paper-automated_evaluation_of_approximate_matching_algorithms_on_real_data.pdf

 


About Dominic

J for JAVA. More about me: http://about.me/dominicdsouza
This entry was posted in Thechy Stuff.
