Few Months back i was researching on Hash for Similarity Detection, this blog post is the outcome of that.
Hashing or Hash Function as described on Wiki is any Function which maps data of arbitrary size to data of fixed size. Hashing is used almost everywhere with different purposes. It is used in Network Data Transfer, Authentication, Protecting Data, De-Duplication, Data Integrity, Text/Document Similarity and so on.
Mostly hashes are distinguished into 2 groups:
>> Cryptographic Hashes
>> Non-Cryptographic Hashes
Cryptographic Hashes are one way non-reversible hashes mainly used in security applications. The message once encoded into hash cannot be reverse engineered back to the message except by some brute-force method which is time consuming and difficult.
Non-Cryptographic Hashes on the other hand are mainly used for Data Integrity(CRC, Checksums) or Data Similarity. It is used to check whether 2 files (one on the server and other downloaded) are exactly the same, used in Antivirus Application, Spam Detection etc. Non cryptographic hashes are also used in computer forensics for detecting Similar information, we’ll speak more about this in this blog post.
Here is a Wiki List for all the Cryptographic/Non Cryptographic Hash Functions/ Algorithm. What doesn’t seem to be mentioned in the wiki list are hash Functions/Algorithms concentrated on detection of Data Similarity. Often in the Digital Forensic Investigation there is need to detect Similar files, first most prominent Algorithm ssdeep was out in 2006 which then led to many other whitepapers and algorithms and terms like “context-triggered piecewise hashing (CTPH)”, “fuzzy hashing”, “similarity hashing” were coined. Following are some few which are quite known in the Computer Forensic world.
The most prominent ones used for Data Similarity Detection are:
>> ssdeep (http://ssdeep.sourceforge.net/), based on SpamSum Algorithm
>> sdhash (http://roussev.net/sdhash/sdhash.html)
sdhash has a disadvantage that the hash size goes on increasing with the input file size. hence ssdeep looks to be the better one. Following are some other algorithms to consider:
>> mvHash-B (https://dasec.h-da.de/staff/breitinger-frank/)
>> msrsh-v2 (https://dasec.h-da.de/staff/breitinger-frank/)
>> And many others TLSH, MinHash, SimHash… etc do exist.
Further Reading:
Click to access paper-automated_evaluation_of_approximate_matching_algorithms_on_real_data.pdf