What is this?

A tool for effectively creating a similarity matrix between many different files. It effectively ignores aspects that are common across all files by highlighting “rare” similarites (e.g. two files that are similar to each-other but not-similar to the rest of the group). Currently the tool supports python, but it is fairly generic and Javascript support is planned.

How to use

# quick
code_compare --lang python --  ./file1.py ./file2.py ./file3.py ...

# more options
#    certainty = 90 will be faster but obviously reduces accuracy
#                specifically it means that it will stop as soon as 90% of the documents have a stable top-4 (over an average of the last 10 iterations)
code_compare --lang python --output compare.ignore --certainty 90 --  ./file1.py ./file2.py ./file3.py ...

How to install

Get Deno:

s=https://deno.land/install.sh;sh -s v1.36.1 <<<"$(curl -fsSL $s || wget -qO- $s)"
export PATH="$HOME/.deno/bin:$PATH"

Get python black

pip install black

Then install code_compare:

deno install -n code_compare -Af https://deno.land/x/code_compare/compare.js

How does it work?

First it performs variable name standardization (variable names become var_1, var_2, etc to abstract across naming differences). Note the code maintains its functionality; the variable scope is respected and the names are replaced using full language parsing (not regex find-and-replace)
Comments are removed
A code formatter is used to standardize whitespace/indentation/folding differences
That standardized version of the file is then saved next to the original as ORIGINA_NAME.standardized
Then core analysis begins as a stocastic process:

pick random string chunks (varying length)
see what documents those string chunks can be found in
some chunks only belong to 1 document (perfectly unique)
some chunks belong to 100% of the documents (perfectly commonplace)
the chunk-length is iteratively optimized to neither be unique or commonplace
two documents that share a chunk effectively get a +1 in similarity
once each document’s “top-4 most-similar other documents” have stablized, the process ends

The output is saved into a json file where:

The relativeCounts are the normalized similarity for every pair of documents
The frequencyMatrix is the number-of-chunks-in-common (e.g. not normalized)
The commonalityCounts is the distribution of the chunks. The keys are number-of-documents, and the values are quantity of chunks. There will likely be a lot of chunks for 1 (e.g. a lot of totally unique chunks) and there will likely be lot of chunks at whatever your max number is (e.g. a lot of chunks appear in every document)