## What is this?
A tool for building a similarity matrix across many different files. It effectively ignores aspects that are common to all files by highlighting "rare" similarities (e.g. two files that are similar to each other but dissimilar to the rest of the group). Currently the tool supports Python, but it is fairly generic and JavaScript support is planned.
## How to use
```sh
# quick
code_compare --lang python -- ./file1.py ./file2.py ./file3.py ...

# more options
# certainty = 90 will be faster but reduces accuracy;
# specifically, the run stops as soon as 90% of the documents have a
# stable top-4 (averaged over the last 10 iterations)
code_compare --lang python --output compare.ignore --certainty 90 -- ./file1.py ./file2.py ./file3.py ...
```
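One plausible reading of the `--certainty` stopping rule, sketched in TypeScript; the names `recordTopFour`, `isStable`, and `shouldStop` are hypothetical and not part of the tool itself:

```ts
// Hypothetical sketch of the --certainty stopping rule described above.
// For each document we keep its "top-4 most-similar documents" from the
// last 10 iterations; the run may stop once that set has been stable for
// the whole window for at least `certainty`% of the documents.

type DocId = string;

/** Most recent top-4 sets per document, newest last (window of 10). */
const topFourHistory = new Map<DocId, Set<DocId>[]>();

function recordTopFour(doc: DocId, topFour: Set<DocId>, window = 10): void {
  const history = topFourHistory.get(doc) ?? [];
  history.push(topFour);
  if (history.length > window) history.shift();
  topFourHistory.set(doc, history);
}

function sameSet(a: Set<DocId>, b: Set<DocId>): boolean {
  return a.size === b.size && [...a].every((id) => b.has(id));
}

/** A document is "stable" if its top-4 has not changed over the whole window. */
function isStable(doc: DocId, window = 10): boolean {
  const history = topFourHistory.get(doc) ?? [];
  if (history.length < window) return false;
  return history.every((s) => sameSet(s, history[0]));
}

/** Stop once `certainty`% of documents have a stable top-4. */
function shouldStop(docs: DocId[], certainty = 90): boolean {
  const stable = docs.filter((d) => isStable(d)).length;
  return (stable / docs.length) * 100 >= certainty;
}
```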
## How to install

- Get Deno:
```sh
s=https://deno.land/install.sh;sh -s v1.36.1 <<<"$(curl -fsSL $s || wget -qO- $s)"
export PATH="$HOME/.deno/bin:$PATH"
```
- Get Python's `black` formatter:
```sh
pip install black
```
- Then install `code_compare`:
```sh
deno install -n code_compare -Af https://deno.land/x/code_compare/compare.js
```

## How does it work?
- First it performs variable name standardization (variable names become `var_1`, `var_2`, etc. to abstract across naming differences). The code keeps its functionality: variable scope is respected and the names are replaced using full language parsing (not regex find-and-replace)
- Comments are removed
- A code formatter is used to standardize whitespace/indentation/folding differences
- That standardized version of the file is then saved next to the original as `ORIGINAL_NAME.standardized`
- Then core analysis begins as a stochastic process (see the sketch after this list):
  - pick random string chunks (of varying length)
  - see which documents those string chunks can be found in
    - some chunks belong to only 1 document (perfectly unique)
    - some chunks belong to 100% of the documents (perfectly commonplace)
  - the chunk length is iteratively optimized to be neither unique nor commonplace
  - two documents that share a chunk effectively get a +1 in similarity
  - once each document's "top-4 most-similar other documents" has stabilized, the process ends
- The output is saved into a JSON file (an illustrative layout is sketched below) where:
  - `relativeCounts` is the normalized similarity for every pair of documents
  - `frequencyMatrix` is the number of chunks in common (i.e. not normalized)
  - `commonalityCounts` is the distribution of the chunks. The keys are number-of-documents and the values are quantities of chunks. There will likely be a lot of chunks at `1` (i.e. a lot of totally unique chunks), and there will likely be a lot of chunks at whatever your max number is (i.e. a lot of chunks that appear in every document)
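To make the stochastic process above concrete, here is a rough sketch of a single sampling iteration. This is not the actual `compare.js` implementation; the names `sampleChunk` and `oneIteration` are hypothetical, and the chunk-length tuning and stopping logic are omitted (see the earlier sketch for the stopping rule).

```ts
// Hypothetical sketch of one sampling pass over the standardized files.
// `docs` maps a file name to its standardized text.

function sampleChunk(text: string, length: number): string {
  const start = Math.floor(Math.random() * Math.max(1, text.length - length));
  return text.slice(start, start + length);
}

function oneIteration(
  docs: Map<string, string>,
  chunkLength: number,
  frequencyMatrix: Map<string, number>,   // key: "docA|docB" -> shared-chunk count
  commonalityCounts: Map<number, number>, // key: #docs containing a chunk -> #chunks
): void {
  for (const [, text] of docs) {
    const chunk = sampleChunk(text, chunkLength);
    const owners = [...docs.keys()].filter((name) => docs.get(name)!.includes(chunk));

    // Track the distribution: how many documents contain this chunk?
    commonalityCounts.set(owners.length, (commonalityCounts.get(owners.length) ?? 0) + 1);

    // In this sketch, chunks found in every document (or in only one) carry
    // no signal and are skipped -- one way to realize the "ignore what is
    // common to all files" behaviour described above.
    if (owners.length <= 1 || owners.length === docs.size) continue;

    // Every pair of documents sharing this (rarer) chunk gets a +1.
    for (let i = 0; i < owners.length; i++) {
      for (let j = i + 1; j < owners.length; j++) {
        const key = [owners[i], owners[j]].sort().join("|");
        frequencyMatrix.set(key, (frequencyMatrix.get(key) ?? 0) + 1);
      }
    }
  }
}
```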
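For reference, the output JSON might be shaped roughly like this. The key names come from the list above, but the exact nesting is an assumption and every value is invented purely to show the layout:

```ts
// Illustrative shape only: the nesting is an assumption, and the numbers
// below are made up to show the layout, not real results.
interface CompareOutput {
  relativeCounts: Record<string, Record<string, number>>;  // normalized pairwise similarity
  frequencyMatrix: Record<string, Record<string, number>>; // raw shared-chunk counts
  commonalityCounts: Record<string, number>;               // #docs containing a chunk -> #chunks
}

const example: CompareOutput = {
  relativeCounts: { "file1.py": { "file2.py": 0.82, "file3.py": 0.11 } },
  frequencyMatrix: { "file1.py": { "file2.py": 41, "file3.py": 6 } },
  commonalityCounts: { "1": 240, "2": 47, "3": 310 },
};
```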