What is this?
A tool for creating a dependency matrix between many different files. Things that are common to the whole group of files are effectively ignored. Currently the tool supports Python, but it is fairly generic and JavaScript support is planned for the future.
How does it work?
- First it performs variable name standardization (variable names become var_1, var_2, etc. to abstract across naming differences). Note that the code keeps its functionality because variable scope is respected (full language parsing, not regex find-and-replace)
- Comments are removed
- A code formatter is used to standardize any whitespace/indentation/folding differences
- The core step is a stochastic process:
  - pick random string chunks (of varying length)
  - see which documents those string chunks belong to
  - chunks that belong to only 1 document are "boring"
  - chunks that belong to 100% of the documents are "boring"
  - the chunk length is optimized to prefer non-boring chunks
  - two documents that share a chunk are considered more similar
  - once each document's "top-4 most-similar other documents" have stabilized, the process ends
- The output is saved into a JSON file where:
  - The relativeCounts can be thought of as simply a similarity score for every pair
  - The frequencyMatrix is the number-of-chunks-in-common for every pairwise combination
  - The commonalityCounts is the distribution of the chunks. The keys are number-of-documents, and the values are quantity of chunks. There will likely be a lot of chunks for 1 (e.g. a lot of totally unique chunks) and there will likely be a lot of chunks at whatever your max number is (e.g. a lot of chunks appear in every document)
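The core stochastic step can be sketched roughly as below. This is a minimal illustration in Python, not the tool's actual implementation: the function names (sample_chunk, top_neighbours, compare), the parameters (min_len, max_len, check_every, max_rounds, seed) and the return shapes are all assumptions made for this example. It skips the chunk-length optimization and simply draws lengths uniformly. The two returned counters are only loosely analogous to frequencyMatrix and commonalityCounts.

```python
import random
from collections import Counter
from itertools import combinations


def sample_chunk(docs, rng, min_len=4, max_len=12):
    """Return a random substring of random length from a random document."""
    text = docs[rng.choice(sorted(docs))]
    length = min(rng.randint(min_len, max_len), len(text))
    start = rng.randint(0, len(text) - length)
    return text[start:start + length]


def top_neighbours(frequency, names, k):
    """For each document, list the k documents it shares the most chunks with."""
    tops = {}
    for name in names:
        scores = Counter()
        for (a, b), count in frequency.items():
            if a == name:
                scores[b] += count
            elif b == name:
                scores[a] += count
        tops[name] = tuple(other for other, _ in scores.most_common(k))
    return tops


def compare(docs, top_k=4, check_every=200, max_rounds=20000, seed=0):
    """Hypothetical sketch: docs maps a file name to its normalized source text."""
    rng = random.Random(seed)
    frequency = Counter()    # (doc_a, doc_b) -> number of shared chunks
    commonality = Counter()  # number of owning documents -> number of chunks
    previous = None
    for round_number in range(1, max_rounds + 1):
        chunk = sample_chunk(docs, rng)
        owners = sorted(name for name, text in docs.items() if chunk in text)
        commonality[len(owners)] += 1
        # "Boring" chunks (in exactly one document, or in all of them) are skipped.
        if 1 < len(owners) < len(docs):
            for pair in combinations(owners, 2):
                frequency[pair] += 1
        # Stop once every document's top-k neighbours stop changing between checks.
        if round_number % check_every == 0:
            current = top_neighbours(frequency, docs, top_k)
            if current == previous:
                break
            previous = current
    return frequency, commonality
```

Calling compare({"a.py": text_a, "b.py": text_b, "c.py": text_c}) returns the pair counts and the chunk distribution. Pair keys are name tuples in sorted order, so the count for a.py and b.py is stored under ("a.py", "b.py").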
How to install
- Get Deno:
s=https://deno.land/install.sh;sh -s v1.36.1 <<<"$(curl -fsSL $s || wget -qO- $s)"
export PATH="$HOME/.deno/bin:$PATH"
- Get python black
pip install black
- Then install code_compare:
deno install -n code_compare -A https://deno.land/x/code_compare/compare.js
How to use
# quick
code_compare --lang python -- ./file1.py ./file2.py ./file3.py ...
# more options
# certainty = 1 means fastest processing time, but least-accurate
code_compare --lang python --output compare.ignore --certainty 10 -- ./file1.py ./file2.py ./file3.py ...