Jaccard 2: Electric Boogaloo
Over the weekend I was once again thinking about Jaccard similarity, and I realized it could be used to sharpen our module boundaries via dependency clustering.
This idea, combined with dependency raking and deleting dead code, would be oh so satisfying.
Apologies for the AI slop; I won't have time to test this for a while, but I wanted to capture the idea.
MODULARIZATION CLUSTERING ALGORITHM
INPUTS
- All .kt files under your module directories (e.g. feature/, library/, app/src/main/)
(skip src/test/, src/androidTest/, src/testFixtures/)
- settings.gradle → module include name per directory
- Every build.gradle → declared deps per module
- All .kt source files → scan once to build ClassName→module index
PRE-STEP: BUILD CLASSNAME→MODULE INDEX
Scan every .kt file in the repo.
For each file, extract all top-level class/interface/object declarations.
Map: fully.qualified.ClassName → gradle_module (derived from file directory path)
This index resolves most import ambiguity in Step 1; anything it cannot resolve falls through the fallback order listed there.
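A minimal sketch of the pre-step, assuming regex matching is good enough for a first pass (a real Kotlin parser would be more reliable). The names `declared_classes`, `module_for_path`, `build_class_index` and the `module_dirs` mapping (directory prefix to Gradle module name) are hypothetical, not part of any existing tool. "Top-level" is approximated as "declaration starts at column 0".

```python
import re
from typing import Optional

PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)", re.MULTILINE)
# No leading whitespace allowed, so indented (nested) declarations are skipped.
DECL_RE = re.compile(
    r"^(?:(?:public|internal|private|abstract|open|sealed|data|enum|annotation|value|fun)\s+)*"
    r"(?:class|interface|object)\s+(\w+)",
    re.MULTILINE,
)


def declared_classes(source: str) -> list[str]:
    """Fully qualified names of column-0 class/interface/object declarations."""
    pkg = PACKAGE_RE.search(source)
    prefix = pkg.group(1) + "." if pkg else ""
    return [prefix + name for name in DECL_RE.findall(source)]


def module_for_path(path: str, module_dirs: dict[str, str]) -> Optional[str]:
    """Gradle module whose directory is the longest prefix of `path`."""
    best = None
    for directory, module in module_dirs.items():
        d = directory.rstrip("/") + "/"
        if path.startswith(d) and (best is None or len(d) > len(best[0])):
            best = (d, module)
    return best[1] if best else None


def build_class_index(sources: dict[str, str], module_dirs: dict[str, str]) -> dict[str, str]:
    """Map fully.qualified.ClassName -> gradle_module for every source file."""
    index = {}
    for path, text in sources.items():
        module = module_for_path(path, module_dirs)
        if module is None:
            continue  # file is outside every known module directory
        for fqcn in declared_classes(text):
            index[fqcn] = module
    return index
```

Known gaps in this sketch: declarations with an annotation on the same line, typealiases, and top-level functions are not indexed, so imports of those would fall through to the later fallbacks.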
STEP 1: BUILD FILE→DEPS MAP
For each .kt production file:
- Parse all import statements
- Resolve each import to a Gradle module using the ClassName→module index
Fallback order:
a. Exact namespace match from android { namespace } in build.gradle
b. ClassName→module index lookup
c. settings.gradle grep
d. Flag as ambiguous and exclude
- Discard noise imports:
kotlin.*, java.*, android.*, javax.inject.*, dagger.*
Output: file → Set<gradle_module> (modules this file depends on)
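Step 1 could be sketched roughly as below. Assumptions on my part: `namespaces` is a hypothetical map from each module's `android { namespace }` value to its module name, "namespace match" is read as "the import starts with that namespace", the settings.gradle grep fallback (c) is left out, and imports that resolve to the file's own module are dropped because they are not cross-module deps.

```python
import re

IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)", re.MULTILINE)
NOISE_PREFIXES = ("kotlin.", "java.", "android.", "javax.inject.", "dagger.")


def resolve_import(fqcn, namespaces, class_index):
    """Return the owning Gradle module, or None if the import stays ambiguous."""
    # (a) longest android namespace that is a package prefix of the import
    best = None
    for ns, module in namespaces.items():
        if fqcn == ns or fqcn.startswith(ns + "."):
            if best is None or len(ns) > len(best[0]):
                best = (ns, module)
    if best is not None:
        return best[1]
    # (b) ClassName -> module index lookup
    return class_index.get(fqcn)


def file_deps(source, own_module, namespaces, class_index):
    """Return (set of modules this file depends on, list of ambiguous imports)."""
    deps, ambiguous = set(), []
    for fqcn in IMPORT_RE.findall(source):
        if fqcn.startswith(NOISE_PREFIXES):
            continue  # ubiquitous platform/DI imports carry no signal
        module = resolve_import(fqcn, namespaces, class_index)
        if module is None:
            ambiguous.append(fqcn)  # (d) flag and exclude
        elif module != own_module:
            deps.add(module)
    return deps, ambiguous
```

The ambiguous list is what feeds the "excluded for manual review" output later on.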
STEP 2: COMPUTE IDF WEIGHTS PER MODULE-DEP
For each gradle_module dep D:
idf(D) = log(total_files / count_of_files_that_import_from_D)
High-frequency deps (Logger, Coroutines, etc.) → near-zero weight
Low-frequency deps (specific feature modules) → high weight
This is TF-IDF applied to imports — files that share rare dependencies
are much stronger co-location signals than files that share ubiquitous ones.
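The IDF formula above is small enough to write out directly. In this sketch `file_deps` is assumed to be the Step 1 output (file path to set of modules), and `idf_weights` is a hypothetical name.

```python
import math
from collections import Counter


def idf_weights(file_deps):
    """idf(D) = log(total_files / number_of_files_that_import_from_D)."""
    total = len(file_deps)
    # Each file counts at most once per dep because deps are sets.
    doc_freq = Counter(dep for deps in file_deps.values() for dep in deps)
    return {dep: math.log(total / count) for dep, count in doc_freq.items()}
```

A dep imported by every file gets log(1) = 0, so it contributes nothing to the similarity scores in the later steps.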
STEP 3: MODULE-LEVEL CLUSTERING (coarse pass)
For each Gradle module M, build a weighted dep vector:
module_dep_vector(M) = { dep → idf(dep) } for every dep imported by any file in M
Compute weighted Jaccard between every pair of modules in different
parent directories:
jaccard(M1, M2) = Σ idf(shared_deps) / Σ idf(union_deps)
Find connected components at threshold >= 0.5.
Output: module-clusters (groups of Gradle modules with heavy dep overlap)
Why coarse pass first: file-level Jaccard on a large repo means an O(n²) pairwise comparison matrix.
Narrowing to candidate module-clusters first keeps the fine pass tractable.
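One way the coarse pass might look, assuming `module_deps` maps each module to the set of deps its files import and `parent_dir` maps each module to its parent directory (both hypothetical inputs). Connected components are found with a small union-find; singletons are dropped since they are not clusters.

```python
def weighted_jaccard(a, b, idf):
    """Sum of idf over shared deps divided by sum of idf over the union."""
    denom = sum(idf.get(d, 0.0) for d in a | b)
    if denom == 0:
        return 0.0  # only zero-weight (ubiquitous) deps: treat as no signal
    return sum(idf.get(d, 0.0) for d in a & b) / denom


def connected_components(items, linked):
    """Group items connected by the symmetric predicate linked(x, y)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, x in enumerate(items):
        for y in items[i + 1:]:
            if linked(x, y):
                parent[find(x)] = find(y)
    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return [g for g in groups.values() if len(g) > 1]


def cluster_modules(module_deps, parent_dir, idf, threshold=0.5):
    modules = sorted(module_deps)

    def linked(m1, m2):
        # Only modules under different parent directories are compared.
        return (parent_dir[m1] != parent_dir[m2]
                and weighted_jaccard(module_deps[m1], module_deps[m2], idf) >= threshold)

    return connected_components(modules, linked)
```

Because components are transitive, two modules can land in the same cluster without directly clearing the threshold against each other; that is a property of the connected-components choice, not of this sketch.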
STEP 4: FILE-LEVEL CLUSTERING (fine pass)
For each module-cluster from Step 3, run file-level Jaccard
only within those modules (much smaller matrix):
jaccard(A, X) = Σ idf(shared_deps) / Σ idf(union_deps)
Only compare files that are in DIFFERENT Gradle modules.
Find connected components at threshold >= 0.6.
Output: file-clusters (files from 2+ modules with near-identical dep profiles)
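The fine pass differs from the coarse one mainly in the "different Gradle modules only" filter. A sketch under the assumption that `files` holds just the files of one module-cluster, as path to (module, deps); `file_clusters` is a hypothetical name, and the weighted Jaccard is the same formula as above.

```python
from itertools import combinations


def weighted_jaccard(a, b, idf):
    denom = sum(idf.get(d, 0.0) for d in a | b)
    return sum(idf.get(d, 0.0) for d in a & b) / denom if denom else 0.0


def file_clusters(files, idf, threshold=0.6):
    """Connected components over cross-module file pairs above the threshold."""
    adjacent = {path: set() for path in files}
    for p, q in combinations(sorted(files), 2):
        (mod_p, deps_p), (mod_q, deps_q) = files[p], files[q]
        if mod_p == mod_q:
            continue  # same-module pairs are never compared
        if weighted_jaccard(deps_p, deps_q, idf) >= threshold:
            adjacent[p].add(q)
            adjacent[q].add(p)

    seen, clusters = set(), []
    for start in sorted(files):
        if start in seen or not adjacent[start]:
            continue  # already grouped, or linked to nothing
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk of one component
            node = stack.pop()
            component.append(node)
            for nxt in adjacent[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

Every edge crosses a module boundary, so each returned cluster necessarily spans at least two modules.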
STEP 5: SCORE EACH CLUSTER FOR DESTINATION MODULE
For each file-cluster, evaluate every candidate destination module M
(including "new module" as an option):
new_deps_needed(cluster → M) = |union_of_imports(cluster) - current_deps(M)|
caller_cost(cluster) = avg number of modules that import each file in cluster
move_score(cluster → M) = new_deps_needed + 0.5 × caller_cost
Pick M with lowest move_score.
The caller_cost term prevents the algorithm from proposing moves that
would require updating dozens of downstream build.gradle files.
Files with high caller counts are better candidates for extraction
into a new :public module than for relocation.
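The scoring can be sketched as follows. Assumptions: deps are already resolved to module names, `callers_per_file` is a list with the number of importing modules for each file in the cluster, and the "new module" option is modeled as a candidate with an empty dep set (my reading, not stated in the steps above).

```python
def move_score(cluster_deps, dest_deps, callers_per_file, caller_weight=0.5):
    """move_score = new_deps_needed + 0.5 * caller_cost."""
    new_deps_needed = len(cluster_deps - dest_deps)
    caller_cost = sum(callers_per_file) / len(callers_per_file)
    return new_deps_needed + caller_weight * caller_cost


def best_destination(cluster_deps, candidates, callers_per_file):
    """Return (lowest-scoring candidate name, all scores)."""
    scores = {
        name: move_score(cluster_deps, deps, callers_per_file)
        for name, deps in candidates.items()
    }
    return min(scores, key=scores.get), scores
```

One thing the sketch makes visible: as written, caller_cost depends only on the cluster, so it shifts every candidate M by the same amount. It changes how clusters rank against each other, not which M wins for a given cluster.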
STEP 6: ARCHITECTURE RULE VALIDATION
For the top-scored (cluster → M) pair, check ALL of:
[ ] No internal → internal dep introduced between sibling modules
[ ] No library module → feature module dep introduced
[ ] No public module → internal module dep introduced
[ ] Destination module has the correct build plugin
(e.g. needs Hilt support if any @Module present in cluster)
[ ] All files in cluster have consistent Android API requirements
(not mixing JVM-only files with files that use Android APIs)
PASS → proceed to Step 7
FAIL → skip cluster, log reason, continue to next cluster
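The checklist could be encoded as a function that returns the list of failed rules (empty means PASS). Everything about `ModuleInfo` here is hypothetical: it assumes each module can be tagged with a kind, a visibility, a parent directory for sibling detection, and whether Hilt is applied. How to derive those tags from build.gradle is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleInfo:
    kind: str          # "feature" or "library"
    visibility: str    # "public" or "internal"
    parent: str        # parent directory; equal parents means siblings
    has_hilt: bool = False


def check_move(dest_name, new_deps, modules, files_use_android, needs_hilt):
    """Return the reasons a (cluster -> dest) move fails; empty list means PASS."""
    dest = modules[dest_name]
    failures = []
    for dep_name in new_deps:
        dep = modules[dep_name]
        if (dest.visibility == "internal" and dep.visibility == "internal"
                and dest.parent == dep.parent):
            failures.append(f"internal -> internal sibling dep on {dep_name}")
        if dest.kind == "library" and dep.kind == "feature":
            failures.append(f"library -> feature dep on {dep_name}")
        if dest.visibility == "public" and dep.visibility == "internal":
            failures.append(f"public -> internal dep on {dep_name}")
    if needs_hilt and not dest.has_hilt:
        failures.append("destination lacks Hilt support")
    if len(set(files_use_android)) > 1:
        failures.append("cluster mixes JVM-only and Android files")
    return failures
```

Returning reasons rather than a boolean gives the "log reason" part of the FAIL branch for free.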
STEP 7: EMIT MOVE INSTRUCTIONS
For each validated cluster, output:
MOVE <source_file_path>
TO <destination_file_path> (package updated to match destination module)
ADD implementation projects.<dest_module> (to each caller's build.gradle)
ADD implementation projects.<new_dep> (to destination module's build.gradle)
Note: if the file keeps the same package name after moving,
no import updates are needed in callers — only build.gradle changes.
STEP 8: APPLY MOVES AND CHECK CONVERGENCE
Apply the move instructions from Step 7.
Recompute modularity score Q:
Q = (actual intra-module edges - expected intra-module edges) / total edges
If ΔQ < 0.01 vs previous iteration → STOP (local optimum reached)
Otherwise → return to Step 1 with updated file tree
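The formula for Q leaves "expected intra-module edges" undefined. The sketch below fills that in with the usual Newman-style expectation (sum over modules of degree² / 4m, treating import edges as undirected), which is an assumption on my part. `edges` is taken to be a list of (file, file) import pairs and `module_of` a file to module map.

```python
from collections import Counter


def modularity(edges, module_of):
    """Q = (actual intra-module edges - expected intra-module edges) / total edges."""
    m = len(edges)
    if m == 0:
        return 0.0
    intra = sum(1 for a, b in edges if module_of[a] == module_of[b])
    degree = Counter()  # total edge endpoints landing in each module
    for a, b in edges:
        degree[module_of[a]] += 1
        degree[module_of[b]] += 1
    expected = sum(d * d for d in degree.values()) / (4 * m)
    return (intra - expected) / m


def converged(q_prev, q_new, epsilon=0.01):
    """Stop when the improvement in Q falls below epsilon."""
    return (q_new - q_prev) < epsilon
```

Note that `converged` also returns True when Q gets worse, so a caller would probably want to roll back the last batch of moves in that case rather than keep them.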
OUTPUT PER ITERATION
- Ranked list of (cluster, destination, move_score, architecture_valid)
- Concrete file move list with source/destination paths
- build.gradle diffs (deps to add/remove)
- Modularity score delta
- Ambiguous imports that were excluded (for manual review)
NOTES
- Thresholds (0.5 coarse, 0.6 fine) are starting points; tune for your repo size
- The algorithm finds co-location opportunities, not dependency inversion opportunities.
Files that should be decoupled via an interface pattern will appear as high-scoring
clusters but fail architecture validation — treat those as interface extraction signals,
not move signals.
- Running on production code only (no test sources) keeps the signal clean.
Test files often have artificially high Jaccard due to shared mock/fake imports.