Over the weekend I was once again thinking about Jaccard similarity, and I realized that it could be used to sharpen our module boundaries through dependency clustering.

This idea combined with dependency raking and deleting dead code would be oh so satisfying.

Apologies for AI slop, I won't have time to test this for a while, but I wanted to capture the idea.

MODULARIZATION CLUSTERING ALGORITHM

INPUTS

All .kt production files under your module directories (e.g. feature/, library/, app/src/main/), skipping src/test/, src/androidTest/, src/testFixtures/
settings.gradle → module include name per directory
Every build.gradle → declared deps per module
All .kt source files → scan once to build ClassName→module index

PRE-STEP: BUILD CLASSNAME→MODULE INDEX

Scan every .kt file in the repo.
For each file, extract all top-level class/interface/object declarations.
Map: fully.qualified.ClassName → gradle_module (derived from file directory path)
This index resolves most import lookups in Step 1; imports it cannot resolve go through the Step 1 fallbacks or are flagged as ambiguous.
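A minimal sketch of the pre-step in Python, assuming a regex is good enough to find top-level declarations (a real run would probably want a proper Kotlin parser). The `module_for_path` helper is hypothetical: it assumes every module directory maps one-to-one to its Gradle include name, which settings.gradle would have to confirm. Declarations preceded by an annotation on the same line are missed by this regex.

```python
import re

PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)", re.MULTILINE)
# Declarations count as top-level only when they start at column 0.
# `private` is deliberately not an accepted modifier: private declarations
# cannot be imported from another file, so they stay out of the index.
DECL_RE = re.compile(
    r"^(?:(?:public|internal|abstract|open|sealed|data|enum|annotation|value|fun)\s+)*"
    r"(?:class|interface|object)\s+(\w+)",
    re.MULTILINE,
)

def module_for_path(path: str) -> str:
    """Hypothetical mapping: 'feature/cart/src/main/...' -> ':feature:cart'."""
    module_dir = path.split("/src/", 1)[0]
    return ":" + module_dir.strip("/").replace("/", ":")

def build_index(sources: dict[str, str]) -> dict[str, str]:
    """Map fully.qualified.ClassName -> gradle module for every source file.

    `sources` maps a repo-relative file path to that file's text.
    """
    index = {}
    for path, text in sources.items():
        match = PACKAGE_RE.search(text)
        package = match.group(1) if match else ""
        for name in DECL_RE.findall(text):
            fq_name = f"{package}.{name}" if package else name
            index[fq_name] = module_for_path(path)
    return index
```

Nested declarations such as companion objects are skipped because they are indented, which matches the "top-level only" rule above.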

STEP 1: BUILD FILE→DEPS MAP

For each .kt production file:

Parse all import statements
Resolve each import to a Gradle module, taking the first match in this order:
a. Exact namespace match from android { namespace } in build.gradle
b. ClassName→module index lookup
c. settings.gradle grep
d. No match: flag as ambiguous and exclude
Discard noise imports:
kotlin.*, java.*, android.*, javax.inject.*, dagger.*
Output: file → Set<gradle_module> (modules this file depends on)
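Step 1 could look roughly like the following Python sketch. The function name and the two lookup dictionaries (`class_index`, `namespaces`) are my own assumptions for illustration; the settings.gradle grep fallback is left out, so anything the first two lookups miss is simply reported as ambiguous.

```python
import re

# Captures the imported name; a trailing ".*" or "as Alias" is dropped.
IMPORT_RE = re.compile(
    r"^import\s+([\w.]+)(?:\.\*)?(?:\s+as\s+\w+)?\s*$", re.MULTILINE
)
NOISE_PREFIXES = ("kotlin.", "java.", "android.", "javax.inject.", "dagger.")

def resolve_file_deps(source, class_index, namespaces):
    """Return (set of gradle modules, list of unresolved imports) for one file.

    class_index: fully.qualified.ClassName -> module (from the pre-step)
    namespaces:  android namespace -> module (from each build.gradle)
    """
    deps, ambiguous = set(), []
    for fq_name in IMPORT_RE.findall(source):
        if fq_name.startswith(NOISE_PREFIXES):
            continue  # ubiquitous platform/DI imports carry no signal
        package = fq_name.rpartition(".")[0]
        module = (
            namespaces.get(fq_name)      # a. wildcard import of a whole namespace
            or namespaces.get(package)   # a. class sitting directly in a namespace
            or class_index.get(fq_name)  # b. ClassName->module index
        )
        if module is None:
            ambiguous.append(fq_name)    # d. excluded, kept for manual review
        else:
            deps.add(module)
    return deps, ambiguous
```

Note that the prefix check uses a trailing dot, so `kotlinx.*` and `androidx.*` imports are not treated as noise here; whether they should be is a judgment call for your repo.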

STEP 2: COMPUTE IDF WEIGHTS PER MODULE-DEP

For each gradle_module dep D:
idf(D) = log(total_files / count_of_files_that_import_from_D)

High-frequency deps (Logger, Coroutines, etc.) → near-zero weight
Low-frequency deps (specific feature modules) → high weight

This is the IDF half of TF-IDF applied to imports (a dep is either imported by a file or not, so there is no term frequency): files that share rare dependencies
are much stronger co-location signals than files that share ubiquitous ones.
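As a sketch, the weights fall out of a document-frequency count over the Step 1 output. The function name and the sample module names are made up for illustration.

```python
import math
from collections import Counter

def idf_weights(file_deps):
    """file_deps: file path -> set of gradle modules that file imports from."""
    total_files = len(file_deps)
    # Number of files that import from each module at least once.
    doc_freq = Counter(dep for deps in file_deps.values() for dep in deps)
    return {dep: math.log(total_files / n) for dep, n in doc_freq.items()}

file_deps = {
    "A.kt": {":library:logger", ":feature:payments"},
    "B.kt": {":library:logger"},
    "C.kt": {":library:logger"},
    "D.kt": {":library:logger"},
}
weights = idf_weights(file_deps)
```

Here `:library:logger` is imported by all four files, so its weight is log(4/4) = 0, while `:feature:payments` is imported by one file and gets log(4/1), about 1.39.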

STEP 3: MODULE-LEVEL CLUSTERING (coarse pass)

For each Gradle module M, build a weighted dep vector:
module_dep_vector(M) = { dep → idf(dep) } for every dep imported by any file in M

Compute weighted Jaccard between every pair of modules in different
parent directories:
jaccard(M1, M2) = Σ idf(shared_deps) / Σ idf(union_deps)

Link every module pair with jaccard >= 0.5, then find the connected components of that graph.
Output: module-clusters (groups of Gradle modules with heavy dep overlap)

Why coarse pass first: file-level Jaccard over a large repo needs an O(n²) pairwise comparison matrix.
Narrowing to candidate module-clusters first keeps the fine pass tractable.
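The coarse pass might be sketched like this in Python. I am assuming that "different parent directories" can be read off the Gradle path (`:feature:cart` has parent `:feature`); the function names and sample weights are illustrative only.

```python
from itertools import combinations

def weighted_jaccard(deps_a, deps_b, idf):
    """Sum of idf over shared deps divided by sum of idf over the union."""
    denom = sum(idf.get(d, 0.0) for d in deps_a | deps_b)
    if denom == 0:
        return 0.0  # nothing but zero-weight deps: no signal either way
    return sum(idf.get(d, 0.0) for d in deps_a & deps_b) / denom

def module_deps(file_deps, file_module):
    """Collapse file -> deps into module -> union of deps of its files."""
    out = {}
    for path, deps in file_deps.items():
        out.setdefault(file_module[path], set()).update(deps)
    return out

def coarse_edges(mod_deps, idf, threshold=0.5):
    """Module pairs in different parent directories with jaccard >= threshold."""
    edges = []
    for m1, m2 in combinations(sorted(mod_deps), 2):
        if m1.rsplit(":", 1)[0] == m2.rsplit(":", 1)[0]:
            continue  # same parent directory, skip
        score = weighted_jaccard(mod_deps[m1], mod_deps[m2], idf)
        if score >= threshold:
            edges.append((m1, m2, score))
    return edges

idf = {":lib:log": 0.0, ":lib:db": 2.0, ":lib:net": 1.0, ":lib:ui": 3.0}
mod_deps = {
    ":feature:cart": {":lib:log", ":lib:db", ":lib:net"},
    ":library:orders": {":lib:log", ":lib:db"},
    ":feature:search": {":lib:log", ":lib:ui"},
}
edges = coarse_edges(mod_deps, idf)
```

In this toy data `:feature:cart` and `:library:orders` share `:lib:db` (weight 2.0) out of a union weight of 3.0, so they score 2/3 and are linked; the shared zero-weight logger contributes nothing. The resulting edges can then go to any connected-components routine.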

STEP 4: FILE-LEVEL CLUSTERING (fine pass)

For each module-cluster from Step 3, run file-level Jaccard
only within those modules (much smaller matrix):
jaccard(A, X) = Σ idf(shared_deps) / Σ idf(union_deps)

Only compare files that are in DIFFERENT Gradle modules.
Link every cross-module file pair with jaccard >= 0.6, then find the connected components of that graph.
Output: file-clusters (files from 2+ modules with near-identical dep profiles)
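One way to get the connected components is a small union-find, sketched below in Python. The `similarity` callable stands in for the weighted Jaccard of two files and is an assumption of this sketch, as are the function and file names.

```python
from itertools import combinations

def cluster_files(files, file_module, similarity, threshold=0.6):
    """Connected components over cross-module file pairs above the threshold."""
    parent = {f: f for f in files}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(files, 2):
        if file_module[a] == file_module[b]:
            continue  # only compare files in DIFFERENT Gradle modules
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    # Keep only clusters that span two or more modules.
    return [
        sorted(g) for g in groups.values()
        if len({file_module[f] for f in g}) >= 2
    ]

file_module = {"A.kt": ":m1", "B.kt": ":m2", "C.kt": ":m1", "D.kt": ":m3"}
scores = {frozenset(("A.kt", "B.kt")): 0.8, frozenset(("B.kt", "C.kt")): 0.7}
clusters = cluster_files(
    list(file_module), file_module,
    lambda a, b: scores.get(frozenset((a, b)), 0.0),
)
```

A.kt and C.kt are never compared directly because they share a module, but both link to B.kt, so all three land in one cluster; D.kt has no strong link and is dropped.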

STEP 5: SCORE EACH CLUSTER FOR DESTINATION MODULE

For each file-cluster, evaluate every candidate destination module M
(including "new module" as an option):

new_deps_needed(cluster → M) = |union_of_deps(cluster) - current_deps(M)| (set difference over the Gradle modules resolved in Step 1)
caller_cost(cluster) = avg number of modules that import each file in cluster
move_score(cluster → M) = new_deps_needed + 0.5 × caller_cost

Pick M with lowest move_score.

The caller_cost term prevents the algorithm from proposing moves that
would require updating dozens of downstream build.gradle files.
Files with high caller counts are better candidates for extraction
into a new :public module than for relocation.
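A rough Python sketch of the scoring, with made-up module names. One assumption beyond the formulas above: the destination itself is not counted as a new dep, since a module never needs to depend on itself.

```python
def move_score(cluster_deps, callers, dest, dest_deps):
    """cluster_deps: file -> set of modules it imports from.
    callers: file -> set of modules that import that file.
    """
    needed = set().union(*cluster_deps.values()) - dest_deps - {dest}
    caller_cost = sum(len(c) for c in callers.values()) / len(callers)
    return len(needed) + 0.5 * caller_cost

def best_destination(cluster_deps, callers, candidates):
    """candidates: module -> its current deps; a new module has an empty set."""
    scored = {
        dest: move_score(cluster_deps, callers, dest, deps)
        for dest, deps in candidates.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]

cluster_deps = {
    "A.kt": {":lib:net", ":lib:db"},
    "B.kt": {":lib:net", ":feature:cart"},
}
callers = {"A.kt": {":app"}, "B.kt": {":app", ":feature:home", ":feature:search"}}
candidates = {
    ":feature:cart": {":lib:net"},
    ":feature:home": {":lib:db"},
    "<new module>": set(),
}
best = best_destination(cluster_deps, callers, candidates)
```

With these numbers `:feature:cart` only needs `:lib:db` added (1 new dep) plus 0.5 × 2 average callers, for a score of 2.0. As the formula is written, caller_cost is the same for every destination of a given cluster, so it shifts the ranking between clusters rather than the choice of M within one.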

STEP 6: ARCHITECTURE RULE VALIDATION

For the top-scored (cluster → M) pair, check ALL of:
[ ] No internal → internal dep introduced between sibling modules
[ ] No library module → feature module dep introduced
[ ] No public module → internal module dep introduced
[ ] Destination module has the correct build plugin
(e.g. needs Hilt support if any @Module present in cluster)
[ ] All files in cluster have consistent Android API requirements
(not mixing JVM-only files with files that use Android APIs)

PASS → proceed to Step 7
FAIL → skip cluster, log reason, continue to next cluster
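The first three checks can be approximated from module paths alone. The Python sketch below assumes a naming convention that may not match your repo: the first path segment is `feature` or `library`, the last segment is `public` or `internal`, and "sibling" means two internal modules under the same top-level group. The build plugin and Android API checks need build.gradle and source inspection and are not covered.

```python
def _parts(module):
    return module.strip(":").split(":")

def violations(dest, new_deps):
    """Return human-readable reasons why adding new_deps to dest breaks a rule."""
    reasons = []
    d = _parts(dest)
    for dep in sorted(new_deps):
        p = _parts(dep)
        if d[0] == "library" and p[0] == "feature":
            reasons.append(f"library -> feature: {dest} -> {dep}")
        if d[-1] == "public" and p[-1] == "internal":
            reasons.append(f"public -> internal: {dest} -> {dep}")
        if (d[-1] == "internal" and p[-1] == "internal"
                and d[:-2] == p[:-2] and d != p):
            reasons.append(f"internal -> internal sibling: {dest} -> {dep}")
    return reasons

def validate(dest, new_deps):
    """PASS when no rule fires; otherwise FAIL with the reasons to log."""
    reasons = violations(dest, new_deps)
    return ("PASS", []) if not reasons else ("FAIL", reasons)
```

Returning the reasons rather than a bare boolean makes the "log reason, continue to next cluster" branch straightforward.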

STEP 7: EMIT MOVE INSTRUCTIONS

For each validated cluster, output:

MOVE <source_file_path>
TO <destination_file_path> (package updated to match destination module)
ADD implementation projects.<dest_module> (to each caller's build.gradle)
ADD implementation projects.<new_dep> (to destination module's build.gradle)

Note: if the file keeps the same package name after moving,
no import updates are needed in callers — only build.gradle changes.

STEP 8: APPLY MOVES AND CHECK CONVERGENCE

Apply the move instructions from Step 7.

Recompute modularity score Q:
Q = (actual intra-module edges - expected intra-module edges) / total edges

If ΔQ < 0.01 vs previous iteration → STOP (local optimum reached)
Otherwise → return to the pre-step with the updated file tree (moved files make the ClassName→module index stale)
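The Q formula leaves "expected intra-module edges" undefined. The Python sketch below fills that in with the usual Newman modularity null model on an undirected graph, where a module with total degree d is expected to hold d² / (4 × total edges) internal edges. Treating file-to-file imports as undirected edges is my assumption, as are the function names.

```python
from collections import defaultdict

def modularity(edges, module_of):
    """edges: list of (file_a, file_b) import edges, treated as undirected.
    module_of: file -> gradle module.
    """
    total = len(edges)
    if total == 0:
        return 0.0
    intra = 0
    degree = defaultdict(int)  # sum of edge endpoints per module
    for a, b in edges:
        degree[module_of[a]] += 1
        degree[module_of[b]] += 1
        if module_of[a] == module_of[b]:
            intra += 1
    expected = sum(d * d for d in degree.values()) / (4 * total)
    return (intra - expected) / total

def should_stop(previous_q, current_q, epsilon=0.01):
    """Stop once an iteration improves Q by less than epsilon."""
    return current_q - previous_q < epsilon

edges = [("a", "b"), ("c", "d")]
q_split = modularity(edges, {"a": ":m1", "b": ":m1", "c": ":m2", "d": ":m2"})
q_mixed = modularity(edges, {"a": ":m1", "b": ":m2", "c": ":m2", "d": ":m2"})
```

With both edges inside their own module, Q is (2 - 1) / 2 = 0.5; moving `b` into `:m2` cuts one edge across a boundary and Q drops to (1 - 1.25) / 2 = -0.125.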

OUTPUT PER ITERATION

Ranked list of (cluster, destination, move_score, architecture_valid)
Concrete file move list with source/destination paths
build.gradle diffs (deps to add/remove)
Modularity score delta
Ambiguous imports that were excluded (for manual review)

NOTES

Thresholds (0.5 coarse, 0.6 fine) are starting points; tune for your repo size
The algorithm finds co-location opportunities, not dependency inversion opportunities.
Files that should be decoupled via an interface pattern will appear as high-scoring
clusters but fail architecture validation — treat those as interface extraction signals,
not move signals.
Running on production code only (no test sources) keeps the signal clean.
Test files often have artificially high Jaccard due to shared mock/fake imports.
