Repository Map Pattern: AST + PageRank for Dynamic Code Context

Parse source files with tree-sitter to extract structural symbols, rank them by graph importance, then use binary search to fit the most relevant entries into the agent's available token budget.

The Orientation Problem

In a large codebase, directory listings, file samples, and keyword greps waste tokens on low-signal content. The agent needs to know which functions exist, which classes matter, and how they connect — not implementation details.

The repository map pattern builds a weighted structural overview fitted to a token budget.

Three-Layer Mechanism

The pattern operates in three stages: parse, rank, fit.

graph LR
    A[Source Files] -->|tree-sitter| B[AST Symbols]
    B -->|reference graph| C[PageRank Scores]
    C -->|binary search| D[Token-Fitted Map]
    D --> E[Agent Context]

1. Parse: Tree-Sitter AST Extraction

Tree-sitter parses source into ASTs and extracts structural elements: function signatures, class definitions, method names, and call signatures. Unlike full file reads, this captures what exists without loading implementation bodies.

| Feature | ctags | tree-sitter |
| --- | --- | --- |
| Output | Symbol names only | Full function signatures |
| Installation | External tool required | Bundled via py-tree-sitter-languages |
| Language support | Varies | 33+ languages |
| Structural depth | Flat symbol list | Nested AST with scope |

(Aider blog: Building a better repository map with tree-sitter)
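
The extraction step can be illustrated with a minimal sketch. It swaps tree-sitter for Python's standard-library `ast` module so that it runs without extra dependencies, which means it only handles Python sources. The function name `extract_signatures` and the sample source are hypothetical and are not taken from Aider or RepoMapper.

```python
import ast

# Hypothetical sample file; bodies are present but must not appear in the outline.
SAMPLE = '''
class AuthService:
    def authenticate(self, user_id: str, token: str) -> bool:
        return True

    def refresh_token(self, token: str):
        pass

def helper():
    pass
'''


def extract_signatures(source: str, path: str) -> list[str]:
    """Return an indented outline of classes and functions, without bodies."""
    lines = [path]

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append("  " * depth + f"class {child.name}")
                visit(child, depth + 1)  # descend to pick up methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ast.unparse(child.args)
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                lines.append("  " * depth + f"def {child.name}({args}){returns}")
                # Do not descend into function bodies: they are implementation detail.

    visit(ast.parse(source), 1)
    return lines


for line in extract_signatures(SAMPLE, "src/auth/auth_service.py"):
    print(line)
```

The outline keeps names, parameters, and return annotations and drops every statement inside the function bodies, which is the property the map relies on. A tree-sitter version would follow the same shape across languages, using per-language queries instead of `ast` node types.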

2. Rank: PageRank on the Reference Graph

Source files become nodes in a directed graph, and an edge runs from each file that references a symbol to the file that defines it. Personalized PageRank then scores each node: files being edited get higher weight, heavily referenced symbols rank higher, and the result emphasizes task relevance over sheer size.

PageRank works here because importance propagates through the reference graph: a function referenced by 20 files outranks a helper called once, and symbols referenced by important symbols gain transitively elevated scores. BM25 and recency weighting lack this transitive property. Because the ranking needs no query, the top-ranked symbols surface the architectural spine on their own. (Aider repo map docs)
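
The ranking step can be sketched as a small power-iteration PageRank with a personalization vector. This is an illustrative stand-in written in plain Python, not Aider's code; the function name `personalized_pagerank`, the edge list, and the file paths are all hypothetical, and edge weights are assumed to be positive.

```python
from collections import defaultdict

# Hypothetical reference graph: (referencing_file, defining_file, weight).
EDGES = [
    ("src/auth/middleware.py", "src/auth/auth_service.py", 1.0),
    ("src/api/login.py", "src/auth/auth_service.py", 1.0),
    ("src/auth/auth_service.py", "src/models/user.py", 1.0),
    ("src/api/login.py", "src/utils/fmt.py", 1.0),
]


def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Score files by power iteration; `personalization` maps file -> teleport weight."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)} | set(personalization))
    out_weight = defaultdict(float)
    for src, _, weight in edges:
        out_weight[src] += weight

    # Teleport distribution: uniform unless the caller supplies weights.
    total = sum(personalization.values())
    if total > 0:
        teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    else:
        teleport = {n: 1.0 / len(nodes) for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Files with no outgoing references hand their mass back to the teleport set.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        nxt = {n: (1 - damping + damping * dangling) * teleport[n] for n in nodes}
        for src, dst, weight in edges:
            nxt[dst] += damping * rank[src] * weight / out_weight[src]
        rank = nxt
    return rank


uniform = personalized_pagerank(EDGES, {})
focused = personalized_pagerank(EDGES, {"src/api/login.py": 1.0})
print(sorted(uniform, key=uniform.get, reverse=True))
print(sorted(focused, key=focused.get, reverse=True))
```

With uniform teleport weights, `src/models/user.py` outranks `src/utils/fmt.py` because it inherits score from the heavily referenced `auth_service.py`. Concentrating the personalization on `src/api/login.py`, as if that file were being edited, reverses that ordering, since `fmt.py` is referenced directly from the edited file.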

3. Fit: Binary Search to Token Budget

The get_ranked_tags_map() method binary-searches for the largest number of ranked tags that fits within max_map_tokens (default: 1,024), stopping once the rendered map lands within 15% of the budget. With fewer files in context the map expands; with more files it shrinks. Either way, the agent gets the most important symbols that fit.

(RepoMapper)
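
The fitting step can be sketched as a binary search over how many ranked tags to keep. This is a simplified illustration, not the body of get_ranked_tags_map(): `fit_to_budget`, `render`, and `count_tokens` are hypothetical names, and whitespace-split words stand in for a real tokenizer. It assumes the rendered size grows as more tags are included.

```python
def count_tokens(text: str) -> int:
    # Assumption: word count approximates the model's token count.
    return len(text.split())


def render(tags: list[str]) -> str:
    # Assumption: one tag per line; a real map groups tags under file paths.
    return "\n".join(tags)


def fit_to_budget(ranked_tags, max_map_tokens=1024, tolerance=0.15):
    """Return a prefix of ranked_tags whose rendering fits max_map_tokens."""
    lo, hi = 0, len(ranked_tags)
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        tokens = count_tokens(render(ranked_tags[:mid]))
        if tokens <= max_map_tokens:
            best = mid
            # Stop early once the map is within `tolerance` of the budget.
            if tokens >= (1 - tolerance) * max_map_tokens:
                break
            lo = mid + 1  # room to spare: try keeping more tags
        else:
            hi = mid - 1  # over budget: keep fewer tags
    return ranked_tags[:best]


# Hypothetical ranked tags, highest score first; each renders to 4 words.
TAGS = [f"def func_{i}(x) -> int" for i in range(100)]
fitted = fit_to_budget(TAGS, max_map_tokens=100)
print(len(fitted), count_tokens(render(fitted)))
```

Because the tags arrive sorted by rank, any prefix that fits is also the most important set of that size. The early exit trades a few unused tokens for fewer rendering passes, which matters when rendering and token counting are the expensive part.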

What a Repository Map Looks Like

At different token budgets, the same codebase produces different levels of detail:

# ~200 tokens: top-level structure only
src/auth/auth_service.py
  class AuthService
    def authenticate(user_id, token)
    def refresh_token(token)
src/models/user.py
  class User
    def validate()

# ~800 tokens: expanded with secondary files
src/auth/auth_service.py
  class AuthService
    def authenticate(user_id: str, token: str) -> AuthResult
    def refresh_token(token: str) -> TokenPair
    def revoke_session(session_id: str) -> None
src/auth/middleware.py
  class AuthMiddleware
    def process_request(request) -> Response
src/models/user.py
  class User
    def validate() -> bool
    def to_dict() -> dict
src/models/session.py
  class Session
    def is_expired() -> bool

Higher budget: more files with full type annotations. Lower budget: only the most-referenced symbols.

Benchmark Impact

Aider's system achieved a then-SOTA 26.3% resolve rate on SWE-bench Lite, with 70.3% correct file identification. The map helps the agent locate where to change before deciding what to change. The SWE-bench post credits the repo map but does not isolate its contribution in an ablation; the figure reflects the full Aider stack. (Aider SWE-bench blog post)

Alternative Approaches

Codebase orientation strategies:

| Approach | Mechanism | Best when |
| --- | --- | --- |
| Repository map (tree-sitter + PageRank) | Pre-computed structural index | Large, stable codebases; agent needs cross-file orientation |
| Agentic search (Claude Code) | On-demand Glob, Grep, Read | Frequent changes; freshness matters more than structure |
| Vector embeddings (Cursor, Windsurf) | Semantic similarity search | Natural-language queries against code |

Claude Code skips indexing and uses agentic search — early RAG experiments showed agentic search performed better. (Vadim's blog: Claude Code Doesn't Index Your Codebase) Cursor and Windsurf use vector stores with re-ranking. (Mike Mason: AI Coding Agents in 2026)

MCP Server Availability

The pattern is available as standalone MCP servers, making it tool-agnostic:

  • RepoMapper — Aider's repo map logic as an MCP server; any MCP-compatible agent can request a token-fitted map.
  • mcp-server-tree-sitter — AST-based symbol extraction, dependency graphs, and complexity analysis as MCP tools.
  • Serena — LSP-based approach for symbol-level navigation and editing.

Example

Adding RepoMapper as an MCP server in Claude Code's .mcp.json configuration:

{
  "mcpServers": {
    "repomapper": {
      "command": "npx",
      "args": ["-y", "repomapper-mcp"]
    }
  }
}

Once configured, the agent can request a token-fitted map before making changes:

User: Fix the authentication bug in the login flow.

Agent: I'll start by requesting a repository map to understand the codebase structure.

[calls repomapper.get_repo_map with max_tokens=1024]

Result:
  src/auth/auth_service.py
    class AuthService
      def authenticate(user_id: str, token: str) -> AuthResult
      def refresh_token(token: str) -> TokenPair
  src/auth/middleware.py
    class AuthMiddleware
      def process_request(request) -> Response
  src/models/user.py
    class User
      def validate() -> bool

Agent: The map shows AuthService.authenticate and AuthMiddleware.process_request
are the most-referenced auth symbols. I'll read those files first.

The map consumed 87 tokens instead of the ~12,000 tokens that reading all source files would require. The agent identified the right entry points without scanning the full codebase.

When This Backfires

  • Rapidly-changing codebases: The map is recomputed per session but not per edit. In a monorepo with thousands of commits per day, the parsed AST can be stale within minutes; Claude Code's on-demand agentic search is a better fit because it queries the live filesystem.
  • Heavy metaprogramming: Codebases that generate classes or functions at runtime (Rails method_missing, Python metaclasses, macro-heavy Rust) produce AST symbols that don't reflect runtime structure; PageRank over those symbols misleads rather than orients.
  • Small or flat codebases: A repo with fewer than ~20 files gains nothing from the ranking step — reading all source files fits inside a standard context window and provides richer implementation detail than signatures alone.
  • Large repos with huge token budgets: If the agent context window is already large enough to hold most of the codebase directly, the compression step introduces truncation risk for no gain.

Key Takeaways

  • Tree-sitter extraction + PageRank ranking + binary-search fitting produces a weighted structural overview for any token budget.
  • The map adapts dynamically: expands with few files in context, shrinks with many.
  • There are three codebase orientation approaches (structural indexing, agentic search, vector embeddings); choose among them by codebase size and change frequency.
  • RepoMapper and mcp-server-tree-sitter make the pattern available to any MCP-compatible agent.