You've got 500 documents. A client asks about "that contract we signed with Acme last year." You search. Instantly, you get the contract, the email thread that preceded it, the related SOW, and the person who negotiated it. That's not magic—it's a knowledge graph that understands relationships between documents, people, projects, and concepts in real time.
Most document systems treat files as isolated objects. Search returns a ranked list. You dig through results. AiFiler's knowledge graph does something different: it maps the relationships between everything, then uses those relationships to make search results contextual and fast.
The Core Problem We Solved
Traditional document management systems face a latency wall when trying to surface related information. Here's why:
- Naive relational queries require multiple database hits (find document → find related documents → find people → find projects)
- Full-text search alone doesn't understand context (searching "Acme" returns 47 documents; you still don't know which ones matter)
- Post-search enrichment adds 500ms–2s of latency as the system fetches relationships after ranking
We needed something that could answer "show me everything related to this contract" in under 200ms, even with thousands of documents.
The Architecture: 8 Edge Types, One Unified Graph
The AiFiler knowledge graph is built on a simple principle: edges matter more than nodes. Documents are nodes. Relationships are edges. We defined eight types of edges that capture how documents actually relate to each other in the real world:
Document-to-Document Edges:
- REFERENCES: Document A mentions Document B (extracted via Claude)
- RELATED_TO: Documents share semantic similarity (embeddings-based)
- PRECEDES: Document A is a predecessor to Document B (temporal/logical ordering)
Document-to-Person Edges:
- AUTHORED_BY: A person created the document
- MENTIONED_IN: A person is referenced in the document
- ASSIGNED_TO: A document is assigned to a person for action
Document-to-Concept Edges:
- TAGGED_WITH: Manual or AI-generated tags
- CLASSIFIED_AS: Document type classification (contract, proposal, email, etc.)
This isn't a complex graph. It's deliberately minimal. Each edge type answers a specific question: "Who created this?" "What does it reference?" "What type of document is it?" The simplicity is the strength—it means we can index and traverse the graph in near-constant time.
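As a rough illustration, the eight edge types could be modeled as a closed TypeScript union, with a lookup that records what kind of node sits at the non-document end of each edge. The names `NodeKind`, `EdgeType`, `EDGE_TARGET_KIND`, and `edgeTypesFor` are hypothetical; only the eight edge labels come from the list above.

```typescript
// Hypothetical type names; only the eight edge labels come from the article.
type NodeKind = "document" | "person" | "concept";

type EdgeType =
  | "REFERENCES" | "RELATED_TO" | "PRECEDES"
  | "AUTHORED_BY" | "MENTIONED_IN" | "ASSIGNED_TO"
  | "TAGGED_WITH" | "CLASSIFIED_AS";

// One end of every edge is a document; this maps each edge type
// to the kind of node at the other end.
const EDGE_TARGET_KIND: Record<EdgeType, NodeKind> = {
  REFERENCES: "document",
  RELATED_TO: "document",
  PRECEDES: "document",
  AUTHORED_BY: "person",
  MENTIONED_IN: "person",
  ASSIGNED_TO: "person",
  TAGGED_WITH: "concept",
  CLASSIFIED_AS: "concept",
};

// List the edge types that connect a document to the given kind of node.
function edgeTypesFor(kind: NodeKind): EdgeType[] {
  return (Object.keys(EDGE_TARGET_KIND) as EdgeType[]).filter(
    (t) => EDGE_TARGET_KIND[t] === kind
  );
}
```

Because the union is closed, the compiler rejects any edge label outside these eight, which mirrors the deliberately minimal model.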
How Edges Get Built: The Ingestion Pipeline
When you upload a document to AiFiler, it doesn't just get stored. It enters a pipeline that builds edges:
1. File Parsing (lib/ingest/parseFile.ts)
↓
2. Embedding Generation (Claude with @anthropic-ai/claude-agent-sdk)
↓
3. Entity Extraction (Intent-driven via lib/intelligence/intentHandlers.ts)
↓
4. Edge Creation (Supabase insert with conflict handling)
↓
5. Graph Indexing (SWR cache invalidation + localStorage prefixing)
File parsing extracts text, metadata, and structure from DOCX, XLSX, PPTX, and PDF files. We use @sparticuz/chromium for PDF rendering to ensure we capture visual content accurately.
Embedding generation converts document text into vectors. These embeddings power the RELATED_TO edge—two documents are related if their embeddings are close in vector space. This happens asynchronously; we don't block the upload.
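The article does not say which distance metric or cutoff decides that two embeddings are "close". A minimal sketch, assuming cosine similarity and an illustrative threshold of 0.8 (both are assumptions, as are the names `cosineSimilarity`, `RELATED_THRESHOLD`, and `isRelated`):

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assumed cutoff for creating a RELATED_TO edge; not a value from the article.
const RELATED_THRESHOLD = 0.8;

// Decide whether two document embeddings are close enough to link.
function isRelated(
  a: number[],
  b: number[],
  threshold: number = RELATED_THRESHOLD
): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```

Running this check off the upload path, as described above, means a slow comparison never delays the upload itself.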
Entity extraction is where Claude earns its keep. We use the UnifiedCommand system (87 intent handlers in lib/intelligence/intentHandlers.ts) to extract:
- Who is mentioned in the document (people)
- What documents it references
- What project or client it belongs to
- What type of document it is
Edge creation writes relationships to Supabase with conflict handling. If an edge already exists, we update metadata (like confidence scores or timestamps). This is where we avoid the "duplicate edge" problem that plagues naive graph systems.
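The conflict handling described here amounts to an upsert keyed on (type, source, target). The sketch below shows that behavior with an in-memory map standing in for the database table; `Edge` and `EdgeStore` are hypothetical names, and in PostgreSQL the equivalent would be a unique constraint on those three columns combined with `INSERT ... ON CONFLICT ... DO UPDATE`.

```typescript
// Hypothetical edge record: type, endpoints, and the two metadata fields.
interface Edge {
  type: string;
  source: string;
  target: string;
  confidence: number;
  updatedAt: number; // epoch milliseconds
}

// In-memory stand-in for an edges table with a unique (type, source, target) key.
class EdgeStore {
  private edges = new Map<string, Edge>();

  private static key(type: string, source: string, target: string): string {
    // JSON encoding avoids collisions when ids contain separator characters.
    return JSON.stringify([type, source, target]);
  }

  // Insert a new edge, or refresh the metadata of an existing one.
  upsert(edge: Edge): "inserted" | "updated" {
    const k = EdgeStore.key(edge.type, edge.source, edge.target);
    const existed = this.edges.has(k);
    this.edges.set(k, { ...edge });
    return existed ? "updated" : "inserted";
  }

  get(type: string, source: string, target: string): Edge | undefined {
    return this.edges.get(EdgeStore.key(type, source, target));
  }

  get size(): number {
    return this.edges.size;
  }
}
```

Re-ingesting the same document then updates confidence and timestamp in place instead of adding a second row for the same relationship.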
Graph indexing uses AiFiler's SWR-based caching layer (lib/workspace/ hooks) with localStorage prefixing. When a new edge is created, we invalidate the cache for that document's neighborhood, forcing a fresh fetch on the next search.
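To make the invalidation step concrete, here is a small sketch of a prefix-keyed neighborhood cache. A plain `Map` stands in for the SWR and localStorage layers, and the prefix string, the `NeighborhoodCache` class, and the choice to invalidate both endpoints of a new edge are all assumptions for illustration.

```typescript
// Hypothetical key prefix; the real prefix is not given in the article.
const CACHE_PREFIX = "graph:neighborhood:";

// In-memory stand-in for a cache of each document's neighborhood.
class NeighborhoodCache<T> {
  private entries = new Map<string, T>();

  set(docId: string, value: T): void {
    this.entries.set(CACHE_PREFIX + docId, value);
  }

  get(docId: string): T | undefined {
    return this.entries.get(CACHE_PREFIX + docId);
  }

  // A new edge changes the neighborhood of both endpoints,
  // so drop both entries and let the next read refetch them.
  invalidateForEdge(source: string, target: string): void {
    this.entries.delete(CACHE_PREFIX + source);
    this.entries.delete(CACHE_PREFIX + target);
  }
}
```

Only the touched neighborhoods are dropped; every other cached entry keeps serving reads.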
The Query Path: From Search to Results
When you search in AiFiler, the query doesn't just hit full-text search. It traverses the graph:
User Query: "Acme contract"
↓
Intent Router (UniversalRouter)
↓
Parse Intent: SEARCH_DOCUMENTS
↓
Execute Search:
  1. Full-text match on "Acme" + "contract"
  2. Rank by relevance score
  3. For each result, fetch neighbors:
     - Documents it REFERENCES
     - Documents that REFERENCE it
     - People MENTIONED_IN it
     - Related documents (RELATED_TO)
  4. Rerank by edge weight
↓
Return enriched results with context
The key insight: we don't fetch the entire graph. We fetch only the neighborhood of each result, typically 5–10 documents within one or two hops. This keeps latency under 200ms even with large graphs.
The reranking step is crucial. A document that's directly referenced by your search result gets boosted. A document by the same author gets a smaller boost. This is why searching "Acme contract" returns not just the contract, but the SOW and the negotiation emails—they're in the neighborhood.
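The neighbor fetch and rerank can be sketched roughly as follows. This is a simplified illustration, not AiFiler's ranking code: it covers only the two document-to-document edge types, leaves out person edges and the same-author boost, and the weights in `BOOST` are made-up values. `Scored`, `DocEdge`, and `enrichAndRerank` are hypothetical names.

```typescript
type DocEdgeType = "REFERENCES" | "RELATED_TO";

interface DocEdge {
  type: DocEdgeType;
  source: string;
  target: string;
}

interface Scored {
  id: string;
  score: number;
}

// Assumed weights: a direct reference counts for more than semantic similarity.
const BOOST: Record<DocEdgeType, number> = {
  REFERENCES: 0.5,
  RELATED_TO: 0.3,
};

// Pull one-hop neighbors of each full-text hit into the result set,
// scoring each neighbor as a fraction of the hit that led to it.
function enrichAndRerank(hits: Scored[], edges: DocEdge[]): Scored[] {
  const scores = new Map<string, number>();
  hits.forEach((h) => scores.set(h.id, h.score));

  hits.forEach((hit) => {
    edges.forEach((e) => {
      // Follow the edge in either direction (REFERENCES and referenced-by).
      let neighbor: string | null = null;
      if (e.source === hit.id) neighbor = e.target;
      else if (e.target === hit.id) neighbor = e.source;
      if (neighbor === null) return;

      const boosted = hit.score * BOOST[e.type];
      scores.set(neighbor, (scores.get(neighbor) ?? 0) + boosted);
    });
  });

  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}
```

With a single hit for the contract, a referenced SOW and a semantically related email both enter the results, ranked below the contract itself.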
Real-Time Updates Without Rebuilding
One architectural decision we made early: never rebuild the entire graph. Instead, we use incremental updates.
When you add a tag to a document, we:
- Create a new TAGGED_WITH edge
- Invalidate the cache for that document
- Let SWR refetch the document's metadata on next access
When you delete a document, we:
- Soft-delete the document (mark as deleted)
- Remove all edges pointing to it
- Cascade the deletion through the graph
This means the graph stays in sync with your documents without expensive rebuild operations. A user with 10,000 documents can tag a document and see the change reflected in search results within 100ms.
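Both incremental updates can be sketched against a tiny in-memory graph. The `Graph` class and its field names are hypothetical, and treating "cascade" as removing every edge that touches the deleted document in either direction is an assumption.

```typescript
interface Doc {
  id: string;
  deletedAt: number | null; // null while the document is live
}

interface GraphEdge {
  type: string;
  source: string;
  target: string;
}

// In-memory stand-in for the documents and edges tables.
class Graph {
  docs = new Map<string, Doc>();
  edges: GraphEdge[] = [];

  // Tagging adds one edge; nothing else in the graph is touched.
  addTag(docId: string, tag: string): void {
    const exists = this.edges.some(
      (e) => e.type === "TAGGED_WITH" && e.source === docId && e.target === tag
    );
    if (!exists) {
      this.edges.push({ type: "TAGGED_WITH", source: docId, target: tag });
    }
  }

  // Soft-delete the document, then drop every edge that touches it.
  // Returns the number of edges removed.
  deleteDocument(docId: string, now: number = Date.now()): number {
    const doc = this.docs.get(docId);
    if (!doc || doc.deletedAt !== null) return 0;
    doc.deletedAt = now;
    const before = this.edges.length;
    this.edges = this.edges.filter(
      (e) => e.source !== docId && e.target !== docId
    );
    return before - this.edges.length;
  }
}
```

Each operation touches only the rows for one document, which is what keeps an update cheap regardless of how large the rest of the graph is.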
Why This Matters for You
If you're building a document system, you might think: "Why not just use a traditional graph database like Neo4j?"
Graph databases are powerful, but they're overkill for this use case. They add operational complexity (another service to run, monitor, and scale), latency (network round-trips), and cost. Our approach trades some query flexibility for simplicity and speed.
The 8-edge model is opinionated. It doesn't let you model arbitrary relationships. But that's intentional. We've found that 95% of document relationships fall into one of these 8 categories. The remaining 5% can be handled with tags or manual notes.
For power users: This architecture is why Batch Operations work so well. When you select 50 documents and tag them, you're not triggering 50 individual graph updates—you're creating 50 edges in a single transaction. That's why moving or tagging documents in AiFiler feels instant, even at scale.
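The batching idea can be sketched as building every edge row first and handing them to the storage layer in one call. `InsertMany` is a hypothetical callback standing in for a single transactional write, and `tagDocuments` is an illustrative name.

```typescript
interface TagEdge {
  type: "TAGGED_WITH";
  source: string;
  target: string;
}

// Hypothetical storage callback: one call represents one transaction.
type InsertMany = (rows: TagEdge[]) => void;

// Tag many documents with a single write instead of one write per document.
function tagDocuments(
  docIds: string[],
  tag: string,
  insertMany: InsertMany
): number {
  // De-duplicate so a repeated selection does not produce repeated edges.
  const unique = Array.from(new Set(docIds));
  const rows = unique.map(
    (id): TagEdge => ({ type: "TAGGED_WITH", source: id, target: tag })
  );
  if (rows.length > 0) insertMany(rows);
  return rows.length;
}
```

Selecting 50 documents then costs one round-trip to the database rather than 50.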
For teams: The MENTIONED_IN and AUTHORED_BY edges mean your knowledge graph becomes a people graph too. Search for a person's name, and you see every document they've touched. This is invaluable for onboarding new team members or understanding who owns what.
For AI features: The graph powers everything from Universal Command's intent routing to the Knowledge Graph visualization. When Claude needs context to answer a question, it doesn't read all 500 documents—it reads the neighborhood of the most relevant document. This is why AiFiler's AI answers are fast and accurate.
The Tradeoffs We Made
We didn't invent a new database or query language. We built on top of Supabase (PostgreSQL) with application-level logic. This means:
- Simpler operations: No graph query language to learn or optimize
- Easier debugging: Edges are just rows in a table
- Better for teams: Your DBA understands SQL; they don't need to learn Cypher
The tradeoff: complex multi-hop queries (like "find all documents related to people who worked on projects with Acme") require multiple queries. But those queries are rare. Most real-world searches are 1–2 hops deep.
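To show what "multiple queries" means in practice, here is a sketch of a multi-hop traversal built from repeated one-hop lookups. Each loop iteration stands in for one database round-trip; `oneHop` and `neighborhood` are hypothetical helpers operating on an in-memory edge list.

```typescript
interface HopEdge {
  type: string;
  source: string;
  target: string;
}

// One hop: every node directly connected to any node in the frontier.
function oneHop(edges: HopEdge[], frontier: Set<string>): Set<string> {
  const next = new Set<string>();
  edges.forEach((e) => {
    if (frontier.has(e.source)) next.add(e.target);
    if (frontier.has(e.target)) next.add(e.source);
  });
  return next;
}

// Breadth-first expansion up to maxHops; one oneHop call per hop.
function neighborhood(
  edges: HopEdge[],
  start: string,
  maxHops: number
): Set<string> {
  const seen = new Set<string>([start]);
  let frontier = new Set<string>([start]);
  for (let hop = 0; hop < maxHops && frontier.size > 0; hop++) {
    const next = new Set<string>();
    oneHop(edges, frontier).forEach((id) => {
      if (!seen.has(id)) {
        seen.add(id);
        next.add(id);
      }
    });
    frontier = next;
  }
  seen.delete(start); // the start node is not part of its own neighborhood
  return seen;
}
```

The cost grows with the number of hops, which is acceptable when most searches stop after one or two.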
We also chose to keep edge metadata minimal. An edge has a type, source, target, confidence score, and timestamp. No custom properties. This keeps the schema simple and the indexing fast.
What's Next
We're exploring two directions:
- Temporal edges — tracking how relationships evolve over time. Did this document reference another document only in 2024? The graph should know.
- Weighted traversal — letting users tune how much weight different edge types get during search. Some teams care more about authorship; others care more about semantic similarity.
But we're cautious about adding complexity. The current architecture works because it's simple. Every new edge type or feature has to justify itself against the cost of increased latency and maintenance burden.
The knowledge graph isn't the flashiest part of AiFiler. It's not something you see in the UI (though you see its effects every time you search). But it's the foundation that makes everything else possible—fast search, smart recommendations, and AI that actually understands your documents instead of just pattern-matching on keywords.
That's worth building right.