Hippocampus

Memories for your Kaba experience.

In the human brain, the hippocampus orchestrates the consolidation of information from short-term memory to long-term memory, as well as the spatial memory that enables navigation. It acts much the same way in Kaba.

  • Observable memory acquisition
  • Automated memory acquisition

Kaba’s memory extraction tool uses site-specific extraction rules to improve results. Each time a URL is processed, the tool checks whether extraction rules exist for that site. If no rules are found, it tries to detect the content block automatically.

The quickest and simplest way to create a rule is to use our point-and-click interface. It’s a simple tool, intended only to create a rule that extracts the correct content block.

Kaba’s custom memory extraction rules are stored in ~/.config/kaba/custom-rules.

For further refinements, e.g. selecting the title, stripping elements, dealing with multi-page articles, please see our help page.

Use git.djcas9.com.txt for

Use .djcas9.com.txt for

  • sport.djcas9.com
  • news.djcas9.com
  • environment.djcas9.com
  • etc.

Use sport.djcas9.com.txt to target just that sub-domain:

  • sport.djcas9.com

Note: .djcas9.com.txt will not match www.djcas9.com or djcas9.com
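
The file-naming convention above can be sketched as a small matcher. This is a minimal Python illustration, not Kaba's actual lookup code: the function name `rule_file_matches` is hypothetical, and the behavior (a leading dot matches sub-domains but not the bare domain or its `www.` host) is inferred only from the examples and the note above.

```python
def rule_file_matches(filename: str, host: str) -> bool:
    """Return True if a rule file name applies to the given host (assumed semantics)."""
    if not filename.endswith(".txt"):
        return False
    pattern = filename[: -len(".txt")].lower()
    host = host.lower().rstrip(".")

    if pattern.startswith("."):
        # Wildcard file: ".djcas9.com.txt" covers sub-domains such as
        # sport.djcas9.com, but not djcas9.com or www.djcas9.com.
        base = pattern[1:]
        if host in (base, "www." + base):
            return False
        return host.endswith(pattern)

    # Plain file: "sport.djcas9.com.txt" targets exactly that host.
    return host == pattern


if __name__ == "__main__":
    for name, host in [
        (".djcas9.com.txt", "news.djcas9.com"),
        (".djcas9.com.txt", "www.djcas9.com"),
        ("sport.djcas9.com.txt", "sport.djcas9.com"),
    ]:
        print(name, host, rule_file_matches(name, host))
```

Checking the host with `endswith` on the dotted pattern means an unrelated host such as `notdjcas9.com` is not matched by accident.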

kabactl monitors the custom-rules directory for modifications, additions, and deletions. After a new rule is added, navigate in the Kaba UI to the site you are testing and press Ctrl + S to create a new memory. You can validate that the information is being parsed correctly by opening the memory in the Hippocampus.
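
To make the idea of watching a rules directory concrete, here is a minimal polling sketch in Python. It is not how kabactl is implemented (that is not documented here); the helper names `snapshot` and `diff_snapshots` are hypothetical, and a real watcher would more likely use OS file-change notifications than polling.

```python
import os


def snapshot(directory):
    """Map each .txt rule file name to an (mtime_ns, size) signature."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".txt"):
                stat = entry.stat()
                result[entry.name] = (stat.st_mtime_ns, stat.st_size)
    return result


def diff_snapshots(old, new):
    """Return (added, modified, deleted) file names between two snapshots."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(
        name for name in old.keys() & new.keys() if old[name] != new[name]
    )
    return added, modified, deleted
```

A caller would take a snapshot of `~/.config/kaba/custom-rules` (after `os.path.expanduser`), wait, take another, and reload any rule whose name appears in the `added` or `modified` lists.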

Kaba & article_scraper: Creating Observable Memory

At the core of Kaba’s ability to learn and recall information is its integration with the Rust-based article_scraper crate. While Kaba comes pre-packaged with thousands of rules covering roughly 80% of the public web, its true power lies in its custom rule parsing syntax. This allows Kaba to treat internal, self-hosted, or enterprise-grade applications—like a private Forgejo instance—as a structured, observable memory bank.

Kaba uses a declarative syntax to tell article_scraper exactly what matters on a page and what is “noise.” This is particularly vital for developers who host their own infrastructure to avoid the tracking or adversarial nature of centralized platforms. Using our internal source code host, git.djcas9.com (based on Forgejo), as an example, here is how we define a scraping profile:

# 1. SCOPE: Define where these rules apply
# This ensures Kaba only uses this logic on your specific domain or URL pattern.
condition: contains($url, "git.djcas9.com")
# 2. SELECTION: Identify the "Meat"
# We use XPath to prioritize content.
# Here, we target rendered Markdown first, then raw code views, then general repo content.
body: //div[contains(@class, 'markup')] | //div[contains(@class, 'file-view')] | //div[@id='repo-content']
# 3. CLEANUP: Strip the UI Noise
# To keep the "memory" clean, we remove line numbers, navigation tabs, and buttons.
strip: //td[contains(@class, 'lines-num')]
strip: //div[contains(@class, 'file-header-far')]
strip: //div[contains(@class, 'ui-tabs-nav')]
strip: //button
# 4. METADATA: Title Extraction
# This sets the "Header" of the memory entry for easier retrieval later.
title: //div[contains(@class, 'repo-title-wrapper')] | //span[@id='issue-title']
# 5. OPTIMIZATION: Prune
# Setting prune to 'yes' tells Kaba to discard any HTML not explicitly matched in 'body'.
prune: yes
# 6. VALIDATION
test_url: https://git.djcas9.com
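
For readers who want to inspect a rule file programmatically, the `directive: value` layout shown above can be read with a few lines of Python. This is only an illustrative sketch: `parse_rules` is a hypothetical helper, not part of Kaba or article_scraper, and it assumes that `#` starts a comment line and that directives such as `strip` may repeat.

```python
from collections import defaultdict


def parse_rules(text):
    """Parse 'directive: value' lines into {directive: [values, ...]}."""
    rules = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so URLs and XPath axes survive intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a directive line
        rules[key.strip()].append(value.strip())
    return dict(rules)


SAMPLE = """\
# CLEANUP
body: //div[@id='repo-content']
strip: //button
strip: //td[contains(@class, 'lines-num')]
prune: yes
test_url: https://git.djcas9.com
"""

rules = parse_rules(SAMPLE)
print(rules["strip"])
```

Repeated directives are collected in order, which mirrors how several `strip` lines are stacked in the profile above.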

For most users, the default rules are invisible and “just work.” However, for Enterprise and Self-Hosted environments, custom rules provide three distinct advantages:

  • Precision: By stripping out line numbers and UI buttons, Kaba’s LLM processes only the logic of your code or the substance of your internal discussions, reducing token waste and improving accuracy.

  • Privacy: Since the scraping logic happens within your Kaba instance, you can index internal tools (like Forgejo, Jira, or internal Wikis) without exposing that data to external scrapers.

  • Contextual Awareness: Kaba treats your internal documentation as a first-class citizen, allowing you to query your private codebase with the same ease as a public StackOverflow thread.

The article_scraper crate processes these rules using a highly efficient Rust backend, ensuring that even large internal repositories can be indexed into Kaba’s observable memory with minimal latency.

Content

Kaba memories are a verbatim data structure representing any interaction, communication, or experience with any core Kaba functionality (websites, apps, peripherals via WASM, etc.). Every website load and every JavaScript navigation triggers the Kaba memory acquisition process. This process captures everything salient on a given page (title, content, links, stylesheets, scripts, video, and audio) and sends it to kabactl to be vectorized and stored in a versioned manner inside LanceDB.

Kaba also captures screenshots of memories and watermarks them for enforceable and verifiable credibility (veracode / norton). The Hippocampus in Kaba facilitates CRUD-style interactions with these memories: not only with individual memories, but also with their interconnectivity and their overlap, union, and intersection.

Memories can also hold metadata; context and prompt are two examples. This is useful when, for example, a particular channel on a Matrix chat server needs extra information (username to actual name; channel to topic). The prompt gives more direction on how to process and expand on interactions with that memory, e.g. “Show this as a conversation between two people”.

  • Prompt: direction on how to present a memory (how to render it).
  • Context: direction on how to process a memory (how to interpret it).
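
As a rough illustration of how context and prompt metadata might travel with a memory, consider the Python sketch below. The `Memory` class, its field names, and `build_llm_input` are all hypothetical; they are not Kaba's actual schema, only a way to show context steering interpretation and prompt steering presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical, simplified shape of a stored memory."""
    url: str
    title: str
    content: str
    links: list[str] = field(default_factory=list)
    version: int = 1
    context: str = ""  # how to interpret the memory
    prompt: str = ""   # how to render the memory


def build_llm_input(memory):
    """Assemble text for a model: context first, then content, then prompt."""
    parts = []
    if memory.context:
        parts.append(f"Context: {memory.context}")
    parts.append(f"Memory ({memory.title}):\n{memory.content}")
    if memory.prompt:
        parts.append(f"Instruction: {memory.prompt}")
    return "\n\n".join(parts)


chat = Memory(
    url="https://matrix.example.org/#/room/dev",  # placeholder URL
    title="dev channel",
    content="<jd> build is green\n<ak> shipping it",
    context="jd = Jane Doe; ak = Alex Kim; channel topic: releases",
    prompt="Show this as a conversation between two people",
)
print(build_llm_input(chat))
```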

Searching Memories

The search functionality in the Hippocampus runs a full embedding similarity search against all memories and their versions. Models convert natural-language time constraints into programmatic timestamps for better memory management. E.g. ‘last Wednesday’ is automatically inferred to mean the prior Wednesday.
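
The ‘last Wednesday’ example can be made concrete with a small rule-based stand-in. Kaba is described as using models for this step; the sketch below deliberately swaps in plain date arithmetic to show what the resulting timestamp range could look like. The function name `resolve_last_weekday` and the choice of a midnight-to-midnight range are assumptions.

```python
from datetime import date, datetime, time, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def resolve_last_weekday(phrase, today):
    """Turn 'last <weekday>' into a (start, end) datetime range for that day."""
    words = phrase.lower().split()
    if len(words) != 2 or words[0] != "last" or words[1] not in WEEKDAYS:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    target = WEEKDAYS.index(words[1])
    # Days back to the most recent matching weekday strictly before today.
    delta = (today.weekday() - target) % 7 or 7
    day = today - timedelta(days=delta)
    start = datetime.combine(day, time.min)
    return start, start + timedelta(days=1)


# Friday 17 May 2024 -> Wednesday 15 May 2024, 00:00 up to 16 May, 00:00.
start, end = resolve_last_weekday("last Wednesday", date(2024, 5, 17))
print(start.isoformat(), end.isoformat())
```

The resulting range can then be used as a filter on memory timestamps before or after the similarity ranking.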

Using Memories for Better Models

All memories are stored in industry-standard structures that facilitate direct memory <=> model production. Everything salient in a given memory becomes a feature that can be trained on. E.g.: all faces in images I have seen => create a visual recognition model. All of the JavaScript I’ve loaded => build a fingerprinting and recognition model. All text I’ve written => produce a large language model.

Physical Memories

Memories from digital communications are only the beginning. Using modern web APIs (Serial, HID, Bluetooth, USB), Kaba can directly siphon information from embedded devices, computers, or peripherals, either to enhance an existing memory or to create a corpus of brand-new types of memories. This leads to truly hardware-responsive interactions. If the system understands the hardware, context and LLM interactions reap the benefit.

What about legacy applications / operating system level context and memories?

Built into Kaba are x86- and WASM-based virtualization technologies that allow emulation of any legacy or modern operating system, keeping these interactions inside the world of context that Kaba can observe. Everything outside Kaba should remain private (i.e. your OS should remain private to you; you’re not a monkey in a zoo). Going further than Kaba and introducing AI at the system level introduces too much risk and vulnerability. There is such a thing as too much context; ask Splunk. With too much data, costs skyrocket and it becomes impossible to find the needle in the haystack. Add what matters, not everything. Maps v. territories. There’s a reason it’s called attention, but frontier models are built on ADHD. You don’t capture the world and call it attention. Also, whose attention? Train on your attention, not your distraction.

Autonomous memory collection = MCP is redundant

Inline acquisition of context by the act of observation means the mechanisms of a browser solve all the infrastructure and business logic required to infer the context and interconnectivity of the data being viewed from its URL. E.g. free-floating information about current news is harder for a computer to place than a page on CNN.com: given “Today there was a riot in Minneapolis”, where do you tie that into what you’re doing? URLs = natural context indexing. You can’t go to CNN.com to do math research.
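
The idea of URLs as a natural context index can be illustrated with a tiny grouping function. This is a sketch under assumptions, not Kaba's indexing code: `index_by_host` is a hypothetical helper, the example paths are made up, and a real index would likely key on more than the hostname.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def index_by_host(urls):
    """Group observed URLs by hostname, a coarse stand-in for 'context'."""
    index = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        index[host].append(url)
    return dict(index)


observed = [
    "https://www.cnn.com/us/example-story",  # placeholder paths
    "https://git.djcas9.com/org/repo",
    "https://www.cnn.com/world",
]
by_host = index_by_host(observed)
print(sorted(by_host))
```

Everything observed under `www.cnn.com` lands in one bucket and everything from the source code host in another, without any content analysis at all.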

Memory acquisition and processing

Parsing Engine: site-specific article extraction rules to aid content extractors, feed readers, and ‘read later’ applications. Include read-me file naming.

Memories to Models