hnsw

package module
v0.4.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 14, 2025 License: MIT Imports: 18 Imported by: 5

README

HNSW - Hierarchical Navigable Small World Graphs in Go

This library is a fork of the original implementation by Coder, enhanced with additional features, optimizations, and extensions. We acknowledge and thank the original authors for their excellent work.

Overview

Package hnsw implements Hierarchical Navigable Small World graphs in Go, providing an efficient solution for approximate nearest neighbor search in high-dimensional vector spaces. HNSW graphs enable fast similarity search operations with logarithmic complexity, making them ideal for applications like semantic search, recommendation systems, and image similarity.

Operation Complexity Description
Insert $O(log(n))$ Insert a vector into the graph
Delete $O(M^2 \cdot log(n))$ Delete a vector from the graph
Search $O(log(n))$ Search for the nearest neighbors of a vector
Lookup $O(1)$ Retrieve a vector by ID

The library also includes extensions for metadata storage, faceted search, and other advanced features.

Hybrid Strategy

We introduce a hybrid strategy in this library that combines multiple search techniques, including Hierarchical Navigable Small World (HNSW) graphs, exact search, and Locality-Sensitive Hashing (LSH), to optimize performance across diverse scenarios. This approach dynamically selects the most suitable method based on dataset characteristics, ensuring efficient and accurate approximate nearest neighbor search even in challenging conditions such as high-dimensional data and varying dataset sizes. By leveraging the strengths of each technique, the hybrid strategy provides a solution for applications requiring high performance and adaptability.

Full Implementation

To see the library fully exercised, see Quiver

Installation

go get github.com/TFMV/hnsw@main

Basic Usage

// Create a new graph with default parameters
g := hnsw.NewGraph[int]()

// Add some vectors
g.Add(
    hnsw.MakeNode(1, []float32{1, 1, 1}),
    hnsw.MakeNode(2, []float32{1, -1, 0.999}),
    hnsw.MakeNode(3, []float32{1, 0, -0.5}),
)

// Search for the nearest neighbor
neighbors, err := g.Search(
    []float32{0.5, 0.5, 0.5},
    1,
)
if err != nil {
    log.Fatalf("failed to search graph: %v", err)
}
fmt.Printf("best match: %v\n", neighbors[0].Value)

Advanced Distance Calculations

The library provides a flexible and type-safe approach to distance calculations through the vectortypes package. This allows you to create custom distance functions that work with any type of data.

Using VectorDistance with Custom Types
// Define a custom type
type Document struct {
    ID        string
    Embedding []float32
    Category  string
}

// Create a custom distance function
documentDistance := hnsw.NewVectorDistance(
    vectortypes.ContraMap[vectortypes.F32, Document]{
        Surface: vectortypes.CreateSurface(hnsw.ToVectorTypesDistanceFunc(hnsw.CosineDistance)),
        ContraMap: func(doc Document) vectortypes.F32 {
            return doc.Embedding
        },
    },
)

// Calculate distance between documents
doc1 := Document{ID: "doc1", Embedding: []float32{1.0, 0.0, 0.0}, Category: "A"}
doc2 := Document{ID: "doc2", Embedding: []float32{0.0, 1.0, 0.0}, Category: "B"}
distance := documentDistance.Distance(doc1, doc2)
Creating a Weighted Distance Function
// Define a custom type
type Product struct {
    ID        string
    Embedding []float32
    Category  string
    Price     float32
}

// Define a custom surface
type ProductSurface struct{}

// Implement the Distance method
func (s ProductSurface) Distance(a, b Product) float32 {
    // Vector similarity component (70% weight)
    vectorDist := hnsw.CosineDistance(a.Embedding, b.Embedding) * 0.7
    
    // Category matching component (20% weight)
    var categoryDist float32
    if a.Category == b.Category {
        categoryDist = 0.0
    } else {
        categoryDist = 1.0
    }
    categoryDist *= 0.2
    
    // Price similarity component (10% weight)
    priceDiff := a.Price - b.Price
    if priceDiff < 0 {
        priceDiff = -priceDiff
    }
    priceDist := (priceDiff / 1000.0) * 0.1
    if priceDist > 0.1 {
        priceDist = 0.1
    }
    
    return vectorDist + categoryDist + priceDist
}

// Create a custom distance calculator
productDistance := hnsw.NewVectorDistance[Product](ProductSurface{})

Thread-Safe Operations

The library supports concurrent operations with a thread-safe implementation:

// Create a thread-safe graph with custom parameters
g, err := hnsw.NewGraphWithConfig[int](16, 0.25, 20, hnsw.EuclideanDistance)
if err != nil {
    log.Fatalf("failed to create graph: %v", err)
}

// Perform concurrent operations
var wg sync.WaitGroup
numOperations := 10

// Concurrent searches
wg.Add(numOperations)
for i := 0; i < numOperations; i++ {
    go func(i int) {
        defer wg.Done()
        query := []float32{float32(i) * 0.1, float32(i) * 0.1, float32(i) * 0.1}
        results, err := g.Search(query, 1)
        if err != nil {
            log.Printf("Search error: %v", err)
            return
        }
        fmt.Printf("Search %d found: %v\n", i, results[0].Key)
    }(i)
}

wg.Wait()

Persistence

The library provides facilities for saving and loading graphs from persistent storage:

path := "my_graph.hnsw"
g1, err := LoadSavedGraph[int](path)
if err != nil {
    panic(err)
}

// Insert some vectors
for i := 0; i < 128; i++ {
    g1.Add(hnsw.MakeNode(i, []float32{float32(i)}))
}

// Save to disk
err = g1.Save()
if err != nil {
    panic(err)
}

// Later...
// g2 is a copy of g1
g2, err := LoadSavedGraph[int](path)
if err != nil {
    panic(err)
}

Advanced Features

Batch Operations

For high-throughput scenarios, batch operations reduce lock contention:

// Add a batch of nodes in a single operation
batch := make([]hnsw.Node[int], 5)
for i := range batch {
    nodeID := 100 + i
    vector := []float32{float32(i) * 0.5, float32(i) * 0.5, float32(i) * 0.5}
    batch[i] = hnsw.MakeNode(nodeID, vector)
}
g.BatchAdd(batch)

// Perform multiple searches in a single operation
queries := [][]float32{
    {0.1, 0.1, 0.1},
    {0.2, 0.2, 0.2},
    {0.3, 0.3, 0.3},
}
batchResults, _ := g.BatchSearch(queries, 2)

// Delete multiple nodes in a single operation
keysToDelete := []int{100, 101, 102}
deleteResults := g.BatchDelete(keysToDelete)
Negative Examples

Find vectors similar to your query but dissimilar to specified negative examples:

// Search with a single negative example
dogQuery := []float32{1.0, 0.2, 0.1, 0.0}      // dog query
puppyNegative := []float32{0.9, 0.3, 0.2, 0.1} // puppy (negative example)

// Find dog-related concepts but not puppies (negativeWeight = 0.5)
results, err := g.SearchWithNegative(dogQuery, puppyNegative, 3, 0.5)

// Search with multiple negative examples
petQuery := []float32{0.3, 0.3, 0.3, 0.3}      // general pet query
negatives := []hnsw.Vector{dogNegative, catNegative}

// Find pet-related concepts but not dogs or cats (negativeWeight = 0.7)
results, err = g.SearchWithNegatives(petQuery, negatives, 3, 0.7)
Quality Metrics

Evaluate graph structure and performance:

analyzer := hnsw.Analyzer[int]{Graph: graph}
metrics := analyzer.QualityMetrics()

fmt.Printf("Node count: %d\n", metrics.NodeCount)
fmt.Printf("Average connectivity: %.2f\n", metrics.AvgConnectivity)
fmt.Printf("Layer balance: %.2f\n", metrics.LayerBalance)
fmt.Printf("Distortion ratio: %.2f\n", metrics.DistortionRatio)

Extensions

The library includes several extensions for enhanced functionality:

Metadata Extension

Store and retrieve JSON metadata alongside vectors:

// Create a graph and metadata store
graph := hnsw.NewGraph[int]()
store := meta.NewMemoryMetadataStore[int]()
metadataGraph := meta.NewMetadataGraph(graph, store)

// Create a node with metadata
node := hnsw.MakeNode(1, []float32{0.1, 0.2, 0.3})
metadata := map[string]interface{}{
    "name":     "Product 1",
    "category": "Electronics",
    "price":    999.99,
    "tags":     []string{"smartphone", "5G", "camera"},
}

// Add the node with metadata
metadataNode, err := meta.NewMetadataNode(node, metadata)
if err != nil {
    log.Fatalf("Failed to create metadata node: %v", err)
}

err = metadataGraph.Add(metadataNode)
if err != nil {
    log.Fatalf("Failed to add node: %v", err)
}

// Search with metadata
query := []float32{0.1, 0.2, 0.3}
results, err := metadataGraph.Search(query, 10)
Faceted Search Extension

Filter and aggregate search results based on facets:

// Create a faceted graph
graph := hnsw.NewGraph[int]()
store := facets.NewMemoryFacetStore[int]()
facetedGraph := facets.NewFacetedGraph(graph, store)

// Add a node with facets
node := hnsw.MakeNode(1, []float32{0.1, 0.2, 0.3})
productFacets := []facets.Facet{
    facets.NewBasicFacet("category", "Electronics"),
    facets.NewBasicFacet("price", 999.99),
    facets.NewBasicFacet("brand", "TechCo"),
}

facetedNode := facets.NewFacetedNode(node, productFacets)
facetedGraph.Add(facetedNode)

// Search with facet filters
query := []float32{0.1, 0.2, 0.3}
priceFilter := facets.NewRangeFilter("price", 0, 1000)
categoryFilter := facets.NewEqualityFilter("category", "Electronics")

results, err := facetedGraph.Search(
    query,
    []facets.FacetFilter{priceFilter, categoryFilter},
    10,
)

Performance Considerations

  • Distance Function Selection: Choose the appropriate distance function for your data. Cosine distance is often used for text embeddings, while Euclidean distance is common for image features.
  • Parameter Tuning: Adjust M, EfConstruction, and EfSearch parameters based on your dataset size and performance requirements.
  • Memory Usage: The memory usage grows linearly with the number of vectors and their dimensionality.
  • Batch Operations: Use batch operations for better performance when adding or removing multiple vectors.
  • Custom Distance Functions: While custom distance functions provide flexibility, they may introduce overhead. Benchmark your implementation to ensure it meets your performance requirements.

Benchmarks

The following benchmarks compare different distance calculation approaches:

BenchmarkDirectDistanceFunc-10          10068204               106.6 ns/op
BenchmarkBasicSurface-10                11110857               107.5 ns/op
BenchmarkContraMap-10                   10687495               113.1 ns/op
BenchmarkVectorDistance-10              10065565               139.6 ns/op
BenchmarkCustomSurface-10               11162084               126.6 ns/op
BenchmarkWeightedDistance-10            10984856               109.2 ns/op
BenchmarkHNSWNodeDistance-10             9822542               123.1 ns/op

Examples

For complete examples, see the examples directory:

  • examples/basic/main.go: Basic usage of the HNSW graph
  • examples/optimized_distance/main.go: Demonstrates the optimized distance calculation approach
  • examples/custom_distance/main.go: Shows how to create a custom distance function that combines vector similarity with metadata

Contributing

Contributions to improve HNSW-Go are welcome. Please feel free to submit issues or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CosineDistance

func CosineDistance(a, b []float32) float32

CosineDistance computes the cosine distance between two vectors.

func CreateSurface added in v0.4.0

func CreateSurface(fn DistanceFunc) vectortypes.Surface[vectortypes.F32]

CreateSurface creates a vectortypes.Surface from a DistanceFunc

func EuclideanDistance

func EuclideanDistance(a, b []float32) float32

EuclideanDistance computes the Euclidean distance between two vectors.

func NodeSurface added in v0.4.0

func NodeSurface[K cmp.Ordered](fn DistanceFunc) vectortypes.Surface[Node[K]]

NodeSurface creates a vectortypes.Surface for Node types

func RegisterDistanceFunc

func RegisterDistanceFunc(name string, fn DistanceFunc)

RegisterDistanceFunc registers a distance function with a name. A distance function must be registered here before a graph can be exported and imported.

func ToVectorTypesDistanceFunc added in v0.4.0

func ToVectorTypesDistanceFunc(fn DistanceFunc) vectortypes.DistanceFunc

ToVectorTypesDistanceFunc converts a standard DistanceFunc to a vectortypes.DistanceFunc

Types

type Analyzer

type Analyzer[K cmp.Ordered] struct {
	Graph *Graph[K]
}

Analyzer is a struct that holds a graph and provides methods for analyzing it. It offers no compatibility guarantee as the methods of measuring the graph's health with change with the implementation.

func (*Analyzer[T]) Connectivity

func (a *Analyzer[T]) Connectivity() []float64

Connectivity returns the average number of edges in the graph for each non-empty layer.

func (*Analyzer[T]) Height

func (a *Analyzer[T]) Height() int

func (*Analyzer[T]) QualityMetrics

func (a *Analyzer[T]) QualityMetrics() GraphQualityMetrics

QualityMetrics calculates various quality metrics for the graph. Returns a struct containing metrics that evaluate the graph's quality.

func (*Analyzer[T]) Topography

func (a *Analyzer[T]) Topography() []int

Topography returns the number of nodes in each layer of the graph.

type DistanceFunc

type DistanceFunc func(a, b []float32) float32

DistanceFunc is a function that computes the distance between two vectors.

type Graph

type Graph[K cmp.Ordered] struct {
	// Distance is the distance function used to compare embeddings.
	Distance DistanceFunc

	// Rng is used for level generation. It may be set to a deterministic value
	// for reproducibility. Note that deterministic number generation can lead to
	// degenerate graphs when exposed to adversarial inputs.
	Rng *rand.Rand

	// M is the maximum number of neighbors to keep for each node.
	// A good default for OpenAI embeddings is 16.
	M int

	// Ml is the level generation factor.
	// E.g., for Ml = 0.25, each layer is 1/4 the size of the previous layer.
	Ml float64

	// EfSearch is the number of nodes to consider in the search phase.
	// 20 is a reasonable default. Higher values improve search accuracy at
	// the expense of memory.
	EfSearch int
	// contains filtered or unexported fields
}

Graph is a Hierarchical Navigable Small World graph. All public parameters must be set before adding nodes to the graph. K is cmp.Ordered instead of of comparable so that they can be sorted.

Parameter Tuning Guide:

M: The maximum number of connections per node.

  • Higher values improve search accuracy but increase memory usage and build time.
  • Lower values reduce memory usage but may decrease search accuracy.
  • Recommended range: 8-64, with 16 being a good default for most use cases.
  • For high-dimensional data (>1000 dimensions), higher M values (32-64) often work better.
  • For lower-dimensional data (<100 dimensions), lower M values (8-16) are usually sufficient.

Ml: The level generation factor, controlling the graph hierarchy.

  • Determines the probability of a node being promoted to higher layers.
  • Lower values (e.g., 0.1) create more layers with fewer nodes in higher layers.
  • Higher values (e.g., 0.5) create fewer layers with more nodes in higher layers.
  • Recommended range: 0.1-0.5, with 0.25 being a good default.
  • For very large graphs (>1M nodes), lower values (0.1-0.2) often work better.

EfSearch: The size of the dynamic candidate list during search.

  • Higher values improve search accuracy but increase search time.
  • Lower values speed up search but may decrease accuracy.
  • Recommended range: 20-200, with 20-50 being good defaults.
  • For applications requiring high recall, use higher values (100-200).
  • For applications prioritizing speed, use lower values (20-50).

Distance: The distance function used to compare vectors.

  • CosineDistance is recommended for normalized embeddings (e.g., OpenAI embeddings).
  • EuclideanDistance is recommended for non-normalized embeddings.
  • The choice of distance function should match the properties of your data.

Memory Usage: The memory overhead of a graph is approximately:

  • Base layer: n * d * 4 bytes (for the vectors themselves)
  • Graph structure: n * log(n) * sizeof(key) * M bytes

Where n is the number of vectors, d is the dimensionality, and sizeof(key) is the size of the key in bytes.

func NewGraph

func NewGraph[K cmp.Ordered]() *Graph[K]

NewGraph returns a new graph with default parameters, roughly designed for storing OpenAI embeddings.

func NewGraphWithConfig

func NewGraphWithConfig[K cmp.Ordered](m int, ml float64, efSearch int, distance DistanceFunc) (*Graph[K], error)

NewGraphWithConfig returns a new graph with the specified parameters. It validates the configuration and returns an error if any parameter is invalid.

func (*Graph[K]) Add

func (g *Graph[K]) Add(nodes ...Node[K]) error

Add inserts nodes into the graph. If another node with the same ID exists, it is replaced.

func (*Graph[K]) BatchAdd

func (g *Graph[K]) BatchAdd(nodes []Node[K]) error

BatchAdd adds multiple nodes to the graph in a single operation. This is more efficient than calling Add multiple times when adding many nodes as it only acquires the lock once.

func (*Graph[K]) BatchDelete

func (h *Graph[K]) BatchDelete(keys []K) []bool

BatchDelete removes multiple nodes from the graph in a single operation. This is more efficient than calling Delete multiple times when deleting many nodes as it only acquires the lock once. Returns a slice of booleans indicating whether each key was successfully deleted.

func (*Graph[K]) BatchSearch

func (g *Graph[K]) BatchSearch(queries []Vector, k int) ([][]Node[K], error)

BatchSearch performs multiple searches in a single operation. This is more efficient than calling Search multiple times when performing many searches as it only acquires the lock once.

func (*Graph[K]) BatchSearchWithNegatives

func (g *Graph[K]) BatchSearchWithNegatives(queries []Vector, negatives [][]Vector, k int, negWeight float32) ([][]Node[K], error)

BatchSearchWithNegatives performs multiple searches with negative examples in a single operation. This is more efficient than calling SearchWithNegatives multiple times when performing many searches as it only acquires the lock once.

func (*Graph[K]) Delete

func (h *Graph[K]) Delete(key K) bool

Delete removes a node from the graph by key. It tries to preserve the clustering properties of the graph by replenishing connectivity in the affected neighborhoods.

func (*Graph[K]) Dims

func (g *Graph[K]) Dims() int

Dims returns the number of dimensions in the graph, or 0 if the graph is empty.

func (*Graph[K]) Export

func (h *Graph[K]) Export(w io.Writer) error

Export writes the graph to a writer.

T must implement io.WriterTo.

func (*Graph[K]) Import

func (h *Graph[K]) Import(r io.Reader) error

Import reads the graph from a reader. T must implement io.ReaderFrom. The imported graph does not have to match the exported graph's parameters (except for dimensionality). The graph will converge onto the new parameters.

func (*Graph[K]) Len

func (h *Graph[K]) Len() int

Len returns the number of nodes in the graph.

func (*Graph[K]) Lookup

func (h *Graph[K]) Lookup(key K) (Vector, bool)

Lookup returns the vector with the given key.

func (*Graph[K]) ParallelSearch

func (h *Graph[K]) ParallelSearch(near Vector, k int, numWorkers int) ([]Node[K], error)

ParallelSearch finds the k nearest neighbors from the target node using parallel processing. It's optimized for large graphs and high-dimensional data. The numWorkers parameter controls the level of parallelism. If set to 0, it defaults to the number of CPU cores.

func (*Graph[K]) Search

func (h *Graph[K]) Search(near Vector, k int) ([]Node[K], error)

Search finds the k nearest neighbors from the target node.

func (*Graph[K]) SearchWithNegative

func (h *Graph[K]) SearchWithNegative(near Vector, negative Vector, k int, negWeight float32) ([]Node[K], error)

SearchWithNegative finds the k nearest neighbors from the target node while avoiding vectors similar to the negative example. The negWeight parameter controls the influence of the negative example (0.0 to 1.0, where higher values give more importance to avoiding the negative example).

func (*Graph[K]) SearchWithNegatives

func (h *Graph[K]) SearchWithNegatives(near Vector, negatives []Vector, k int, negWeight float32) ([]Node[K], error)

SearchWithNegatives finds the k nearest neighbors from the target node while avoiding vectors similar to the negative examples. The negWeight parameter controls the influence of the negative examples (0.0 to 1.0, where higher values give more importance to avoiding the negative examples).

func (*Graph[K]) Validate

func (g *Graph[K]) Validate() error

Validate checks if the graph configuration is valid. It returns an error if any parameter is invalid.

type GraphQualityMetrics

type GraphQualityMetrics struct {
	// NodeCount is the total number of nodes in the graph.
	NodeCount int

	// AvgConnectivity is the average number of connections per node in the base layer.
	AvgConnectivity float64

	// ConnectivityStdDev is the standard deviation of connections per node.
	ConnectivityStdDev float64

	// DistortionRatio measures how well the graph preserves distances.
	// Lower values indicate better distance preservation.
	DistortionRatio float64

	// LayerBalance measures how well balanced the layers are.
	// Values closer to 1.0 indicate better balance.
	LayerBalance float64

	// GraphHeight is the number of layers in the graph.
	GraphHeight int
}

GraphQualityMetrics contains various metrics that evaluate the quality of the graph.

type Node

type Node[K cmp.Ordered] struct {
	Key   K
	Value Vector
}

Node is a node in the graph.

func MakeNode

func MakeNode[K cmp.Ordered](key K, vec Vector) Node[K]

type SavedGraph

type SavedGraph[K cmp.Ordered] struct {
	*Graph[K]
	Path string
}

SavedGraph is a wrapper around a graph that persists changes to a file upon calls to Save. It is more convenient but less powerful than calling Graph.Export and Graph.Import directly.

func LoadSavedGraph

func LoadSavedGraph[K cmp.Ordered](path string) (*SavedGraph[K], error)

LoadSavedGraph opens a graph from a file, reads it, and returns it.

If the file does not exist (i.e. this is a new graph), the equivalent of NewGraph is returned.

It does not hold open a file descriptor, so SavedGraph can be forgotten without ever calling Save.

func (*SavedGraph[K]) Save

func (g *SavedGraph[K]) Save() error

Save writes the graph to the file.

type Vector

type Vector = []float32

type VectorDistance added in v0.4.0

type VectorDistance[T any] struct {
	Surface vectortypes.Surface[T]
}

VectorDistance is a generic distance calculator that can work with any type that can be mapped to a vectortypes.F32

func NewNodeDistance added in v0.4.0

func NewNodeDistance[K cmp.Ordered](fn DistanceFunc) *VectorDistance[Node[K]]

NewNodeDistance creates a VectorDistance for Node types

func NewVectorDistance added in v0.4.0

func NewVectorDistance[T any](surface vectortypes.Surface[T]) *VectorDistance[T]

NewVectorDistance creates a new VectorDistance with the given surface

func (*VectorDistance[T]) Distance added in v0.4.0

func (vd *VectorDistance[T]) Distance(a, b T) float32

Distance calculates the distance between two vectors

Directories

Path Synopsis
examples
hnsw-extensions
facets
Package facets provides extensions to the HNSW library for faceted search capabilities.
Package facets provides extensions to the HNSW library for faceted search capabilities.
facets/examples/product_search
Package examples provides example implementations of the hnsw-extensions.
Package examples provides example implementations of the hnsw-extensions.
hybrid
Package hybrid provides hybrid index structures that combine HNSW with complementary indexing approaches to overcome limitations in specific scenarios.
Package hybrid provides hybrid index structures that combine HNSW with complementary indexing approaches to overcome limitations in specific scenarios.
meta
Package meta provides extensions to the HNSW library for storing and retrieving JSON metadata alongside vectors.
Package meta provides extensions to the HNSW library for storing and retrieving JSON metadata alongside vectors.
meta/example
Package example provides examples of using the metadata extension for HNSW.
Package example provides examples of using the metadata extension for HNSW.
Package vector provides optimized vector operations for the HNSW library.
Package vector provides optimized vector operations for the HNSW library.
Package vectortypes provides common types for vector operations
Package vectortypes provides common types for vector operations

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL