slice_id

package module
v0.0.0-...-9033d6b Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Dec 8, 2024 License: AGPL-3.0 Imports: 8 Imported by: 0

README

slice_id

Go Reference

Given a set of expected input byte (or other type) slices (called "strings" hereafter), slice_id generates a function that efficiently maps input byte slices to unique integers in the range [0, n), with n being the size of the set of expected inputs. The fact that the set of output integers are contiguous makes them useful for building general purpose mapping functions, such as the one provided in map/map.go.

slice_id only looks at a small subset of values in a given input string; because the set of all possible input values is assumed to be known up-front, the slice_id is able to identify the minimal set of indices needed to uniquely identify an input. This means that input slice_id is not able to identify if a string is not in its set. Inputs outside the set will produce an unpredictable result, however, this result will still be contained within the [0, n) range described above.

Lookup is between O(n) and O(log(n)), with n being the size of the set of strings being matched against. The length of the strings is not a factor in lookup performance.

Set compilation is... kinda slow. I've done my best to make it reasonable but haven't quantified it.

Security Warning

It is important to remember that the output for strings that aren't in the set of expected strings is undefined. That means that this library is useful for rapidly parsing long, known-cardinality, low-entropy strings from trusted sources, but should not be used (or at least, should be paired with another data structure that performs set membership queries) to make security decisions based on untrusted input.

strings

slice_id uses a generic definition of String, which can be seen in str/string.go:

type String[S Symbol] []S

type Symbol interface {
  ~int | ~int8 | ~int16 | ~int32 | ~int64 | ~uint | ~uint8 | ~uint16 |
  ~uint32 | ~uint64 | ~uintptr | ~float32 | ~float64
}

Usage

package main

import (
  "fmt"

  "codeberg.org/alice_systems/slice_id"
  "codeberg.org/alice_systems/slice_id/str"
)

func main() {
  set := slice_id.NewBytes([]str.String[byte]{
    []byte("example1"),
    []byte("example2"),
    []byte("test-string"),
    []byte("test-strong"),
  })
  fmt.Println(set.Identify([]byte("example1")))    // 0
  fmt.Println(set.Identify([]byte("example2")))    // 1
  fmt.Println(set.Identify([]byte("test-string"))) // 2
  fmt.Println(set.Identify([]byte("test-strong"))) // 3
  m := slice_id.NewMap[string](set)
  m.Set([]byte("example1"), "test-value")
  m.Set([]byte("example2"), "another-test-value")
  fmt.Println(m.Get([]byte("example1"))) // "test-value"
  fmt.Println(m.Get([]byte("example2"))) // "another-test-value"
}

Implementation

slice_id builds a finite state machine called a "program", that takes the form of an array of "instructions". Each instruction specifies an index in the input to look up, and a value (a "pivot") to compare the result to. The instruction also specifies a pair of indices in the instruction array to jump to, one for if the value is greater than or equal to the pivot, and one for if it is less.

The jump indices are signed ints, allowing them to serve double-duty to mark the termination of the program, as well as store the final return value. Storing this value explicitly (instead of using the index of the final instruction, plus an extra bit for the final compare operation) allows us to store arbitrary values, making it trivial to constrain the range of possible outputs. This is desirable for use cases that directly use the returned value as an array index to implement a Map-like data structure.

If you imagine the instructions as nodes in a directed graph, and the jump indices as edges, and negative jump indices as edges to an imaginary "end" node, you end up with a binary search tree:

The way these nodes are built up is a pretty simple recursive process:

  • If the length of the slice of strings is 1, return a negative number from a monotonically decreasing counter.
  • Find the longest common prefix between all the strings in the slice. Call the position immediately following the end of this prefix i.
  • Find the "best" "pivot" symbol p, that is, the symbol with the "balance" value closest to zero. "Balance" is a measure of how nicely a symbol bisects the strings in the slice. Each string's i'th symbol is compared to the "pivot" symbol; if the pivot symbol is greater than or equal to the string's symbol, the balance is incremented by one; if the pivot symbol is less than the string's symbol, the balance is decremented. If the balance is equal to zero, this means comparing the pivot to each string's symbol at i will divide the slice perfectly in half.
    • slice/Slice.FindPivot implements a binary search to find this pivot in linear time with respect to the Symbol length
    • optionally, repeat this step for a number of subsequent indices, defined by the Lookahead option. In this case, the i is set to index that produces the best pivot symbol, and p is set to that pivot.
  • Split the slice of strings into 2 subslices, with the "high" slice containing the strings whose i'th symbol is greater than or equal to the pivot symbol, and the "low" slice containing the remaining strings. slice/Slice implements a virtual subslice, which is actually just a pointer to a single shared slice containing all the strings, and a bitmap that tracks which slices to consider present in the virtual subslice. This means the splitting operation only requires copying a pointer and the bitmap.
  • Create a new Instruction.
    • Set the Instruction's index field to i.
    • Set the Instruction's pivot field to p.
  • For each of the "high" and "low" subslices, repeat this process from the beginning. Use the respective return value from these repetitions as the values for the high and low fields of the Instruction.
  • Append the Instruction to the "program" array, and return its index.
Possible Improvements

Please note that this section is a bit rambly and speculative. It's also a bit out of date, as some less dramatic changes ended up improving tree balance significantly (at the cost of increased Set compilation time).

Tree Balance

The binary search tree is poorly balanced. It remains unclear whether it is possible to apply tree rotations to the structure while maintaining correctness; thus it is unclear whether it is possible to use a standard tree-balancing algorithm to re-balance the tree.

The reason why this is not obvious is that each instruction bisects the subset of possible values provided to it by the upstream instruction. The instruction was only selected to bisect this particular set; if the depth of the tree is reduced, some instructions will instead be bisecting a different, larger set than the one they were originally selected for. It seems like it would be a matter of chance whether the instruction would happen to also bisect this other set. At a minimum, it would likely be necessary to re-search for pivots for any nodes downstream of a rotation.

The pivots themselves may also just be suboptimal. Currently, the index used for the comparison is identified by simply finding the longest common prefix of the set of strings being bisected. Once this index is found, the change in value of this index closet to the middle of the set is chosen as the place to perform the comparison. Perhaps it would be better to search a few different indices, beyond this first position, and select the one whose pivot is closest to the middle. This would increase the cost of compiling the program though, so it would probably make sense to provide a "depth" parameter to allow the user to tune the number of alternative pivot indices considered depending on their performance requirements.

Another option would be to treat the strings as bitmaps. Instead of indexing words within the string to perform comparisions on, perhaps it would be simpler to index bits within the string; instead of the program branching on the result of a comparison operation, the program branches on the value of the bit at that index. Combining this with the concept of considering indices beyond the first non-uniform index also raises the possibility of creating non-obvious axes of bisection, which could be useful for improving balancing. This also could reduce the instruction size: the index size has to be larger to contain the bit-level instead of word-level position, but in exchange the "pivot" value is removed entirely.

Another option: remove the sorting requirement. This would complicate findPivot; instead of scanning outwards from the middle of the slice, it would instead perform a binary search. It starts with a pivot value of n/2 (with n = the maximum value of the symbol type), and counts how many values at the pivot index are greater than or equal to it. Depending on the bias of the values (either more values above or more values below) it then chooses a new pivot at either n/4 or 3n/4, and continues on recursively until either the number of values above and below are equal, or the number of steps is equal to the number of bits in the symbol type. Note that this wouldn't just be necessary to remove the sorting requirement, it would also be needed for any searches beyond the first non-common index.

This option would be expensive as splitting the slice would no longer be a simple matter of slicing it; instead it would need to go through the slice and select values for one or the other sub-slice based on the pivot identified. This cost may be partially offset by the removal of the sort operation up-front.

Random unrelated idea: treat the symbol indices as modulo length. That way, shorter strings are not excluded from comparisons at higher indices.

Orthogonal Comparisons

This points to a more fundamental limitation: the comparison operations are not independent. This algorithm is essentially a binary search; the strings are ordered along a single dimension and this one-dimensional array is repeatedly bisected until the size reaches 1.

However, with the exact same instructions, it should be possible to instead arrange the set of possible inputs into an n-dimensional space (where n = log2(|inputs|)), with n orthogonal axes that are each bisected by a single instruction. However, it seems inevitable that there are sets of strings that cannot be bisected along entirely orthogonal axes, which would then require a fallback to partitioning the space. This would increase the program size and number of operations above the theoretical minimum. An obvious example of this would be a set of mostly short strings, with a few long strings that all have a common prefix longer than the short strings. There is no axis along which you can bisect the set of long strings that would even touch the short strings. Solving this obvious case would this limitation would require a more creative way to index into strings of variable length, but this wouldn't fix subtler examples.

Implementing this would require thinking about the problem in a completely different way from the Trie-inspired method the current compiler uses (implemented in program/program.go). I currently lack the theoretical knowledge to attempt anything beyond a brute-force implementation of such a method, but it appears to be a very promising direction to explore in the future, as this would remove any ordering requirements from the comparisons, making the prospect of on-the-fly modifications to the program more tenable.

Update: this would require a different instruction type to fully realize the benefits; fully orthogonal terms do no need to be arranged into a tree, they can instead just be evaluated sequentially with each term producing one bit of output. The orthogonality benefits could be partially realized with the existing instruction format, by assembling them into a tree, but without adding any kind of storage to the state machine the tree would necessarily have redundant nodes.

Simple machine change: if the instruction's two jump targets are equal, this tells the machine to add one bit of state to its storage, containing the result of the comparison. This means that comparisons either provide a branching path, or an orthogonal bit of state. However unless these orthogonal axes are all perfectly balanced, the contiguous output range is lost.

Synthesis of these ideas
  • We look at bits rather than words
  • We use longest-common-prefix to identify a starting index to look for approximately-bisecting bits
  • We then linearly scan forward, finding additional bits that bisect all sub-slices. Perhaps we use longest-common-prefix on sub-slices as a heuristic to find promising sub-slices.

worse idea:

  • scan through all bits of all slices, creating a "balance" metric for each position
  • sort the indices by balance
  • choose the highest-balance bit for the first cut, linearly scan forward through this sorted list as described above
  • This performs very poorly for sets of mostly short strings with a few long ones with long shared prefixes: as any bit position that uniquely identifies these strings will have a low balance
  • also: balance doesn't in any way represent orthogonality to the currently selected criteria. The usefulness of this balance metric rapidly diminishes as more criteria are chosen and the orthogonality constraint gets more intensive (is this true? it still prevents the compiler from considering low-balance indices that are always a bad idea too early)

Abstractly, what we probably want to do is improve the orthogonality of the instructions, while preserving a notion of not all instructions being orthogonal.

Perhaps an initial step that performs a somewhat lazy search for fully orthogonal splits, which are then arranged into a "base" set of instructions, followed by a set of trees.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Map

type Map[S str.Symbol, K str.String[S], V any] interface {
	Get(K) V
	Set(K, V)
}

Map implements a basic mapping type, using a Set to translate str.String keys to indices in a pre-allocated slice of values.

func NewMap

func NewMap[V any, S str.Symbol, K str.String[S]](s Set[S, K]) Map[S, K, V]

NewMap creates a Map using a Set for key lookup. The Set is not copied, allowing multiple Maps to share a single Set with minimal overhead. The values are stored in a slice that is allocated up-front.

type Set

type Set[S str.Symbol, T str.String[S]] interface {
	// Identify returns a unique integer associated with the string from the Set
	// that the input is equal to. The number will always be in the range [0,
	// Length()), where n is the number of strings in the Set. This means the set
	// of all possible output numbers is contiguous, allowing it to be directly
	// used to index an array, which is useful for implementing mapping types.
	//
	// If the input is not equal to any of the strings represented by the Set,
	// the return value is undefined, but still bound within the range described.
	Identify(b T) int
	// Length() returns the number of strings represented by the Set.
	Length() int
	// Graph writes a Graphviz representation of the underlying mechanism used to
	// identify strings. This is for development and debugging purposes, and is
	// not expected to be useful for applications.
	Graph(w io.Writer, cluster bool, ports bool) error
	// Write writes a compact binary representation of the Set to a given
	// [io.Writer].
	Write(io.Writer) error
}

Set represents a set of possible str.String values to match an input against.

func Load

func Load[T str.Symbol](r io.Reader) (Set[T, str.String[T]], error)

Load parses a binary representation of a Set from an io.Reader, as produced by [Set.Write], and generates a new Set.

func New

func New[Y str.Symbol](v []str.String[Y], o ...options.Option) Set[Y, str.String[Y]]

Create a new Set that represents a given collection of str.String values. See program.New and program.Program for details about the implementation.

func NewBytes

func NewBytes(b []str.String[byte], o ...options.Option) Set[byte, str.String[byte]]

Create a new Set that represents a given collection of `[]byte` values.

Directories

Path Synopsis
examples
Package slice implements some useful functions for manipulating slices of str.String values, and finding "pivots" (specific str.Symbol values at specific indices for which comparison operations approximately bisect the slice).
Package slice implements some useful functions for manipulating slices of str.String values, and finding "pivots" (specific str.Symbol values at specific indices for which comparison operations approximately bisect the slice).

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL