# slice_id

Given a set of expected input byte (or other type) slices (called "strings" hereafter), `slice_id` generates a function that efficiently maps input byte slices to unique integers in the range `[0, n)`, with `n` being the size of the set of expected inputs. Because the set of output integers is contiguous, the results are useful for building general-purpose mapping functions, such as the one provided in `map/map.go`.
`slice_id` only looks at a small subset of values in a given input string; because the set of all possible input values is assumed to be known up-front, `slice_id` is able to identify the minimal set of indices needed to uniquely identify an input. This means that `slice_id` cannot tell whether a string is outside its set. Inputs outside the set will produce an unpredictable result; however, this result will still be contained within the `[0, n)` range described above.
Lookup is between `O(log(n))` and `O(n)`, with `n` being the size of the set of strings being matched against. The length of the strings is not a factor in lookup performance.
Set compilation is... kinda slow. I've done my best to make it reasonable but
haven't quantified it.
## Security Warning
It is important to remember that the output for strings that aren't in the set
of expected strings is undefined. That means that this library is useful for
rapidly parsing long, known-cardinality, low-entropy strings from trusted
sources, but should not be used (or at least, should be paired with another
data structure that performs set membership queries) to make security
decisions based on untrusted input.
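
If you do need to handle untrusted input, one option is to pair the set with an ordinary Go map used purely for membership testing. A minimal sketch, assuming only the `Identify` method shown in the usage example below; the `identifier` interface and `safeIdentify` helper are illustrative, not part of the library:

```go
// identifier is assumed here to match the Identify method shown in the
// usage example; it is not an interface exported by slice_id.
type identifier interface {
	Identify(input []byte) int
}

// safeIdentify rejects inputs that are not in the expected set before
// consulting the slice_id program, so unknown inputs can't be mistaken
// for known ones.
func safeIdentify(set identifier, members map[string]struct{}, input []byte) (int, bool) {
	if _, ok := members[string(input)]; !ok {
		return 0, false // unknown input: slice_id's result would be meaningless
	}
	return set.Identify(input), true
}
```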
## strings

`slice_id` uses a generic definition of `String`, which can be seen in `str/string.go`:

```go
type String[S Symbol] []S

type Symbol interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 | ~uint | ~uint8 | ~uint16 |
		~uint32 | ~uint64 | ~uintptr | ~float32 | ~float64
}
```
## Usage

```go
package main

import (
	"fmt"

	"codeberg.org/alice_systems/slice_id"
	"codeberg.org/alice_systems/slice_id/str"
)

func main() {
	set := slice_id.NewBytes([]str.String[byte]{
		[]byte("example1"),
		[]byte("example2"),
		[]byte("test-string"),
		[]byte("test-strong"),
	})

	fmt.Println(set.Identify([]byte("example1")))    // 0
	fmt.Println(set.Identify([]byte("example2")))    // 1
	fmt.Println(set.Identify([]byte("test-string"))) // 2
	fmt.Println(set.Identify([]byte("test-strong"))) // 3

	m := slice_id.NewMap[string](set)
	m.Set([]byte("example1"), "test-value")
	m.Set([]byte("example2"), "another-test-value")

	fmt.Println(m.Get([]byte("example1"))) // "test-value"
	fmt.Println(m.Get([]byte("example2"))) // "another-test-value"
}
```
## Implementation

`slice_id` builds a finite state machine called a "program", which takes the form of an array of "instructions". Each instruction specifies an index in the input to look up, and a value (a "pivot") to compare the result against. The instruction also specifies a pair of indices in the instruction array to jump to: one for when the value is greater than or equal to the pivot, and one for when it is less.
The jump indices are signed ints, allowing them to serve double duty: they both mark the termination of the program and store the final return value. Storing this value explicitly (instead of using the index of the final instruction, plus an extra bit for the final compare operation) allows us to store arbitrary values, making it trivial to constrain the range of possible outputs. This is desirable for use cases that directly use the returned value as an array index to implement a Map-like data structure.
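
To make this concrete, here is a hypothetical sketch of such an instruction and its evaluation loop, specialized to byte symbols. The field names and the negative-value encoding (`-value - 1`, so that `0` remains a usable output) are assumptions for illustration, not the library's actual representation:

```go
// instruction is a sketch of the format described above, specialized to
// byte symbols for brevity.
type instruction struct {
	index int  // position in the input to inspect
	pivot byte // value the input symbol is compared against
	high  int  // jump target when input[index] >= pivot
	low   int  // jump target when input[index] < pivot
}

// run follows jump targets from the entry point until it reaches a
// negative target, which doubles as the return value: -1 encodes 0,
// -2 encodes 1, and so on. It assumes the input is long enough to be
// indexed, i.e. that it comes from the expected set.
func run(program []instruction, start int, input []byte) int {
	pc := start
	for pc >= 0 {
		ins := program[pc]
		if input[ins.index] >= ins.pivot {
			pc = ins.high
		} else {
			pc = ins.low
		}
	}
	return -pc - 1
}
```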
If you imagine the instructions as nodes in a directed graph, the jump indices as edges, and negative jump indices as edges to an imaginary "end" node, you end up with a binary search tree:

The way these nodes are built up is a pretty simple recursive process (a sketch follows the list):

- If the length of the slice of strings is 1, return a negative number from a monotonically decreasing counter.
- Find the longest common prefix of all the strings in the slice. Call the position immediately following the end of this prefix `i`.
- Find the "best" "pivot" symbol `p`, that is, the symbol with the "balance" value closest to zero. "Balance" is a measure of how evenly a symbol bisects the strings in the slice. Each string's `i`'th symbol is compared to the "pivot" symbol; if the pivot symbol is greater than or equal to the string's symbol, the balance is incremented by one; if the pivot symbol is less than the string's symbol, the balance is decremented. A balance of zero means comparing the pivot to each string's symbol at `i` will divide the slice perfectly in half. `slice/Slice.FindPivot` implements a binary search to find this pivot in linear time with respect to the Symbol length.
  - Optionally, repeat this step for a number of subsequent indices, defined by the Lookahead option. In this case, `i` is set to the index that produces the best pivot symbol, and `p` is set to that pivot.
- Split the slice of strings into 2 subslices, with the "high" slice containing the strings whose `i`'th symbol is greater than or equal to the pivot symbol, and the "low" slice containing the remaining strings. `slice/Slice` implements a virtual subslice, which is actually just a pointer to a single shared slice containing all the strings, plus a bitmap that tracks which strings are considered present in the virtual subslice. This means the splitting operation only requires copying a pointer and the bitmap.
- Create a new Instruction.
- Set the Instruction's `index` field to `i`.
- Set the Instruction's `pivot` field to `p`.
- For each of the "high" and "low" subslices, repeat this process from the beginning. Use the respective return values from these repetitions as the values for the `high` and `low` fields of the Instruction.
- Append the Instruction to the "program" array, and return its index.
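
Here is that recursion sketched out, reusing the hypothetical `instruction` type from the earlier sketch. It operates on plain `[][]byte` slices instead of the library's virtual subslices, uses a brute-force pivot search instead of `slice/Slice.FindPivot`, skips the Lookahead option, and assumes the strings are distinct and that no string is a prefix of another:

```go
// build compiles strs into program and returns the entry point to pass
// to run: a non-negative instruction index, or a negative value directly
// encoding the result for a single-string set.
func build(program *[]instruction, strs [][]byte, next *int) int {
	if len(strs) == 1 {
		*next-- // monotonically decreasing counter: -1, -2, ...
		return *next
	}
	i := longestCommonPrefix(strs)
	p := findPivot(strs, i)
	var high, low [][]byte
	for _, s := range strs {
		if s[i] >= p {
			high = append(high, s)
		} else {
			low = append(low, s)
		}
	}
	ins := instruction{index: i, pivot: p}
	ins.high = build(program, high, next)
	ins.low = build(program, low, next)
	*program = append(*program, ins)
	return len(*program) - 1
}

// longestCommonPrefix returns the first index at which the strings differ.
func longestCommonPrefix(strs [][]byte) int {
	for i := 0; ; i++ {
		for _, s := range strs {
			if i >= len(s) || s[i] != strs[0][i] {
				return i
			}
		}
	}
}

// findPivot brute-forces the symbol at index i whose balance is closest
// to zero, skipping the minimum symbol so neither half of the split is
// empty. (The library's FindPivot is smarter than this.)
func findPivot(strs [][]byte, i int) byte {
	lowest := strs[0][i]
	for _, s := range strs {
		if s[i] < lowest {
			lowest = s[i]
		}
	}
	best, bestBal := byte(0), len(strs)+1
	for _, cand := range strs {
		p := cand[i]
		if p == lowest {
			continue // the "low" half would be empty
		}
		bal := 0
		for _, s := range strs {
			if p >= s[i] {
				bal++
			} else {
				bal--
			}
		}
		if bal < 0 {
			bal = -bal
		}
		if bal < bestBal {
			best, bestBal = p, bal
		}
	}
	return best
}
```

Calling `build` with an empty program and a counter starting at zero yields an entry point for `run`; each input string then maps to a unique value in `[0, n)`.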
## Possible Improvements

Please note that this section is a bit rambly and speculative. It's also a bit out of date, as some less dramatic changes ended up improving tree balance significantly (at the cost of increased Set compilation time).
### Tree Balance
The binary search tree is poorly balanced. It remains unclear whether it is
possible to apply tree rotations to the structure while maintaining
correctness; thus it is unclear whether it is possible to use a standard
tree-balancing algorithm to re-balance the tree.
The reason why this is not obvious is that each instruction bisects the subset
of possible values provided to it by the upstream instruction. The instruction
was only selected to bisect this particular set; if the depth of the tree is
reduced, some instructions will instead be bisecting a different, larger set
than the one they were originally selected for. It seems like it would be
a matter of chance whether the instruction would happen to also bisect this
other set. At a minimum, it would likely be necessary to re-search for pivots
for any nodes downstream of a rotation.
The pivots themselves may also just be suboptimal. Currently, the index used for the comparison is identified by simply finding the longest common prefix of the set of strings being bisected. Once this index is found, the change in value at this index closest to the middle of the set is chosen as the place to perform the comparison. Perhaps it would be better to search a few different indices beyond this first position, and select the one whose pivot is closest to the middle. This would increase the cost of compiling the program, though, so it would probably make sense to provide a "depth" parameter to allow the user to tune the number of alternative pivot indices considered depending on their performance requirements.
Another option would be to treat the strings as bitmaps. Instead of indexing words within the string to perform comparisons on, perhaps it would be simpler to index bits within the string; instead of the program branching on the result of a comparison operation, the program branches on the value of the bit at that index. Combining this with the concept of considering indices beyond the first non-uniform index also raises the possibility of creating non-obvious axes of bisection, which could be useful for improving balance. This could also reduce the instruction size: the index field has to be larger to hold a bit-level instead of word-level position, but in exchange the "pivot" value is removed entirely.
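
A hypothetical bit-level instruction might look something like this (a sketch of the idea, not anything the library implements):

```go
// bitInstruction drops the pivot entirely: the branch is taken on a
// single bit of the input.
type bitInstruction struct {
	bit  int // bit position within the input: byteIndex*8 + bitOffset
	high int // jump target when the bit is 1
	low  int // jump target when the bit is 0
}

// bitAt extracts the addressed bit from the input.
func bitAt(input []byte, bit int) byte {
	return (input[bit/8] >> (bit % 8)) & 1
}
```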
Another option: remove the sorting requirement. This would complicate `findPivot`; instead of scanning outwards from the middle of the slice, it would perform a binary search. It starts with a pivot value of `n/2` (with `n` being the maximum value of the symbol type), and counts how many values at the pivot index are greater than or equal to it. Depending on the bias of the values (either more values above or more values below), it then chooses a new pivot at either `n/4` or `3n/4`, and continues recursively until either the number of values above and below are equal, or the number of steps equals the number of bits in the symbol type. Note that this wouldn't just be necessary to remove the sorting requirement; it would also be needed for any searches beyond the first non-common index.
This option would be expensive as splitting the slice would no longer be
a simple matter of slicing it; instead it would need to go through the slice
and select values for one or the other sub-slice based on the pivot identified.
This cost may be partially offset by the removal of the sort operation
up-front.
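
Here is a sketch of that binary search over pivot values, for byte symbols; again, this is the proposed replacement, not the current `findPivot`:

```go
// findPivotUnsorted binary-searches the value range of the symbol type
// for a pivot that balances the strings at index i, without requiring
// the slice to be sorted. For bytes this takes at most 8 steps.
func findPivotUnsorted(strs [][]byte, i int) byte {
	lo, hi := 0, 256 // search the full byte value range
	for lo+1 < hi {
		p := (lo + hi) / 2
		above := 0
		for _, s := range strs {
			if int(s[i]) >= p {
				above++
			}
		}
		below := len(strs) - above
		switch {
		case above == below:
			return byte(p)
		case above > below:
			lo = p // pivot too low: too many symbols at or above it
		default:
			hi = p // pivot too high
		}
	}
	return byte(lo)
}
```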
Random unrelated idea: treat the symbol indices as modulo the string length. That way, shorter strings are not excluded from comparisons at higher indices.
### Orthogonal Comparisons

This points to a more fundamental limitation: the comparison operations are not independent. This algorithm is essentially a binary search; the strings are ordered along a single dimension, and this one-dimensional array is repeatedly bisected until the size reaches 1.

However, with the exact same instructions, it should be possible to instead arrange the set of possible inputs into an n-dimensional space (where `n = log2(|inputs|)`), with `n` orthogonal axes that are each bisected by a single instruction. However, it seems inevitable that there are sets of strings that cannot be bisected along entirely orthogonal axes, which would then require a fallback to partitioning the space. This would increase the program size and number of operations above the theoretical minimum. An obvious example of this would be a set of mostly short strings, with a few long strings that all have a common prefix longer than the short strings. There is no axis along which you can bisect the set of long strings that would even touch the short strings. Solving this obvious case would require a more creative way to index into strings of variable length, but this wouldn't fix subtler examples.
Implementing this would require thinking about the problem in a completely different way from the Trie-inspired method the current compiler uses (implemented in `program/program.go`). I currently lack the theoretical knowledge to attempt anything beyond a brute-force implementation of such a method, but it appears to be a very promising direction to explore in the future, as it would remove any ordering requirements from the comparisons, making the prospect of on-the-fly modifications to the program more tenable.
Update: this would require a different instruction type to fully realize the benefits; fully orthogonal terms do not need to be arranged into a tree, they can instead be evaluated sequentially, with each term producing one bit of output. The orthogonality benefits could be partially realized with the existing instruction format, by assembling the terms into a tree, but without adding some kind of storage to the state machine, the tree would necessarily have redundant nodes.
Simple machine change: if the instruction's two jump targets are equal, this tells the machine to add one bit of state to its storage, containing the result of the comparison. Comparisons would then either provide a branching path or an orthogonal bit of state. However, unless these orthogonal axes are all perfectly balanced, the contiguous output range is lost.
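
Sketched against the hypothetical `instruction` type from earlier, the change might look like the following; how the collected state bits would combine with the tree's return value is exactly the open question about the output range:

```go
// runWithState treats an instruction whose jump targets are equal as an
// "orthogonal" comparison: the result is recorded as a bit of state and
// execution falls through to the shared target.
func runWithState(program []instruction, start int, input []byte) (tree int, state uint) {
	pc, bit := start, uint(0)
	for pc >= 0 {
		ins := program[pc]
		ge := input[ins.index] >= ins.pivot
		if ins.high == ins.low {
			if ge {
				state |= 1 << bit // one orthogonal bit of output
			}
			bit++
			pc = ins.high
			continue
		}
		if ge {
			pc = ins.high
		} else {
			pc = ins.low
		}
	}
	return -pc - 1, state
}
```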
### Synthesis of these ideas

- We look at bits rather than words.
- We use longest-common-prefix to identify a starting index to look for approximately-bisecting bits.
- We then linearly scan forward, finding additional bits that bisect all sub-slices. Perhaps we use longest-common-prefix on sub-slices as a heuristic to find promising sub-slices.

A worse idea:

- Scan through all bits of all slices, creating a "balance" metric for each position.
- Sort the indices by balance.
- Choose the highest-balance bit for the first cut, then linearly scan forward through this sorted list as described above.
- This performs very poorly for sets of mostly short strings with a few long ones with long shared prefixes, as any bit position that uniquely identifies those strings will have a low balance.
- Also: balance doesn't in any way represent orthogonality to the currently selected criteria. The usefulness of this balance metric rapidly diminishes as more criteria are chosen and the orthogonality constraint gets more demanding. (Is this true? It still prevents the compiler from considering low-balance indices that are always a bad idea too early.)

Abstractly, what we probably want to do is improve the orthogonality of the instructions, while preserving a notion of not all instructions being orthogonal.

Perhaps an initial step could perform a somewhat lazy search for fully orthogonal splits, which are then arranged into a "base" set of instructions, followed by a set of trees.