Complete the renaming of the package from 'LSH.jl' to 'LSHFunctions.jl'.
The Julia General registry requires that module names (a) be at least five letters long and (b) end in a lowercase letter, except with admin approval. It may eventually be possible to alias 'LSH' to 'LSHFunctions' or some such, but for now the best course of action is simply to rename the module to LSHFunctions. This has the additional benefit of establishing a naming scheme for future packages, e.g. LSHTables.
Squashed commit of the following:
commit 2f6284f
Author: kernelmethod <[email protected]>
Date: Mon Jan 20 14:58:29 2020 -0700
Change from the LSH module to LSHFunctions in the documentation.
commit 89dbc40
Author: kernelmethod <[email protected]>
Date: Mon Jan 20 14:30:07 2020 -0700
Remove remaining references to the LSH package/module, and replace them with references to the LSHFunctions package/module.
commit 7531cb1
Author: kernelmethod <[email protected]>
Date: Mon Jan 20 14:27:25 2020 -0700
Remove usages of the LSH module/package in the tests, and replace them with LSHFunctions.
commit b754fc7
Author: kernelmethod <[email protected]>
Date: Mon Jan 20 14:19:04 2020 -0700
Change the register_similarity! macro to generate LSHFunctions.LSHFunction and LSHFunctions.lsh_family, rather than LSH.LSHFunction and LSH.lsh_family.
commit f5b80bc
Author: kernelmethod <[email protected]>
Date: Mon Jan 20 14:15:32 2020 -0700
Rename LSH.jl to LSHFunctions.jl.
--- a/docs/src/faq.md
+++ b/docs/src/faq.md
Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ The reason for computing multiple hashes is that every LSH function provides (at
 In fact, the situation can be much more dire than that. If your data are highly structured, it is likely that each of your hashes will place data points into a tiny handful of buckets -- even just one bucket. For instance, in the snippet below we have a dataset of 100 points that all have very high cosine similarity with one another. If we only create a single hash function when we call [`SimHash`](@ref), then it's very likely that all of the data points will have the same hash.
 julia> data = ones(10, 100); # Each column is a data point
@@ -24,7 +24,7 @@ julia> unique(hashes)
 The solution to this is to generate multiple hash functions, and combine each of the hashes we compute for an input into a single key. In the snippet below, we create 20 hash functions with [`SimHash`](@ref). Each hash computed in `map(x -> hashfn(x), eachcol(data))` is a length-20 `BitArray`.
 If you want to know what hash function will be created for a given similarity, you can use [`lsh_family`](@ref):
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> lsh_family(jaccard)
 MinHash
@@ -77,7 +77,7 @@ LSHFunctions.jl provides a few common utility functions that you can use across
 - [`n_hashes`](@ref): returns the number of hash functions computed by an [`LSHFunction`](@ref).
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = LSHFunction(jaccard);
 julia> n_hashes(hashfn)
@@ -96,7 +96,7 @@ julia> length(hashes)
 - [`similarity`](@ref): returns the similarity function for which the input [`LSHFunction`](@ref) is locality-sensitive:
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = LSHFunction(cossim);
 julia> similarity(hashfn)
@@ -105,7 +105,7 @@ cossim (generic function with 2 methods)
 - [`hashtype`](@ref): returns the type of hash computed by the input hash function. Note that in practice `hashfn(x)` (or [`index_hash(hashfn,x)`](@ref) and [`query_hash(hashfn,x)`](@ref) for an [`AsymmetricLSHFunction`](@ref)) will return an array of hashes, one for each hash function you generated. [`hashtype`](@ref) is the data type of each element of `hashfn(x)`.
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = LSHFunction(cossim, 5);
 julia> hashtype(hashfn)
@@ -122,7 +122,7 @@ true
 - [`collision_probability`](@ref): returns the probability of collision for two inputs with a given similarity. For instance, the probability that a single MinHash hash function causes a collision between inputs `A` and `B` is equal to [`jaccard(A,B)`](@ref jaccard):
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = MinHash();
 julia> A = Set(["a", "b", "c"]);
@@ -137,7 +137,7 @@ true
 We often want to compute the probability that not just one hash collides, but that multiple hashes collide simultaneously. You can calculate this using the `n_hashes` keyword argument. If left unspecified, then [`collision_probability`](@ref) will use [`n_hashes(hashfn)`](@ref n_hashes) hash functions to compute the probability.
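Reviewer note: the multi-hash collision probability described in the context line above can be checked by hand. The following is only a sketch in plain Julia that does not call LSHFunctions.jl; the helper names `jaccard_sketch` and `joint_collision_probability` and the set `B` are hypothetical, chosen for illustration. It assumes the hash functions are sampled independently, so that the joint probability is the single-hash probability raised to the number of hashes.

```julia
# Jaccard similarity |A ∩ B| / |A ∪ B|, written out by hand so that this
# snippet does not depend on LSHFunctions.jl being installed.
# (Hypothetical helper; not the package's `jaccard`.)
jaccard_sketch(A::Set, B::Set) = length(intersect(A, B)) / length(union(A, B))

# Probability that `n` independently sampled MinHash functions all collide
# on (A, B). Each one collides with probability jaccard(A, B), so the joint
# probability is that value raised to the n-th power.
joint_collision_probability(A::Set, B::Set, n::Integer) = jaccard_sketch(A, B)^n

A = Set(["a", "b", "c"])
B = Set(["b", "c", "d"])   # illustrative second set, not taken from the docs

p1  = joint_collision_probability(A, B, 1)    # single hash
p10 = joint_collision_probability(A, B, 10)   # ten hashes at once

println("single hash: ", p1)
println("ten hashes:  ", p10)
```

Here `A` and `B` share two of four distinct elements, so the single-hash probability is 0.5 and the ten-hash probability is 0.5^10. Two empty sets would give 0/0 in this sketch, so a real implementation would need to treat that case separately.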
--- a/docs/src/similarities/cosine.md
+++ b/docs/src/similarities/cosine.md
Lines changed: 7 additions & 7 deletions
@@ -13,7 +13,7 @@ Concretely, cosine similarity is computed as
 where ``\left\langle\cdot,\cdot\right\rangle`` is an inner product (e.g., dot product) and ``\|\cdot\|`` is the norm derived from that inner product. ``\text{cossim}(x,y)`` goes from ``-1`` to ``1``, where ``-1`` corresponds to low similarity and ``1`` corresponds to high similarity. To calculate cosine similarity, you can use the [`cossim`](@ref) function exported from the `LSH` module:
 ```jldoctest
-julia> using LSH, LinearAlgebra
+julia> using LSHFunctions, LinearAlgebra
 julia> x = [5, 3, -1, 1]; # norm(x) == 6
@@ -29,7 +29,7 @@ true
 ## SimHash
 *SimHash*[^1][^2] is a family of LSH functions for hashing with respect to cosine similarity. You can generate a new hash function from this family by calling [`SimHash`](@ref):
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = SimHash();
 julia> n_hashes(hashfn)
@@ -43,7 +43,7 @@ julia> n_hashes(hashfn)
 Once constructed, you can start hashing vectors by calling `hashfn(x)`:
 # x and y have high cosine similarity since they point in the same direction
@@ -65,7 +65,7 @@ true
 Note that [`SimHash`](@ref) is a one-bit hash function. As a result, `hashfn(x)` returns a `BitArray`:
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = SimHash();
 julia> n_hashes(hashfn)
@@ -82,7 +82,7 @@ julia> length(hashes)
 Since a single-bit hash doesn't do much to reduce the cost of similarity search, you usually want to generate multiple hash functions at once. For instance, in the snippet below we sample 10 hash functions, so that `hashfn(x)` is a length-10 `BitArray`:
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = SimHash(10);
 julia> n_hashes(hashfn)
@@ -101,10 +101,10 @@ The probability of a hash collision (for a single hash) is
 where ``\theta = \text{arccos}(\text{cossim}(x,y))`` is the angle between ``x`` and ``y``. This collision probability is shown in the plot below.
 ```@eval
-using PyPlot, LSH
+using PyPlot, LSHFunctions
 hashfn = SimHash()
 x = range(-1, 1; length=1024)
-y = [LSH.single_hash_collision_probability(hashfn, xii) for xii in x]
+y = [LSHFunctions.single_hash_collision_probability(hashfn, xii) for xii in x]
 plot(x, y)
 title("Probability of hash collision for SimHash")
 If our [`MinHash`](@ref) struct keeps track of `N` hash functions simultaneously, then the probability of collision is `jaccard(A,B)^N`:
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = MinHash(10);
 julia> A = Set(["a", "b", "c"]);
@@ -153,7 +153,7 @@ Computes the probability of a hash collision between two inputs `x` and `y` for
 # Examples
 The following snippet computes the probability of collision between two sets `A` and `B` for a single MinHash. For MinHash, this probability is just equal to the Jaccard similarity between `A` and `B`.
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = MinHash();
 julia> A = Set(["a", "b", "c"]);
@@ -171,7 +171,7 @@ true
 We can use the `n_hashes` argument to specify the probability that `n_hashes` MinHash hash functions simultaneously collide. If left unspecified, then we'll simply use `n_hashes(hashfn)` as the number of hash functions:
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = MinHash(10);
 julia> A = Set(["a", "b", "c"]);
@@ -200,7 +200,7 @@ Returns the similarity function that `hashfn` hashes on.
 - `hashfn::AbstractLSHFunction`: the hash function whose similarity we would like to retrieve.
 # Examples
-```jldoctest; setup = :(using LSH)
+```jldoctest; setup = :(using LSHFunctions)
 julia> hashfn = LSHFunction(cossim);
 julia> similarity(hashfn) == cossim
@@ -220,7 +220,7 @@ function similarity end
 Returns the type of hash generated by a hash function.