Python Optimized Most Cosine Similar Vector
I have about 30,000 vectors, and each vector has about 300 elements.
For another vector (with the same number of elements), how can I efficiently find the most (cosine) similar vector?
The following is one implementation using a Python loop:
from time import time
import numpy as np
vectors = np.load("np_array_of_about_30000_vectors.npy")
target = np.load("single_vector.npy")
print(vectors.shape, vectors.dtype)  # (35196, 312) float32
print(target.shape, target.dtype)  # (312,) float32
start_time = time()
max_similarity = -np.inf  # start below any possible similarity
max_index = -1
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))  # 0.466356039047 seconds
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))  # index 2399 with 0.772758982696
The following version, with the Python loop removed, is 44x faster, but it isn't the same computation:
print("starting max dot")
start_time = time()
print(np.max(np.dot(vectors, target)))
print("done with max dot in %s seconds" % (time() - start_time))  # 0.0105748176575 seconds
Is there a way to get the speedup that comes from letting NumPy do the iteration without losing the max-index logic and the division by the product of the norms? For optimizing calculations like this, would it make sense to just do the calculations in C?
python numpy optimization
asked Nov 24 '18 at 6:54
JDiMatteo
2 Answers
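You have the right idea: avoiding the Python loop is what gives you the performance. You can use np.argmin to get the index of the minimum distance.
I would also switch the distance calculation to SciPy's cdist. That way you can compute distances to multiple targets at once and choose from several distance metrics if need be. With the "cosine" metric, cdist returns the cosine distance, which is 1 minus the cosine similarity, so the minimum distance corresponds to the maximum similarity.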
import numpy as np
from scipy.spatial import distance
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance
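The multiple-target case mentioned above could look roughly like the sketch below. The random arrays are only stand-ins for the real data, and the names targets, best_index and best_similarity are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical stand-in data: 1000 candidate vectors and 5 query vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300))
targets = rng.normal(size=(5, 300))

# distances[i, j] is the cosine distance between targets[i] and vectors[j].
distances = distance.cdist(targets, vectors, "cosine")

# For each target, the closest candidate and its cosine similarity.
best_index = np.argmin(distances, axis=1)
best_similarity = 1 - distances[np.arange(len(targets)), best_index]
```

Each row of distances belongs to one target, so taking argmin along axis=1 gives one best match per target without any Python loop.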
HTH.
edited Nov 24 '18 at 17:23
JDiMatteo
answered Nov 24 '18 at 7:32
Deepak Saini
Edit: Hats off to @Deepak. cdist is the fastest if you need the actual computed value. (This snippet uses the vectors and target arrays defined in the setup snippet further down.)
from scipy.spatial import distance
start_time = time()
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
print("done with loop in %s seconds" % (time() - start_time))
max_index = min_index  # smallest cosine distance == largest cosine similarity
max_similarity = 1 - min_distance
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
Output:
done with loop in 0.013602018356323242 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
Setup and the plain max dot product (fast, but not normalized, so it is not the cosine similarity):
from time import time
import numpy as np
vectors = np.random.normal(0, 100, (35196, 300))
target = np.random.normal(0, 100, 300)
start_time = time()
myvals = np.dot(vectors, target)
max_index = np.argmax(myvals)
max_similarity = myvals[max_index]
print("done with max dot in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
Output:
done with max dot in 0.009701013565063477 seconds
Most similar vector to target is index 12187 with 645549.917200941
The original Python loop:
max_similarity = -np.inf  # start below any possible similarity
max_index = -1
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
Output:
done with loop in 0.49567198753356934 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
np.apply_along_axis (it still calls the Python function once per row, so it is no faster than the loop):
def my_func(candidate, target):
    return np.dot(candidate, target) / (np.linalg.norm(candidate) * np.linalg.norm(target))
start_time = time()
out = np.apply_along_axis(my_func, 1, vectors, target)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
Output:
done with loop in 0.7495708465576172 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
Fully vectorized NumPy:
start_time = time()
vnorm = np.linalg.norm(vectors, axis=1)
tnorm = np.linalg.norm(target)  # scalar, broadcasts against vnorm
out = np.matmul(vectors, target) / (vnorm * tnorm)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
Output:
done with loop in 0.04306602478027344 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
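If the same set of vectors is searched many times, one possible refinement of the vectorized version is to normalize the rows once and reuse them, so each query is a single matrix-vector product. This is only a sketch under that assumption: normalize_rows and most_similar are hypothetical helper names, the data is random, and rows of all zeros are assumed not to occur (they would cause a division by zero).

```python
import numpy as np

def normalize_rows(vectors):
    """Return a copy of vectors with every row scaled to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def most_similar(unit_vectors, target):
    """Return (index, cosine similarity) of the row most similar to target."""
    # With unit-length rows and a unit-length target, the dot product
    # is exactly the cosine similarity.
    similarities = unit_vectors @ (target / np.linalg.norm(target))
    index = int(np.argmax(similarities))
    return index, float(similarities[index])

# Hypothetical stand-in data with the same shape as in the timings above.
rng = np.random.default_rng(0)
vectors = rng.normal(0, 100, (35196, 300))
target = rng.normal(0, 100, 300)

unit_vectors = normalize_rows(vectors)  # done once, reused for every query
index, similarity = most_similar(unit_vectors, target)
print("Most similar vector to target is index %s with %s" % (index, similarity))
```

The trade-off is a second array of the same size as vectors held in memory, in exchange for skipping the norm computation on every query.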
edited Nov 24 '18 at 8:21
answered Nov 24 '18 at 8:02
teng