Gradient checking in neural network with dot product
I was taking the 2nd course of the deeplearning.ai specialization on Coursera. I was watching a video on gradient checking for neural networks. After we compute the gradient vector and the approximated gradient vector as shown here, why is the strange formula
$$\text{difference} = \frac{\| \text{grad} - \text{gradapprox} \|_2}{\| \text{grad} \|_2 + \| \text{gradapprox} \|_2} \tag{3}$$
used to measure how similar the two vectors are?
Why not use cosine similarity?
vectors neural-networks
asked Dec 4 '18 at 15:39 by KAY_YAK, edited Dec 4 '18 at 16:36
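For concreteness, here is a minimal NumPy sketch (my own illustration, not from the course; the function names are mine) of formula (3) next to the cosine similarity the question asks about, so the two checks can be compared directly.

```python
import numpy as np

def relative_difference(grad, gradapprox):
    """Formula (3): ||grad - gradapprox||_2 / (||grad||_2 + ||gradapprox||_2)."""
    return np.linalg.norm(grad - gradapprox) / (np.linalg.norm(grad) + np.linalg.norm(gradapprox))

def cosine_similarity(a, b):
    """Standard cosine similarity, shown for comparison; undefined if either vector is zero."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy example: an "analytic" gradient and a slightly perturbed "numerical" estimate.
rng = np.random.default_rng(0)
grad = np.array([0.3, -1.2, 0.7])
gradapprox = grad + 1e-8 * rng.standard_normal(3)

print(relative_difference(grad, gradapprox))  # tiny (~1e-8), so backprop looks correct
print(cosine_similarity(grad, gradapprox))    # ~1.0, but would stay ~1.0 even if grad were rescaled
```

Note that cosine similarity only compares directions, which is one reason, discussed in the answers below, that it is not the check used here.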
2 Answers
Probably to avoid division-by-zero errors. As we approach a point where the gradient is zero, the last step may have a vector whose length rounds to $0$. That's not a problem here as long as only one of the two norms does, since the denominator stays positive. You can of course write the formula in terms of the usual cosine similarity (I'll leave that as an exercise). It's also natural to subtract one vector from the other elsewhere in gradient descent, so you can recycle a cached value.
answered Dec 4 '18 at 16:06 by J.G.
Well, I can just place some checks on the lengths. Also, this form is not so intuitive. – KAY_YAK, Dec 4 '18 at 16:39
Firstly, that makes for messier code. Secondly, caverac's answer points out that we want to check whether the vectors themselves are close together, not whether their scaled-to-length-$1$ counterparts are. Thirdly, although learning about the dot product may make cosine similarity intuitive in certain contexts, the formula you've asked about is arguably more intuitive to someone with limited geometric knowledge, and it is also easier to motivate in the context of gradient descent. – J.G., Dec 4 '18 at 17:18
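To illustrate the division-by-zero point from this answer, here is a small sketch (my own example, not from the thread): when one of the two gradients rounds to the zero vector, formula (3) still returns a finite value, whereas cosine similarity would divide by zero.

```python
import numpy as np

def relative_difference(grad, gradapprox):
    # Formula (3): the denominator vanishes only if *both* vectors are zero.
    return np.linalg.norm(grad - gradapprox) / (np.linalg.norm(grad) + np.linalg.norm(gradapprox))

grad = np.zeros(3)                         # analytic gradient rounds to zero near a stationary point
gradapprox = np.array([1e-12, 0.0, 0.0])   # numerical estimate is tiny but nonzero

print(relative_difference(grad, gradapprox))   # 1.0: finite and well defined
# Cosine similarity would compute np.dot(grad, gradapprox) / (||grad|| * ||gradapprox||),
# which is 0/0 here and produces nan.
```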
The idea is that you want to know when the update is small, so that you can stop the iterations. The problem is: what does it mean to be small?
One option is to calculate the distance $\|{\bf a} - {\bf b}\|_2$ and then compare it against $0$, or against a very small number $\epsilon$. If it is close to zero, then stop. But here is the problem: imagine you multiply the cost function by a factor $k$ (arbitrary, e.g. the size of the problem, or $1/2$, ...). Then each vector is scaled by the same factor:
$$
\| k {\bf a} - k {\bf b} \|_2 = k \|{\bf a} - {\bf b} \|_2
$$
For example, imagine $k = 10^3$: what value should you now compare against to stop? If you don't change $\epsilon$, the threshold no longer matches the scale of the problem, so the algorithm can stop before converging or keep running long after it has converged.
To avoid this problem, divide by the lengths of the vectors:
$$
\frac{\| k {\bf a} - k {\bf b} \|_2}{k\|{\bf a}\|_2 + k\|{\bf b}\|_2} = \frac{k\|{\bf a} - {\bf b} \|_2}{k\left(\|{\bf a}\|_2 + \|{\bf b}\|_2\right)} = \frac{\|{\bf a} - {\bf b} \|_2}{\|{\bf a}\|_2 + \|{\bf b}\|_2}
$$
which clearly does not depend on the scale of the problem, so $\epsilon$ is now meaningful.
answered Dec 4 '18 at 16:40 by caverac
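A quick numerical check of this scale invariance (my own sketch; the vectors and the factor $k$ are arbitrary choices):

```python
import numpy as np

def relative_difference(a, b):
    # ||a - b||_2 / (||a||_2 + ||b||_2): unchanged when both vectors are scaled by the same k.
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

a = np.array([0.5, -1.0, 2.0])   # e.g. the analytic gradient
b = np.array([0.5, -1.1, 2.0])   # e.g. the numerical gradient estimate
k = 1e3                          # rescaling the cost function rescales both gradients by k

print(np.linalg.norm(a - b), np.linalg.norm(k * a - k * b))          # raw distance grows by k
print(relative_difference(a, b), relative_difference(k * a, k * b))  # identical: the k cancels
```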