Gradient checking in neural network with dot product

I was taking the second course of the deeplearning.ai specialization on Coursera and was watching a video on gradient checking for neural networks. After we compute the gradient vector and the approximated gradient vector as shown here, why is the strange formula
$$\text{difference} = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2} \tag{3}$$
being used to measure how similar the two vectors are?
Why not use cosine similarity?
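For context, here is a minimal sketch of the kind of check being asked about. It is not the course's code: the toy cost, the helper names (`numerical_gradient`, `relative_difference`, `buggy_grad`) and the step size `eps=1e-7` are all assumptions made for illustration.

```python
import math


def l2_norm(v):
    """Euclidean length of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def numerical_gradient(f, theta, eps=1e-7):
    """Central-difference approximation of the gradient of f at theta."""
    approx = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        approx.append((f(plus) - f(minus)) / (2 * eps))
    return approx


def relative_difference(grad, gradapprox):
    """Formula (3): ||grad - gradapprox|| / (||grad|| + ||gradapprox||)."""
    numerator = l2_norm([g - a for g, a in zip(grad, gradapprox)])
    denominator = l2_norm(grad) + l2_norm(gradapprox)
    return numerator / denominator


# Hypothetical toy cost J(theta) = sum(theta_i^2), whose gradient is 2 * theta.
def cost(theta):
    return sum(t * t for t in theta)


def analytic_grad(theta):
    return [2 * t for t in theta]


def buggy_grad(theta):
    # Deliberately wrong factor, standing in for a backprop bug.
    return [3 * t for t in theta]


theta = [1.0, -2.0, 0.5]
gradapprox = numerical_gradient(cost, theta)
good = relative_difference(analytic_grad(theta), gradapprox)
bad = relative_difference(buggy_grad(theta), gradapprox)
print(good, bad)
```

In this sketch the correct gradient gives a tiny value, while the buggy one gives roughly 0.2, even though it points in exactly the same direction as the true gradient.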

edited Dec 4 '18 at 16:36

asked Dec 4 '18 at 15:39

KAY_YAK


vectors neural-networks


2 Answers

Probably to avoid division-by-zero errors. Cosine similarity divides by the product of the two lengths, so it breaks down if either vector's length rounds to $0$, which can happen as we approach a point where the gradient is zero. The formula you quote divides by the sum of the lengths, so that's not a problem as long as only one of them does. You can of course write the formula in terms of the usual cosine similarity (I'll leave that as an exercise). It's also natural to subtract one vector from the other elsewhere in gradient descent, so you can recycle a cached value.
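A small sketch of the zero-length case, assuming plain Python lists as vectors; the function names and the sample values are hypothetical and chosen only to trigger the situation described.

```python
import math


def l2_norm(v):
    """Euclidean length of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def relative_difference(a, b):
    """||a - b|| divided by the sum of the lengths."""
    numerator = l2_norm([x - y for x, y in zip(a, b)])
    return numerator / (l2_norm(a) + l2_norm(b))


# One vector has rounded to exactly zero, the other has not.
grad = [0.0, 0.0]
gradapprox = [1e-9, -1e-9]

try:
    cos = cosine_similarity(grad, gradapprox)
except ZeroDivisionError:
    cos = None  # the product of the lengths is zero

diff = relative_difference(grad, gradapprox)
print(cos, diff)
```

Here the cosine similarity cannot be evaluated, while the relative difference is still defined. Both measures fail if both vectors are exactly zero.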

answered Dec 4 '18 at 16:06

J.G.


Well I can just place some checks on the lengths. Also, this form is not so intuitive.
– KAY_YAK
Dec 4 '18 at 16:39

Firstly, that makes for messier code. Secondly, caverac's answer points out we want to check for the vectors being close together, not their scaled-to-length-$1$ counterparts being close together. Thirdly, although learning about the dot product may make cosine similarity intuitive in certain contexts, the formula you've asked about, which I'd argue is more intuitive to someone with limited geometric knowledge, is also more intuitively motivatable in the context of gradient descent.
– J.G.
Dec 4 '18 at 17:18
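To illustrate the second point in that comment, here is a hedged sketch with made-up vectors: `b` is simply `a` doubled, so the two point the same way but are far apart. The helper names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean length of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Compares directions only; the lengths cancel out."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def relative_difference(a, b):
    """Compares the vectors themselves, relative to their lengths."""
    numerator = l2_norm([x - y for x, y in zip(a, b)])
    return numerator / (l2_norm(a) + l2_norm(b))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction, twice the length

cos = cosine_similarity(a, b)    # about 1: the directions agree
diff = relative_difference(a, b)  # about 1/3: the vectors do not
print(cos, diff)
```

A gradient that is wrong by a constant factor would pass a cosine-similarity test but not the relative-difference test.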


The idea is that you want to know when the update is small, so that you can stop the iterations. The problem is: what does it mean to be small?

One option is to calculate the distance $\|{\bf a} - {\bf b}\|_2$ and compare it against $0$, or a very small number $\epsilon$. If it is close to zero, then stop. But here is the problem: imagine you multiply the cost function by a factor $k > 0$ (arbitrary, e.g. the size of the problem, or 1/2, ...); then each vector is scaled by the same factor:

$$
\| k {\bf a} - k {\bf b} \|_2 = k \|{\bf a} - {\bf b}\|_2
$$

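For example, imagine $k = 10^3$: what value should you now compare against to stop? If you don't change $\epsilon$, the test no longer means the same thing: for a large $k$ it may never pass even though the vectors agree, and for a small $k$ it may pass even though the algorithm has not converged.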

To avoid this problem, divide by the sum of the lengths of the vectors:

$$
\frac{\| k {\bf a} - k {\bf b} \|_2}{k\|{\bf a}\|_2 + k\|{\bf b}\|_2} = \frac{k\|{\bf a} - {\bf b}\|_2}{k\left(\|{\bf a}\|_2 + \|{\bf b}\|_2\right)} = \frac{\|{\bf a} - {\bf b}\|_2}{\|{\bf a}\|_2 + \|{\bf b}\|_2}
$$

which clearly does not depend on the scale of the problem, so $\epsilon$ is now meaningful.
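The scaling argument can be checked numerically. This is only a sketch under stated assumptions: the vectors, the factor `k = 1000.0`, and the function names are made up for illustration.

```python
import math


def l2_norm(v):
    """Euclidean length of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def absolute_distance(a, b):
    """Unnormalized distance ||a - b||."""
    return l2_norm([x - y for x, y in zip(a, b)])


def relative_difference(a, b):
    """Distance divided by the sum of the lengths."""
    return absolute_distance(a, b) / (l2_norm(a) + l2_norm(b))


a = [1.0, 2.0, 3.0]
b = [1.001, 1.999, 3.002]

k = 1000.0  # hypothetical rescaling of the cost function
ka = [k * x for x in a]
kb = [k * x for x in b]

abs_unscaled = absolute_distance(a, b)
abs_scaled = absolute_distance(ka, kb)    # grows by the factor k
rel_unscaled = relative_difference(a, b)
rel_scaled = relative_difference(ka, kb)  # unchanged up to rounding
print(abs_unscaled, abs_scaled, rel_unscaled, rel_scaled)
```

The unnormalized distance changes with `k`, so a fixed threshold on it would have to be retuned for every rescaling, whereas the same threshold keeps working for the normalized quantity.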

answered Dec 4 '18 at 16:40

caverac




