Scikit-Learn: Custom Loss Function for GridSearchCV

I'm working on a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation) and it states that my model will be evaluated by:

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

I couldn't find this in the docs (it's basically RMSE(log(truth), log(prediction))), so I wrote a custom scorer:

import numpy as np
from sklearn.metrics import make_scorer

def custom_loss(truth, preds):
    truth_logs = np.log(truth)
    print(truth_logs)  # debug output
    preds_logs = np.log(preds)
    numerator = np.sum(np.square(truth_logs - preds_logs))
    return np.sum(np.sqrt(numerator / len(truth)))

custom_scorer = make_scorer(custom_loss, greater_is_better=False)

Two questions:

1) Should my custom loss function return a numpy array of scores (one for each (truth, prediction) pair)? Or should it return the total loss over those (truth, prediction) pairs, as a single number?

I looked into the docs but they weren't super helpful re: what my custom loss function should return.

2) When I run:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV, KFold

xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
          "n_estimators": [1000, 2000], "n_jobs": [8], "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
                              n_jobs=8, cv=KFold(n_splits=10, shuffle=True, random_state=42), verbose=2)

grid_search_cv.fit(X, y)

grid_search_cv.best_score_

I get back:

-0.12137097567803554

which is very surprising. Given that my loss function is taking RMSE(log(truth) - log(prediction)), I shouldn't be able to have a negative best_score_.

Any idea why it's negative?

Thanks!

asked Jan 27 '18 at 22:19

bclayman

python scikit-learn loss-function


2 Answers

1) You should return a single number as the loss, not an array. GridSearchCV will rank the parameter settings according to the results of this scorer.

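By the way, instead of defining a custom metric, you can use mean_squared_log_error, which does nearly what you want: it takes the logarithm of 1 + value and returns the mean squared error, so take the square root to get an RMSE.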

2) Why does it return a negative value? Without your actual data and complete code, we can't say.
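One explanation for the sign that does not depend on the data is worth checking. As documented for make_scorer, passing greater_is_better=False marks the function as a loss, and the resulting scorer sign-flips its value so that larger is always better. The sketch below illustrates this on hypothetical toy data with a DummyRegressor; the name rmsle and the numbers are assumptions made for the example, not the asker's setup.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error


def rmsle(truth, preds):
    """Return one float: the RMSE between log(truth) and log(preds)."""
    truth = np.asarray(truth, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.sqrt(np.mean(np.square(np.log(truth) - np.log(preds)))))


# greater_is_better=False tells scikit-learn this is a loss, so the scorer
# returns the negated value.
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Hypothetical toy data, only for illustration.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([100.0, 120.0, 90.0, 200.0, 150.0, 110.0])

model = DummyRegressor(strategy="mean").fit(X, y)

raw_loss = rmsle(y, model.predict(X))      # non-negative
scorer_value = rmsle_scorer(model, X, y)   # same magnitude, negative sign

# For comparison: mean_squared_log_error uses log(1 + x) and is not rooted,
# so its square root is close to, but not identical to, rmsle.
msle_root = float(np.sqrt(mean_squared_log_error(y, model.predict(X))))
```

If the same applies to the grid search in the question, the loss of the best parameter setting would be -grid_search_cv.best_score_.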

answered Jan 28 '18 at 2:58

Vivek Kumar


You should be careful with the notation.

There are 2 levels of optimization here:

1. The loss function optimized when the XGBRegressor is fitted to the data.

2. The scoring function that is optimized during the grid search.

I prefer calling the second one a scoring function rather than a loss function, since "loss function" usually refers to the term that is optimized during the model fitting process itself.
However, your custom function only specifies 2. while leaving 1. untouched. If you want to change the loss function of XGBRegressor, see here. Most regression models offer several criteria to choose from, such as mean_square_error or mean_absolute_error.

Note that passing customized loss functions is not supported at the moment (see reasons here and here).
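To make the two levels concrete, here is a minimal sketch. It swaps in scikit-learn's GradientBoostingRegressor for XGBRegressor so that it only depends on scikit-learn, and it uses synthetic data; the choice of the built-in "huber" loss and of "neg_mean_absolute_error" as the scoring string are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, only for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 10.0, size=(60, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=60)

# Level 1: the loss minimized while each individual model is fitted.
model = GradientBoostingRegressor(loss="huber", n_estimators=20, random_state=0)

# Level 2: the scoring function used to compare parameter settings.
search = GridSearchCV(
    model,
    param_grid={"max_depth": [2, 3]},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

# The built-in "neg_" scorers are negated losses, so flip the sign back.
best_mae = -search.best_score_
```

Changing the scoring argument only changes which parameter setting is selected; every candidate model is still fitted with the loss chosen at level 1.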

edited Nov 23 '18 at 23:52

answered Nov 23 '18 at 23:25

dopexxx
