Scikit-Learn: Custom Loss Function for GridSearchCV

I'm working on a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation) and it states that my model will be evaluated by:



Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)


I couldn't find this in the docs (it's basically RMSE(log(truth), log(prediction))), so I went about writing a custom scorer:



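import numpy as np
from sklearn.metrics import make_scorer

def custom_loss(truth, preds):
    # RMSE between log(truth) and log(preds), returned as a single number
    truth_logs = np.log(truth)
    print(truth_logs)  # debug output
    preds_logs = np.log(preds)
    numerator = np.sum(np.square(truth_logs - preds_logs))
    return np.sqrt(numerator / len(truth))

# greater_is_better=False marks this as a loss rather than a score
custom_scorer = make_scorer(custom_loss, greater_is_better=False)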


Two questions:



1) Should my custom loss function return a numpy array of scores (one for each (truth, prediction) pair)? Or should it be the total loss over those (truth, prediction) pairs, returning a single number?



I looked into the docs but they weren't super helpful re: what my custom loss function should return.



2) When I run:



import xgboost as xgb
from sklearn.model_selection import GridSearchCV, KFold

xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
          "n_estimators": [1000, 2000], "n_jobs": [8], "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
                              n_jobs=8, cv=KFold(n_splits=10, shuffle=True, random_state=42), verbose=2)

grid_search_cv.fit(X, y)

grid_search_cv.best_score_


I get back:



-0.12137097567803554


which is very surprising. Given that my loss function is taking RMSE(log(truth) - log(prediction)), I shouldn't be able to have a negative best_score_.



Any idea why it's negative?



Thanks!

python scikit-learn loss-function

asked Jan 27 '18 at 22:19
bclayman

2 Answers

1) You should return a single number as the loss, not an array. GridSearchCV ranks the parameter combinations according to the values this scorer returns.



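By the way, instead of defining a custom metric, you can use mean_squared_log_error, which is close to what you want: it computes the mean squared error of log(1 + y), so you still need to take its square root to get an RMSE.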
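To illustrate how close the two metrics are, here is a small sketch with made-up prices (the arrays are illustrative, not competition data). It assumes scikit-learn's mean_squared_log_error, which works on log(1 + y), and compares its square root with an RMSE computed on plain logarithms:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up sale prices, purely for illustration
truth = np.array([200000.0, 150000.0, 320000.0])
preds = np.array([210000.0, 140000.0, 300000.0])

# Square root of scikit-learn's MSLE, which works on log(1 + y)
rmsle_sklearn = np.sqrt(mean_squared_log_error(truth, preds))

# RMSE on plain logarithms, as in the question's custom_loss
rmse_log = np.sqrt(np.mean((np.log(truth) - np.log(preds)) ** 2))

print(rmsle_sklearn, rmse_log)
```

For values as large as house prices the extra 1 inside the logarithm barely matters, so the two numbers should agree to several decimal places.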



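2) Why is it negative? Because you passed greater_is_better=False, make_scorer negates the value returned by custom_loss so that GridSearchCV can always treat higher scores as better. best_score_ is therefore the negated loss, and the underlying RMSE is about 0.1214.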
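A minimal sketch of the sign behaviour, assuming synthetic data and a LinearRegression stand-in so that it runs without xgboost (log_rmse, X and y here are illustrative names, not the asker's objects):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, KFold

def log_rmse(truth, preds):
    # Single number: RMSE between log(truth) and log(preds)
    return np.sqrt(np.mean((np.log(truth) - np.log(preds)) ** 2))

# Synthetic, strictly positive data (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))
y = 50.0 + X @ np.array([3.0, 2.0, 1.0]) + rng.uniform(0.0, 1.0, size=200)

scorer = make_scorer(log_rmse, greater_is_better=False)

model = LinearRegression().fit(X, y)
loss = log_rmse(y, model.predict(X))  # non-negative loss
score = scorer(model, X, y)           # same value with the sign flipped

search = GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]},
                      scoring=scorer,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)

print(loss, score)          # score is the negated loss
print(-search.best_score_)  # negate again to read it as an RMSE
```

So a negative best_score_ is expected with a loss-style scorer; negating it gives back the cross-validated RMSE.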







answered Jan 28 '18 at 2:58
Vivek Kumar

Be careful with the terminology.

There are two levels of optimization here:

1. The loss function that is optimized when the XGBRegressor is fitted to the data.
2. The scoring function that is optimized during the grid search.

I prefer to call the second one a scoring function rather than a loss function, since "loss function" usually refers to the term that is optimized during the model fitting process itself. Your custom function only specifies 2. and leaves 1. untouched. If you want to change the loss function of XGBRegressor, see here. Most regression models offer several criteria to choose from, such as mean_square_error or mean_absolute_error.
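The two levels can be made explicit in code. The sketch below is one possible arrangement, not the asker's setup: it swaps in scikit-learn's TransformedTargetRegressor so that the inner model is fitted on log(y), and uses a separate scorer only for model selection. Ridge, the synthetic data and the name log_rmse are assumptions chosen so the snippet runs without xgboost.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def log_rmse(truth, preds):
    # Scoring function (level 2): RMSE between log(truth) and log(preds)
    return np.sqrt(np.mean((np.log(truth) - np.log(preds)) ** 2))

# Synthetic positive targets that are roughly linear on the log scale
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = np.exp(1.0 + X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 0.05, size=200))

# Level 1: the inner Ridge minimizes its penalized squared error on log(y)
model = TransformedTargetRegressor(regressor=Ridge(), func=np.log, inverse_func=np.exp)

# Level 2: the grid search ranks candidates by the negated scoring function
search = GridSearchCV(model, {"regressor__alpha": [0.01, 0.1, 1.0]},
                      scoring=make_scorer(log_rmse, greater_is_better=False), cv=5)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

Here the scorer never influences how each candidate is fitted; it only decides which fitted candidate the grid search prefers.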



Note that passing customized loss functions is not supported at the moment (see reasons here and here).







edited Nov 23 '18 at 23:52
answered Nov 23 '18 at 23:25
dopexxx