design of a scalable, stateful python RabbitMQ consumer












0















I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.




  • One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
    This payload is self-contained, i.e. it contains all the information required by the application to perform its job.


  • Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.



Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.




  • For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

  • For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.


Therefore, for [2] I can see two options:




  • use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
    Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

  • use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
    In this case the parallelism is known to rabbit because the queue will have multiple consumers.


Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.



Any advice on how to best implement a scalable architecture?










share|improve this question



























    0















    I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
    These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.




    • One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
      This payload is self-contained, i.e. it contains all the information required by the application to perform its job.


    • Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.



    Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.




    • For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

    • For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.


    Therefore, for [2] I can see two options:




    • use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
      Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

    • use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
      In this case the parallelism is known to rabbit because the queue will have multiple consumers.


    Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
    Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.



    Any advice on how to best implement a scalable architecture?










    share|improve this question

























      0












      0








      0








      I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
      These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.




      • One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
        This payload is self-contained, i.e. it contains all the information required by the application to perform its job.


      • Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.



      Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.




      • For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

      • For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.


      Therefore, for [2] I can see two options:




      • use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
        Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

      • use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
        In this case the parallelism is known to rabbit because the queue will have multiple consumers.


      Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
      Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.



      Any advice on how to best implement a scalable architecture?










      share|improve this question














      I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
      These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.




      • One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
        This payload is self-contained, i.e. it contains all the information required by the application to perform its job.


      • Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.



      Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.




      • For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

      • For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.


      Therefore, for [2] I can see two options:




      • use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
        Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

      • use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
        In this case the parallelism is known to rabbit because the queue will have multiple consumers.


      Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
      Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.



      Any advice on how to best implement a scalable architecture?







      python multithreading rabbitmq multiprocessing






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 23 '18 at 17:15









      MoZZoZoZZoMoZZoZoZZo

      10810




      10810
























          0






          active

          oldest

          votes











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450708%2fdesign-of-a-scalable-stateful-python-rabbitmq-consumer%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          0






          active

          oldest

          votes








          0






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes
















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450708%2fdesign-of-a-scalable-stateful-python-rabbitmq-consumer%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Berounka

          Fiat S.p.A.

          Type 'String' is not a subtype of type 'int' of 'index'