design of a scalable, stateful python RabbitMQ consumer

I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.

One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.

Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.

Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.

For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.

Therefore, for [2] I can see two options:

use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.

Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.

Any advice on how to best implement a scalable architecture?

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

add a comment |

One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.

Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.

Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.

For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.

Therefore, for [2] I can see two options:

use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.

Any advice on how to best implement a scalable architecture?

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

add a comment |

One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.

Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.

Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.

For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.

Therefore, for [2] I can see two options:

use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.

Any advice on how to best implement a scalable architecture?

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.

Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.

Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.

For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.

For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.

Therefore, for [2] I can see two options:

use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit.

use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.

Any advice on how to best implement a scalable architecture?

python multithreading rabbitmq multiprocessing

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

asked Nov 23 '18 at 17:15

MoZZoZoZZo

10810

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450708%2fdesign-of-a-scalable-stateful-python-rabbitmq-consumer%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htykuut