A lot of the posts that I write or think about writing are, for all intents and purposes, brain dumps, and this one is no different.
At work, I am in the process of re-architecting our platform, and there is no shortage of questions in need of answers. One of the answers I need to formulate is what we require in the way of queueing. I have a few scenarios to cover, so I am going to try to work them out here.
We have the typical objects that any eCommerce platform has, such as orders, carts, and customers. At the same time that we are redesigning our core platform, we are also re-evaluating and making changes to our data model. The ability to deploy the services as they are ready carries with it the requirement that the legacy systems can interop with the new data model, and vice versa, for some period of time. We need to make sure that these updates are guaranteed to be processed successfully (guaranteed delivery and durability).
The standard ETL process can certainly make the translation and update in either direction, but given that we want this in real time, ETL seems suboptimal. A better fit seems to be a message queue: each old-to-new and new-to-old translation point could queue up data changes as they happen, and a consumer could perform the translation and update the target data store much closer to real time.
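The shape of that producer/consumer translation can be sketched with nothing but the standard library. This is a minimal illustration, not our actual implementation: the field names (`cust_no`, `customer_id`) and the in-memory queue are stand-ins for whatever broker and schemas we end up choosing.

```python
import queue
import threading

def to_new_model(event):
    """Translate a legacy-model change into the new model's shape (hypothetical fields)."""
    return {"customer_id": event["cust_no"], "name": event["cust_nm"]}

changes = queue.Queue()

def consumer(target_store):
    while True:
        event = changes.get()
        if event is None:          # sentinel to stop the worker
            break
        target_store.append(to_new_model(event))
        changes.task_done()        # ack only after the update succeeds

new_store = []
worker = threading.Thread(target=consumer, args=(new_store,))
worker.start()

# The producer enqueues legacy-model changes as they happen.
changes.put({"cust_no": 42, "cust_nm": "Ada Lovelace"})
changes.join()                     # block until every queued change is processed
changes.put(None)
worker.join()
```

The important property is the ack at the end of the handler: the change is only marked done after the target store has been updated, which is the seed of the guaranteed-processing requirement below.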
We also want to collect as much logged data as we can get away with. This means that we need to get the data off of the generating process as quickly as possible so the threads are not blocked while persistence takes place (fire and forget). Unlike the translation requirements above, our log data can endure some loss. Optimally we would capture all of it, but we are okay with losing some.
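Fire-and-forget with acceptable loss usually boils down to a bounded buffer that sheds load instead of blocking the caller. A rough sketch of that trade-off, with a deliberately tiny buffer size for illustration:

```python
import queue

# Bounded buffer between the hot path and the log writer; size 2 is illustrative.
log_buffer = queue.Queue(maxsize=2)
dropped = 0

def log_async(message):
    """Fire and forget: never block the caller; drop when the buffer is full."""
    global dropped
    try:
        log_buffer.put_nowait(message)
    except queue.Full:
        dropped += 1               # acceptable loss, per our requirements

for i in range(5):
    log_async(f"event {i}")        # first two are buffered, the rest are shed
```

The caller returns immediately every time; a real log writer would be draining `log_buffer` on another thread, but the point is that persistence latency never reaches the generating process.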
The other high-volume data that we want to capture is the user actions that happen in the browser. This changes the landscape a bit because of the volume of data that will be produced. Added to the volume challenge is the need to multi-cast the incoming stream of data so it can be analyzed in real time as well as sent to an archival data store for analysis later.
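The multi-cast requirement is essentially a fan-out exchange: every subscriber gets its own copy of each event and drains at its own pace. A toy version of that topology, with the subscriber names (`realtime`, `archive`) chosen just for this example:

```python
import queue

# One queue per subscriber: realtime analytics and the archive writer.
subscribers = {"realtime": queue.Queue(), "archive": queue.Queue()}

def publish(event):
    """Fan out: every subscriber receives its own copy of the event."""
    for q in subscribers.values():
        q.put(event)

publish({"action": "click", "element": "add-to-cart"})

# Each consumer reads independently; a slow archive writer never starves analytics.
realtime_event = subscribers["realtime"].get_nowait()
archive_event = subscribers["archive"].get_nowait()
```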
Lastly, I need to be able to queue work to be performed immediately or at defined times (e.g. run Job A at 2000). We use Quartz for scheduling work now, but I think I am the only fan of the framework at my company. We have a good deal of background processing for fulfillment and other back-office processes, so a scheduled work queue is important (task scheduling).
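The "immediately or at a defined time" behavior is just a priority queue keyed on run time, which is what Quartz does under the hood. A bare-bones sketch using Python's `sched` module (the job names are placeholders, and a real system would obviously need persistence and misfire handling):

```python
import sched
import time

# A minimal stand-in for a scheduled work queue; Quartz fills this role today.
scheduler = sched.scheduler(time.monotonic, time.sleep)
ran = []

def job_a():
    ran.append("Job A")

scheduler.enter(0, 1, job_a)                       # run immediately
scheduler.enter(0.01, 1, lambda: ran.append("Job B"))  # run at a defined offset
scheduler.run()                                    # blocks until both jobs fire
```

`sched` also supports absolute times via `enterabs`, which maps onto the "run Job A at 2000" case.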
Distilling this down, it looks like I need a system that can:
- Guarantee that the target will receive the requests
- Guarantee that the target can successfully process a request before it is removed from the queue
- Multi-cast (fanout) the requests to multiple consumers
- Process requests at a pre-defined time
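The second requirement in that list deserves one more sketch, because it is the one ETL does not give us for free: a request stays on the queue until its handler succeeds, and a failed attempt puts it back rather than losing it. A contrived handler that fails once, then succeeds, shows the shape (the item text and failure mode are invented for the example):

```python
import queue

work = queue.Queue()
work.put("charge order 1001")
attempts = 0

def process(item):
    """Hypothetical handler that fails on its first attempt, then succeeds."""
    global attempts
    attempts += 1
    if attempts == 1:
        raise RuntimeError("transient failure")

while not work.empty():
    item = work.get()
    try:
        process(item)
        work.task_done()           # removed from the queue only after success
    except RuntimeError:
        work.put(item)             # requeue: delivery stays guaranteed
```

A production broker would add a retry limit and a dead-letter queue so a poison message cannot loop forever, but the ack-after-success contract is the core of it.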
On top of that, I will need the usual non-functional suspects such as high availability, scalability, and security. Since this data will include customer information (PII, not PCI), there will need to be some level of transport security as well as access control.
I think that this is enough for now. I need to marinate over this before I start to look at possible solutions.