APS Blog: AWS Messaging Services

AWS Messaging Services

Published 2024-09-26

The crux of technology is that it's always changing. Hopefully for the better, but it leads to having many options to solve any given problem. You then have to sit down and evaluate your choices, using whatever rubric you prefer, even if you've done the same evaluation in the past because things have now changed.

This extends to AWS services. They are building blocks and you get to be the lucky one to decide which ones to use, even though some of those blocks look very similar. There is documentation on how to use them, there are whitepapers on the suggested way to use them, and there is no shortage of articles and posts on the internet of different ways to use them.

I try to avoid saying that there are articles out there on the wrong way to use them. That claim is hard to back up because the needs of every system are different. There are definitely more accepted ways of using these services. If you're building a web application, chances are it's going to have some sort of persistent data store suitable for OLTP workloads, probably some flavor of RDS, but there are plenty of other options that could work equally well. You could use S3 and SQLite if you were so inclined: store the file on S3 and download it on every request to query it. In the case of modifications, store the resulting file back on S3.

Now, that's getting dangerously close to being flat-out wrong, but it could theoretically work. (What's wrong with incurring network traffic for the size of your entire database on every request?) The point is, you could build any given system 10 different ways, and as the ones designing and building that system, we are responsible for finding the best way given the circumstances.

One set of services where this comes up often, and that I've been asked about more than a handful of times, is the messaging services in AWS. The ones that come up most in conversation are Simple Notification Service (SNS), Simple Queue Service (SQS), and EventBridge. Let's talk about when you should use each one.

The Playing Field

Let's start with some definitions:

SNS - A pub/sub service centered around the idea of a topic. You publish to a topic and one or more subscribers receive messages from it. SNS divides subscribers into two categories: application-to-application (A2A), which covers other AWS services and applications that can receive notifications via HTTP, and application-to-person (A2P), which includes text messaging, push notifications, and email. By default topics are not ordered, but you can configure them to be FIFO.
SQS - A queueing service where you define one or more durable queues that you can push messages to and poll messages from. While some AWS services automatically configure some of the plumbing needed to receive messages from a queue, consumers use the API to poll for any available messages in the queue. A queue may have many consumers, but only one consumer will process a given message. Similar to SNS, queues by default do not guarantee order, but you can configure them to be FIFO. (I've always loved that a service named after a FIFO data structure is not FIFO by default.)
EventBridge - A service that encompasses a few different concepts around event-driven architectures, including Event Buses, Pipes, Schemas, and a Scheduler. Primarily suited for A2A work, its first iteration was CloudWatch Events, which came out many years after SNS. In terms of order, Event Buses themselves do not guarantee any order; Pipes, however, will preserve order if the source does.
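To make the difference between a topic and a queue concrete, here is a minimal in-memory sketch. It uses no AWS APIs: the `Topic` and `Queue` classes and the subscriber and worker names are made up for illustration, and real SNS and SQS add delivery retries, visibility timeouts, and (for standard queues) best-effort ordering that this toy model ignores.

```python
from collections import deque


class Topic:
    """Pub/sub: every subscriber receives its own copy of each message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        for handler in self.subscribers:
            handler(message)


class Queue:
    """Queue: each message is handed to exactly one polling consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when there is nothing left to poll.
        return self._messages.popleft() if self._messages else None


# Two hypothetical subscribers both see the published message.
topic_log = {"billing": [], "email": []}
topic = Topic()
topic.subscribe(topic_log["billing"].append)
topic.subscribe(topic_log["email"].append)
topic.publish("order-created")

# Two hypothetical workers poll the same queue; only one gets the message.
queue = Queue()
queue.send("order-created")
queue_log = {"worker-1": queue.receive(), "worker-2": queue.receive()}

print(topic_log)  # both subscribers hold a copy
print(queue_log)  # worker-2 polled an empty queue and got None
```

The topic delivers a copy to every subscriber, while the queue hands the message to whichever worker polls first.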

All of these services can be used if your only goal is to get a message from one application to another and you don't want a direct interface between the two.

All three services could be used for basic A2A needs

You can coerce these services into solving the same problem, but they are meant for different problems. Treating them as interchangeable leads to confusion and, ultimately, a refactor of our system down the road when we realize we chose the wrong service for our use case.

SQS

While pub/sub is an architecture, a queue is an actual data structure, and you're only going to use it when you need the patterns it facilitates. SQS is no different; it's a managed service that wraps the concept of a queue. By default it is not FIFO (mainly for scaling reasons), but you can configure it to be FIFO if you need that guarantee. A standard queue makes only a best-effort attempt at ordering, so it may appear to preserve order most of the time, but you'll want to create a FIFO queue if order is really what you need.
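As a rough sketch of what "configure it to be FIFO" looks like, the helpers below build the arguments for the SQS `CreateQueue` and `SendMessage` calls. The helper names, queue name, and group ID are hypothetical. The `.fifo` name suffix, the `FifoQueue` and `ContentBasedDeduplication` attributes, and the `MessageGroupId` parameter come from the SQS API. The boto3 calls are left as comments because they need the library and AWS credentials.

```python
def fifo_queue_params(base_name: str) -> dict:
    """Build keyword arguments for creating a FIFO queue."""
    # FIFO queue names must end with the ".fifo" suffix.
    name = base_name if base_name.endswith(".fifo") else base_name + ".fifo"
    return {
        "QueueName": name,
        "Attributes": {
            "FifoQueue": "true",
            # Deduplicate on a hash of the body instead of an explicit ID.
            "ContentBasedDeduplication": "true",
        },
    }


def fifo_message_params(queue_url: str, body: str, group_id: str) -> dict:
    """Build keyword arguments for sending to a FIFO queue."""
    # Ordering is guaranteed within a message group, not across groups.
    return {"QueueUrl": queue_url, "MessageBody": body, "MessageGroupId": group_id}


params = fifo_queue_params("audit-events")
print(params["QueueName"])

# With boto3 installed and credentials configured, usage would be roughly:
#   import boto3
#   sqs = boto3.client("sqs")
#   queue_url = sqs.create_queue(**params)["QueueUrl"]
#   sqs.send_message(**fifo_message_params(queue_url, "hello", "account-42"))
```

Note that a FIFO queue orders messages per message group, so choosing the group ID is part of deciding what "in order" means for your system.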

Queues are useful when you have one or more producers, things that are sending messages to the queue, and a single consumer, something that is taking messages off of the queue. It may be better to think of having a single service as the consumer. There may be multiple instances of that service consuming from the queue, but they are all performing the same activity. A classic example is to think of a bank. People (messages) enter in line (the queue) and are helped by bank tellers (service instances). All of the bank tellers can perform the same duties, and ultimately it doesn't matter which one a person interacts with.

Multiple services can send messages to the queue, but only one service, even with multiple instances, should read from that queue.

This rules out using SQS when you need a "fan out" pattern, where a message is produced and multiple consumers get notified of it. Do not try to use it for that; it will not work by itself. (You could have the consumer forward the message to multiple other services, but at that point just use the appropriate messaging service.)

Queues are a great way of protecting downstream resources. Take an audit log, for example. You may have multiple services all producing messages that need to be persisted in a centralized audit log. Any of these messaging services could decouple the message producers from the persistence layer (cut the dumb industry speak: we just don't want 50 machines holding a connection to our database), but a queue is more useful here because it also protects that layer from being overloaded. So long as our audit log is permitted to be eventually consistent, producers can send as many messages as they need in as short a time span as they need, while the rate of writing to the audit log stays constant (or at least has some upper bound). SQS is durable, so if for some reason the audit sub-system is down, the messages will wait in the queue until they can be written.
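The audit log scenario can be sketched with a small simulation. This is not SQS itself, just an in-memory queue with a consumer that is capped at a fixed number of writes per tick. The `simulate` function and the numbers are made up for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, max_writes_per_tick):
    """Return (writes, depths): writes done and queue depth after each tick."""
    queue = deque()
    writes, depths = [], []
    next_id = 0
    for arriving in arrivals_per_tick:
        # Producers enqueue as fast as they like.
        for _ in range(arriving):
            queue.append(next_id)
            next_id += 1
        # The consumer drains at a bounded rate, protecting the audit store.
        written = 0
        while queue and written < max_writes_per_tick:
            queue.popleft()
            written += 1
        writes.append(written)
        depths.append(len(queue))
    return writes, depths


# A burst of 50 messages in one tick, then silence.
writes, depths = simulate([50, 0, 0, 0, 0, 0], max_writes_per_tick=10)
print(writes)  # [10, 10, 10, 10, 10, 0]
print(depths)  # [40, 30, 20, 10, 0, 0]
```

The burst is absorbed by the queue depth rather than by the database: the write rate never exceeds 10 per tick, and the backlog drains over the following ticks.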

Reasons to use or not use:

You have one consumer
You have multiple consumers
You need to control utilization of downstream resources
You need "immediate" message handling
You need to send messages to people

Common scenarios I've seen it used in:

To decouple scalable resources from non-scalable resources. The reasons something isn't scalable vary. It could be due to proprietary technologies that involve a license, so running multiple instances is cost-prohibitive. It could be a database that is technically scalable, but the cost of scaling it outweighs the time saved.
Job scheduling where the time to complete a single job is unknown but jobs must run in order.
Job scheduling where jobs only run at a certain time of day (like after business hours) but the specific jobs that must be run are dynamic and based on external factors. An example is a batch update job that must run if a certain table in the database was updated, where that table is updated infrequently by some API. Since the API knows whether it updated the table, it can send a message to the queue to execute the job later that night.

Queues are simple data structures, but their usage can be surprisingly complex, which is why there is so much research on queueing theory. This extends to the SQS service and some of the functionality it offers. It's easy to get running, but can be harder than you expect to get right.

SNS

SNS is truly a pub/sub model, and as such can be used as an event bus. But if that's also what EventBridge does, then which one should you use?

First, we should keep in mind the timeline of when these two came about. SNS was introduced fairly early in AWS history, back in 2010, whereas EventBridge (as CloudWatch Events at the time) wasn't introduced until six years later, in 2016. (EventBridge as we know it now was released in 2019.) For years SNS was the go-to service for pub/sub based systems (in AWS), but it has now been largely superseded by EventBridge, which contains many more features useful for that design.

As an aside, a guy named Hidekazu Konishi has a great page showing the history and timeline of AWS services if you're interested in that sort of thing.

Where SNS has a leg up on EventBridge is its built-in A2P functionality. But that doesn't mean we should use it every time we need to send a message to a person or group of people.

The biggest reason for that is you cannot customize the body if you're using it to deliver a message to someone via email. So if it needs to look pretty, like something sent to end users, then you should use a proper email service instead. (There are plenty, including SES if you want to stay in the AWS ecosystem.)

At this point in time, the primary usage I see for SNS is sending internal and/or operational alerts, especially because CloudWatch Alarms support it natively. You can easily monitor a metric (or a combination of them using composite alarms) and have it notify you and your team if a threshold is breached. Individual team members can decide if they want to be pestered by email, text message, or both! (Cue The Office WUPHF reference.)

Nearly every service sends metric data to CloudWatch, which can be configured to alarm and send to an SNS topic.
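As one hedged example of wiring an alarm to a topic, the helper below builds the arguments for CloudWatch's `PutMetricAlarm` call. The metric chosen (the age of the oldest message in an SQS queue), the helper name, the thresholds, and the topic ARN are all assumptions for illustration. The parameter names, the `AWS/SQS` namespace, and the `ApproximateAgeOfOldestMessage` metric are real. The boto3 call is commented out since it needs the library and credentials.

```python
def age_alarm_params(queue_name: str, topic_arn: str, max_age_seconds: int = 300) -> dict:
    """Build keyword arguments for an alarm that notifies an SNS topic."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 5,  # breach must persist for 5 datapoints
        "Threshold": max_age_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic that subscribers (email, SMS, ...) listen on.
        "AlarmActions": [topic_arn],
    }


# Hypothetical account ID and topic name.
alarm = age_alarm_params(
    "audit-events", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print(alarm["AlarmName"])

# With boto3 installed and credentials configured, usage would be roughly:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Each team member then subscribes to the topic with whichever protocol they prefer, and the alarm itself never needs to know who is listening.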

Reasons to use or not use:

You have one consumer
You have multiple consumers
You need to control utilization of downstream resources
You need "immediate" message handling
You need to send messages to people

Common scenarios I've seen it used in:

Alerting operations team members of metric alarms
Fanning out an S3 Event Notification prior to the introduction of EventBridge.

In short, use SNS for operational alerts. That's it.

EventBridge

At this point you've probably realized when to use EventBridge. It's the thing we reach for when we actually need an event bus! If you are migrating an existing system from on-prem into AWS, then you may want to consider Amazon MQ due to its compatibility with existing brokers such as ActiveMQ and RabbitMQ, but if you're trying to go all-in on native AWS then you'll want to build around EventBridge.

There is a decent number of features and a lot of flexibility encapsulated in the service:

Default Event Bus - The event bus that many AWS services natively send events to
Custom Event Buses - One or more event buses that you can define
Partner Event Buses - Event buses that allow you to receive events from various SaaS vendors (there are almost 40 at this point)
Pipes - A mechanism to wire up point-to-point integrations for a set of defined services.
Scheduler - Used for all your cron job needs
Schemas - Allow you to define event schemas and produce bindings for Go, Java, Python, and TypeScript

There are also a handful of other capabilities, like event enrichment and transformation, that you'll find in most event bus technologies.

One design that I often use is Pipes sending events from DynamoDB Streams to a Custom Event Bus. Lambda Functions receive the events and make the appropriate updates back to the DynamoDB table. This provides a way to replicate data to multiple related records when the source record changes.

Record replication with DynamoDB Streams and EventBridge.

If it's not already apparent, you need to be very thoughtful when using this pattern. You can easily create an infinite loop.
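One way to guard against that loop, sketched below, is to replicate only when a watched attribute on the source record actually changed. This assumes the stream is configured with both old and new images, and that the function sees the raw DynamoDB Streams record shape (Pipes and EventBridge may wrap it in an envelope). The function names, keys, and attribute names are hypothetical.

```python
def changed_attributes(record: dict, watched: set) -> set:
    """Return the watched attributes whose value differs between images."""
    # Assumption for this sketch: only MODIFY events trigger replication.
    if record.get("eventName") != "MODIFY":
        return set()
    images = record.get("dynamodb", {})
    old = images.get("OldImage", {})
    new = images.get("NewImage", {})
    return {name for name in watched if old.get(name) != new.get(name)}


def should_replicate(record: dict, watched: set) -> bool:
    """True only when a source-of-truth attribute changed."""
    return bool(changed_attributes(record, watched))


# A change to the source record: the author's display name was edited.
source_update = {
    "eventName": "MODIFY",
    "dynamodb": {
        "OldImage": {"pk": {"S": "AUTHOR#1"}, "display_name": {"S": "Ada"}},
        "NewImage": {"pk": {"S": "AUTHOR#1"}, "display_name": {"S": "Ada L."}},
    },
}

# The write our own function made to a related record. It also lands on
# the stream, but it touches a different (copied) attribute.
replica_update = {
    "eventName": "MODIFY",
    "dynamodb": {
        "OldImage": {"pk": {"S": "POST#9"}, "author_display_name": {"S": "Ada"}},
        "NewImage": {"pk": {"S": "POST#9"}, "author_display_name": {"S": "Ada L."}},
    },
}

watched = {"display_name"}
print(should_replicate(source_update, watched))   # True
print(should_replicate(replica_update, watched))  # False
```

Because the replicated write never touches a watched attribute, it does not trigger another round of replication. Other guards, such as filtering on key prefix, would work too. The important part is that the function's own writes can be told apart from source changes.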

Reasons to use or not use:

You have one consumer
You have multiple consumers
You need to control utilization of downstream resources
You need "immediate" message handling
You need to send messages to people

Common scenarios I've seen it used in:

Inter/Intra-application messaging
Redirecting stream data (such as DynamoDB Streams, Kinesis, Kafka, etc.) to multiple consumers.
Integration with third party applications to perform actions based on events in their platform.
Job and reporting scheduling
Operational tools that respond to events in an AWS account, like sending messages to a Slack channel when a CodeBuild build fails.
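For the CodeBuild example, the rule on the event bus is driven by an event pattern. The sketch below shows such a pattern alongside a deliberately simplified matcher that handles only exact-value matching. Real EventBridge patterns support much more (prefix, numeric, anything-but, and so on), and the `matches` function and project name are my own illustration, not part of any AWS SDK.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern leaf is a list of allowed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for failed CodeBuild builds.
failed_build_pattern = {
    "source": ["aws.codebuild"],
    "detail-type": ["CodeBuild Build State Change"],
    "detail": {"build-status": ["FAILED"]},
}


def build_event(status: str) -> dict:
    """A trimmed-down, hypothetical CodeBuild state change event."""
    return {
        "source": "aws.codebuild",
        "detail-type": "CodeBuild Build State Change",
        "detail": {"build-status": status, "project-name": "my-project"},
    }


print(matches(failed_build_pattern, build_event("FAILED")))     # True
print(matches(failed_build_pattern, build_event("SUCCEEDED")))  # False
```

Fields in the event that the pattern does not mention, like `project-name` here, are ignored, which is what lets one rule cover every project in the account.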

Recap

EventBridge is going to be your go-to service in most cases. If there are some special considerations that require a queue data structure then SQS is going to be the obvious choice. And in general, use SNS for operational notifications only. (I've made my claim and I'm sticking to it.)

Happy building!