Written by Sam Kilada
In my previous article, I talked about serverless architecture and some of the benefits the platform offers for writing web apps. The scalability of serverless applications is hard to refuse, and so the temptation is to dive right in and start writing code in AWS Lambda or Azure Functions.
After diving in, however, you will quickly run into a problem: how are thousands or even millions of function instances going to talk to one MySQL database at the same time? If you are building a high-traffic application, this is going to quickly become a bottleneck. For example, the Azure Database for MySQL has an upper limit of 20,000 concurrent connections. Each Function or Lambda instance is going to have its own connection to the database, and so that’s just not a viable solution.
As with most software or architectural problems, there is more than one way to solve this. Perhaps a series of “middle-man” servers acting as the API gateway can be in charge of fetching the data and making it available to the serverless stack. However, if not architected correctly, such a solution could increase the complexity of the server architecture and introduce bottlenecks down the road. The goal is to avoid restricting data access to a few servers because this runs against the grain of limitless computing.
This realization raises a question: why aren't databases as scalable as web apps? If we can have effectively unlimited cloud compute, why can't we have unlimited database connections? The answer is, "Of course we can!"
The options are somewhat limited at the moment, but both Amazon Web Services and Microsoft Azure offer formidable solutions to this problem. A fair comparison would be AWS DynamoDB and Azure Cosmos DB. Each is a non-relational database, offering unlimited connections and high scalability. This is just the sort of thing that would work well in a serverless web application.
In this article, we'll focus on Cosmos DB as our example. Both are fully managed platforms that offer a NoSQL data storage solution, but Cosmos DB takes its query API to the next level. Though the data stored in Cosmos is document-based, a SQL API is available for accessing the stored data. With the Cosmos SQL API one can easily do something like this:
SELECT * FROM users u WHERE u.city = 'Portland' AND u.items > 7 AND u.zipCode IN (...)
Such a query might raise performance concerns, but there is nothing to fear here: Microsoft promises read latency under ten milliseconds. This is possible because Cosmos DB automatically indexes every property in every document, so you don't have to hand-design indexes to query your data efficiently. This is a huge benefit, with the only real trade-off being that it is still a non-relational database, so JOINs across documents and other useful relational queries are not possible.
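To make the filter concrete, here is a toy in-memory stand-in for that query in Python. The sample documents are hypothetical; the real service evaluates the predicate server-side against its automatic per-property indexes.

```python
# Hypothetical sample documents standing in for a Cosmos "users" container.
users = [
    {"id": "1", "city": "Portland", "items": 9,  "zipCode": "97201"},
    {"id": "2", "city": "Portland", "items": 3,  "zipCode": "97202"},
    {"id": "3", "city": "Seattle",  "items": 12, "zipCode": "98101"},
]

wanted_zips = {"97201", "98101"}  # stands in for the IN (...) list

def matches(u):
    # WHERE u.city = 'Portland' AND u.items > 7 AND u.zipCode IN (...)
    return u["city"] == "Portland" and u["items"] > 7 and u["zipCode"] in wanted_zips

results = [u for u in users if matches(u)]
print([u["id"] for u in results])  # prints ['1']
```

Only user 1 satisfies all three predicates; in Cosmos DB, each predicate can be answered from the automatic index rather than by scanning documents.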
Though indexing makes querying quite fast, partitioning still needs to be planned out. Cosmos DB will spread data out to multiple servers as it grows. If data that is relevant to one user is spread across multiple servers, this will result in a slower query. The goal, then, is to find a partition key that allows for structured scaling as the database grows.
For example, if you are building a news website and you want to store user comments in Cosmos DB, it would be wise to partition the comments by the article’s ID. This does not mean that Cosmos will create a database server for every article; rather, Cosmos will ensure that all comments for a particular article remain on one server when scaling out.
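The routing guarantee described above can be modeled as a deterministic hash of the partition key: every comment carrying the same article ID lands in the same logical partition. A simplified sketch (Cosmos's actual hashing and partition count are internal; the names and partition count here are illustrative):

```python
import hashlib

NUM_PHYSICAL_PARTITIONS = 4  # illustrative; Cosmos DB manages this internally

def partition_for(article_id: str) -> int:
    # Deterministic hash of the partition key value -> one partition.
    digest = hashlib.md5(article_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PHYSICAL_PARTITIONS

# All comments on the same article map to the same partition,
# so a "comments for article X" query touches a single server.
comments = [
    {"articleId": "news-42", "text": "Great read"},
    {"articleId": "news-42", "text": "Disagree"},
    {"articleId": "news-7",  "text": "First!"},
]
placements = {c["text"]: partition_for(c["articleId"]) for c in comments}
assert placements["Great read"] == placements["Disagree"]
```

A query filtered by a different property (say, the commenter's user ID) would have to fan out to every partition, which is exactly the slow cross-server case the article warns about.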
What also makes Cosmos DB so attractive is that the data can easily be replicated around the world. With multi-region writes enabled, both reads and writes are served from any region in a Cosmos DB account, which ensures that users all over the world have equally fast access to application data.
A common worry in web development is that simultaneous changes to data will result in the corruption or inaccuracy of the data. For example, imagine a wiki site built on the serverless architecture and utilizing Cosmos DB, and that two users happen to be editing the same article at the same time. When one user saves their changes, they will overwrite changes made by the other user without realizing it.
It turns out that this problem is not unique to the serverless platform; it is simply more likely to occur because the number of processes accessing or updating data is far higher than it would be in a traditional server architecture.
Cosmos DB has a solution for this. Every document carries a system-generated eTag property ("_etag") that a client can send back to the database when updating the document. If the supplied eTag doesn't match the document's current eTag, another process has written to the document in the meantime, and the change is rejected. Developers need to account for this scenario, for example by re-reading the document and retrying, or by showing a meaningful error message to the user.
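The wiki scenario above can be sketched with a tiny in-memory document store. This is illustrative only; in the real service, the eTag comparison happens server-side when the client sends the eTag as a precondition on the replace request, and a mismatch comes back as an HTTP 412 error.

```python
import uuid

class TinyDocStore:
    """In-memory sketch of Cosmos-style optimistic concurrency via eTags."""

    def __init__(self):
        self._docs = {}  # doc_id -> (etag, body)

    def create(self, doc_id, body):
        self._docs[doc_id] = (str(uuid.uuid4()), dict(body))

    def read(self, doc_id):
        etag, body = self._docs[doc_id]
        return etag, dict(body)

    def replace(self, doc_id, body, if_match):
        current_etag, _ = self._docs[doc_id]
        if if_match != current_etag:
            # Someone else wrote since we read: reject the stale update.
            raise ValueError("precondition failed: stale eTag")
        self._docs[doc_id] = (str(uuid.uuid4()), dict(body))

store = TinyDocStore()
store.create("wiki-page", {"text": "original"})

# Two editors read the same version of the page...
etag_a, _ = store.read("wiki-page")
etag_b, _ = store.read("wiki-page")

store.replace("wiki-page", {"text": "edit by A"}, if_match=etag_a)  # succeeds
try:
    store.replace("wiki-page", {"text": "edit by B"}, if_match=etag_b)
except ValueError:
    print("second write rejected")  # B must re-read, merge, and retry
```

Editor B's write is rejected because A's save changed the eTag, which is exactly the lost-update protection the article describes.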
AWS DynamoDB has a similar feature called "optimistic locking". This method uses a version number (much like the eTag in Cosmos DB) that lets DynamoDB accept or reject a write and prevent data from being silently overwritten.
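The same idea with an integer version counter can be sketched as a conditional write: the write succeeds only if the stored version still equals the one the writer last read. This is an in-memory model of the pattern, not the boto3 API; in DynamoDB a failed condition surfaces as a ConditionalCheckFailedException.

```python
# In-memory model of a version-checked conditional write.
table = {"item-1": {"payload": "original", "version": 1}}

def conditional_put(key, payload, expected_version):
    # Succeed only if the stored version matches what the writer last read,
    # then bump the version so any other stale writer will fail.
    item = table[key]
    if item["version"] != expected_version:
        return False  # analogous to ConditionalCheckFailedException
    table[key] = {"payload": payload, "version": expected_version + 1}
    return True

# Two writers both read version 1; only the first write lands.
assert conditional_put("item-1", "writer A", expected_version=1) is True
assert conditional_put("item-1", "writer B", expected_version=1) is False
```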
It might seem like a scalable data management system would be quite expensive. This used to be the case; when Cosmos DB was first released, there was a minimum cost per collection, which meant that if you had users, articles, and logs collections, you would already be looking at around $75 per month without even making any queries. Microsoft has since changed its pricing structure to allow database-level provisioning, which means you can pay one price for all collections in your database, somewhere around $20 to $25 per month (to start).
DynamoDB has a more attractive price tag than Cosmos. Using an example from the pricing page, Amazon claims that 3,550,000 writes and reads in a month would cost around $5. Note that this is for just one table in one region; the cost multiplies with the number of regions added to the plan (as it does for Cosmos DB). Being cheaper than Cosmos and offering somewhat similar capabilities, it is definitely worth considering.
While it means a significant change in architecture, serverless computing remains an attractive solution for scaling on the web. The tools we have at our disposal have opened a wide array of possibilities that could only be dreamed of a decade ago. Now, thanks to scalable databases, we can provide end-users with an experience that is sure to be consistent and highly available regardless of how much traffic there is.