Here’s how we implemented a combination of a database, data lake, and data warehouse to make customer reporting much faster for our client’s SMS marketing platform.
This is the fifth article of our Big Data in the Cloud series. You can read the first four here:
- Big Data in the Cloud: Why Cloud Computing is the Answer to Your Big Data Initiatives
- 8 Things You Need to Do Before Kicking Off Your Cloud Big Data Initiative
- SQL vs. NoSQL – what’s the best option for your database needs?
- How data lakes, data warehouses, and data marts fit into your cloud storage architectureeTo get notified of our future Big Data in the Cloud posts, click here to sign up for our email list.
In our prior blog post, we reviewed data lakes, data warehouses, and data marts and how they can all fit into your cloud storage architecture. You can implement multiple technologies to create the cloud data architecture that will meet your needs.
We’ve done a good amount of big data projects for our clients, and one particular case study is a perfect example of how databases, data lakes, and data warehouses can all be integrated. So let’s check it out.
Faster customer data processing for a top-tier wireless carrier’s SMS marketing
We developed a top-tier wireless carrier’s SMS marketing platform, which delivers over 4 million messages a day to subscribers. Many of the messages sent by the client contain unique URLs linking to mobile web content, and our client was looking for a better way to measure how subscribers interacted with these text messages and links.
With so many messages being sent each day, we had to take a creative approach with the design of our client’s cloud big data environment to be able to measure all of these interactions.
So we created an architecture to log each outbound message, the unique URL included in each message, and click behavior metrics on a massive scale.
The overall architecture consists of:
- Amazon RDS PostgreSQL database for transaction processing.
- S3 data lake for storing log files.
- Redshift data warehouse for reporting and analytics.
Let’s dig deeper into each component.
The Database – PostgreSQL on Amazon RDS
We implemented an Amazon RDS instance using a PostgreSQL engine for our database.
As mentioned in our SQL vs. NoSQL blog post, PostgreSQL is an ACID-compliant (Atomicity, Consistency, Isolation, Durability) transactional database. We chose to use this type of database to ensure that the SMS transaction data – messages sent, message delivery status, clicked links, and messages received – were valid so we got the true story of the customer journey.
If a system error happens to occur while data is collected, the PostgreSQL database would roll back the changes to ensure that the customer behavior data was valid (atomicity). Additionally, because of the high volume of the SMS platform, many transactions are happening concurrently. The Isolation aspect of ACID will guarantee that the data is accurate when all of these writes are occurring simultaneously.
As you can see, ACID compliance is very important in maintaining accurate customer behavior data, and PostgreSQL does a great job of this.
The Data Lake – S3
The architecture of the SMS messaging system involves several services with independent but related datasets. All of this data needs to be aggregated to produce customer reports and support real-time queries from a reporting application to get a complete picture of the user’s journey. To gather all the data, JSON transaction logs from the independent services and data from our PostgreSQL database are saved to a single S3 bucket – the data lake.
Each time a file is saved in S3, a message is posted to SQS (Simple Queue Service) queues. The queues are periodically checked with a Lambda and when there are new files in the queue, an ECS (Elastic Container Service) task starts. The files are then loaded into Redshift staging tables (more details about our Redshift setup in the next section) and processed into their final format using SQL commands inside the ECS task.
The data pulled into Redshift is aggregated for specific reports, but often there is quite a bit more data in the original transaction log files stored on S3. This data might be helpful for deeper analytics, troubleshooting, or anomaly detection. Using tools like AWS Athena or Apache Hive, we can run complex SQL queries against the data where it lives, right on S3 – no need to load it into a separate database like Redshift or PostgreSQL.
As you can see, the S3 data lake is a low-cost, flexible, easily-accessible repository to store customer data, regardless of whether it will be analyzed now or in the future.
The Data Warehouse – Amazon Redshift
The system architecture that was designed prior to moving to the cloud used a traditional ETL process. Data was extracted from the transactional database, transformed for our client’s reporting purposes, and then loaded into a traditional data warehouse. This process took several hours to complete and occurred on a nightly basis.
The implementation of S3 as our data lake and Redshift as our data warehouse significantly improved this process. Instead of doing nightly exports from the database to the data warehouse, the aforementioned transaction log files are created and sent to S3 every five minutes, which kicks off the process involving SQS, Lambda, and ECS mentioned above.
That data then gets loaded from S3 into Redshift, and queries are run to aggregate or transform the data for specific reports. These reports include aggregated messaging statuses, delivery statistics, click-through rates, and many more metrics, by various campaign types across date ranges.
So instead of loading 24 hours of data once a day, we now load data every 5 minutes, giving our client a near real-time view of their customers’ behavior.
Redshift is an excellent fit for our client since it can support datasets with billions of records in a performant manner. The current 6-node cluster handles loading of over 9 million records per day and averages 20% CPU usage evenly distributed across the nodes.
Additionally, there are times when we have to make schema changes, and Redshift’s scalability combined with our S3 data lake setup allows us to do this easily.
With traditional data warehouses, making a schema change is often a long, painful process that requires long-running queries to alter the table and reload new data. In our case, with S3 as our data lake, we can simply:
- Delete all the data from Redshift.
- Make the schema changes to Redshift quickly and easily, because there is no data there.
- Spin up a bunch of additional Redshift nodes (usually to 20+).
- Reprocess every file in S3.
- Scale back the number of Redshift nodes when the file reprocessing is complete.
Instead of taking days or weeks to change the schema, we can achieve this in a few hours.
And as the dataset continues to grow and demand for reporting increases, we can increase the number of Redshift nodes to meet these requirements.
The architecture that we designed allows us to capture and analyze billions of data points in a quick, efficient manner. The ACID-compliant PostgreSQL database ensures that the customer behavior data is valid, S3 is the data lake that stores the log files that are used for analysis, and Redshift provides fast data reporting at scale.
Conclusion
As you can see in the example above, you can design a cloud architecture that includes a database, data warehouse, and data lake to store and analyze data at massive scale. The architecture we designed helps our client better understand how recipients are interacting with their messages, measure the success of SMS campaigns, and gain insight into planning future campaigns.
What are your thoughts about our architecture? Is there anything that you would do differently? How have you designed your cloud big data architecture? Post your comment below, we would love to hear your thoughts!
Like this post? It likes you too. 🙂 Please share it using the share buttons to the left. Then join our mailing list below, follow us on Twitter @thorntech, and join our Facebook page for future updates.
[thrive_leads id=’26507′]