At Didomi, we recently set up a first version of our analytics/reporting platform for business metrics (not pure tech/operations metrics). We wanted to start collecting business metrics and have an easy way to collect, store, query and graph them that would be 1) adapted to our size (we are still a pretty small company), 2) cost-effective (see #1) and 3) flexible and scalable enough that we won't need to revisit it in 6 months.
What events do we collect?
We help companies comply with privacy regulations (GDPR, ePrivacy, etc.), so our main source of events is our SDK, which gets deployed on websites or mobile apps and displays banners, forms and other elements to manage user privacy. We wanted to be able to follow events like how many times our SDK is loaded (i.e. page views), on what websites, how often users interact with it, etc. That data will be used both internally and, eventually, to build client-facing dashboards.
What we wanted to build is almost commoditised these days. The workflow has two parts:
1) An API to collect events from users that get converted into a stream that can be consumed internally by various services. The output of the stream is stored reliably for long-term use and batch processing.
2) A query engine and a front-end to run queries and graph the results (aka a BI tool).
I have used Hive/Spark before (I am a big fan of Qubole and Databricks, for instance) so I considered these options: they are incredibly powerful but also very complex and costly to operate, and definitely not adapted to our size. We rely on AWS for a lot of our operations so we started looking at what solutions they offered. The answer is: a lot. While it sometimes takes some effort to understand what each product does, they have come a long way in offering solid analytics products.
We ended up setting up the following infrastructure:
- Events get sent to our API (a standard Feathers Node.js API server running on Elastic Beanstalk / Elastic Container Service) by our SDKs
- Our API sends events as JSON to AWS Firehose
- Firehose batches the logs (up to 5 minutes or 5 MB) and sends them to S3
- AWS Athena is used to run SQL on our events stored on S3
- Redash (not an AWS product this time!) allows us to easily manage our queries and generate internal graphs/dashboards with Athena as the backend
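To make the "API sends events as JSON to Firehose" step more concrete, here is a minimal sketch of what it could look like. It is written in Python for illustration only (our API is Node.js) and is not our actual code. The stream name, the event fields and the `FakeFirehose` class are hypothetical; the `client` argument is assumed to behave like `boto3.client("firehose")`, whose `put_record` call takes a `DeliveryStreamName` and a `Record` with a `Data` payload.

```python
import json


def encode_record(event: dict) -> bytes:
    """Serialize one event as a single line of JSON, terminated by a newline.

    Firehose concatenates records as-is when it writes a batch to S3, so the
    trailing newline keeps one event per line in the resulting object.
    """
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(client, stream_name: str, event: dict) -> None:
    """Send one event to a Firehose delivery stream.

    `client` is assumed to expose the same put_record method as
    boto3.client("firehose").
    """
    client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )


class FakeFirehose:
    """Hypothetical stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record["Data"]))


# Hypothetical stream name and event shape, for illustration only.
fake = FakeFirehose()
send_event(
    fake,
    "events-stream",
    {"type": "pageview", "site": "example.com", "ts": "2017-10-05T13:42:00Z"},
)
print(fake.calls[0][1])
```

Writing one JSON object per line matters downstream: it lets a query engine treat each line of the S3 object as one row.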
The setup was pretty smooth and we did not encounter any major difficulty. We discovered some limitations of the AWS products though and they seem to be pretty young at this point:
1) AWS Firehose
AWS Firehose is not very flexible: it simply writes the data that you send it to S3. You can run a Lambda function on it but we have not tried that yet.
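For reference, here is a rough sketch of what such a Lambda function could look like. We have not run this in production, so take it as an illustration of the record contract rather than a tested setup. Firehose passes the function a batch of base64-encoded records and expects every `recordId` back with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. The `KEPT_FIELDS` list and the idea of trimming each event down to a few fields are assumptions made up for the example.

```python
import base64
import json

# Hypothetical list of fields to keep in each stored event.
KEPT_FIELDS = ("type", "site", "ts")


def handler(event, context):
    """Firehose transformation handler: keep only selected fields per record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            slim = {k: payload[k] for k in KEPT_FIELDS if k in payload}
            line = (json.dumps(slim) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line).decode("ascii"),
            })
        except (ValueError, TypeError):
            # Invalid base64, invalid JSON or a non-object payload:
            # hand the record back untouched and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning every record, including the failed ones, is part of the contract: Firehose matches the response to the input batch by `recordId`.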
Firehose does not let you automatically transform data to another format (for instance, I'd have preferred storing Parquet rather than JSON), and the S3 paths that it uses are not compatible with Hive partitions, so it is not possible to partition the Athena tables. Since Athena bills by the amount of data scanned and every query on an unpartitioned table scans all of it, timestamped logs will make the queries costly pretty fast.
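To illustrate the path mismatch: by default Firehose writes objects under a `YYYY/MM/DD/HH/` prefix (in UTC), whereas Hive-style partitions expect `key=value` path segments. The sketch below shows the kind of key rewrite that a copy job would need to do to get a Hive-compatible layout. The partition names (`year`, `month`, `day`, `hour`) and the example key are assumptions for illustration, and this is not something we have deployed.

```python
import re

# Firehose's default S3 layout: <prefix>YYYY/MM/DD/HH/<object name>, in UTC.
FIREHOSE_KEY = re.compile(
    r"^(?P<prefix>.*?)"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/"
    r"(?P<name>[^/]+)$"
)


def to_hive_key(key: str) -> str:
    """Rewrite a Firehose-style S3 key into a Hive-style partitioned key."""
    match = FIREHOSE_KEY.match(key)
    if match is None:
        raise ValueError(f"not a Firehose-style key: {key!r}")
    return (
        "{prefix}year={year}/month={month}/day={day}/hour={hour}/{name}"
    ).format(**match.groupdict())


# Hypothetical object key, for illustration only.
print(to_hive_key("events/2017/10/05/13/stream-1-2017-10-05-13-42-00-abc"))
```

With a layout like that, a query that filters on `year`, `month` and `day` only needs to read the matching prefixes instead of the whole bucket.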
2) Athena / Glue / QuickSight
That seems to be the ideal trio for this type of infrastructure on AWS and we tried to use all of them. Athena is the query engine, Glue would host the metastore (i.e. the schema of our databases and tables) and QuickSight could be used to create graphs and dashboards. Yet Glue and QuickSight are not available in our region. Although our API servers are distributed, we want to store data in Frankfurt (we do GDPR compliance, remember, so storing data in Europe is important for us).
3) Athena documentation and references online
It is pretty hard to find detailed documentation on Athena. It seems to be using Presto as its engine (https://aws.amazon.com/athena/faqs/) but a lot of the documentation from AWS is somewhat incomplete (dealing with dates/timestamps, for instance, is tricky) and the product is recent enough that there isn't much to find on StackOverflow/Google in terms of community help. You end up going through Presto/Hive questions out there and doing a lot of trial and error until you find a function that is actually supported by Athena.
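As an example of the date handling involved, here is a small sketch that builds a daily event count query. It assumes the timestamp is stored as an ISO 8601 string in a column named `ts` and that the table is called `events`; both names are hypothetical. `from_iso8601_timestamp` and `date_trunc` are Presto functions, so check that they behave as expected in your Athena setup before relying on them.

```python
def daily_counts_query(table: str, ts_column: str = "ts") -> str:
    """Build a Presto-style SQL query counting events per day.

    The table and column names are assumptions; the timestamp column is
    expected to hold ISO 8601 strings such as 2017-10-05T13:42:00Z.
    """
    return (
        "SELECT date_trunc('day', from_iso8601_timestamp({col})) AS day,\n"
        "       count(*) AS events\n"
        "FROM {table}\n"
        "GROUP BY 1\n"
        "ORDER BY 1"
    ).format(col=ts_column, table=table)


print(daily_counts_query("events"))
```

The resulting string can be pasted into the Athena console or saved as a Redash query.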
If we look at the history of how AWS products evolve, we can see that we are really at the beginning of the story for Athena: a very solid core that does what it has to do very reliably but not much more than that. They will surely keep adding more features over time.
Is it worth it?
Overall, implementing the whole setup is really easy. It took us less than a day to get it working, although it is important to note that we had some prior experience building similar systems and knew where we were going. We use CloudFormation to automate all our stacks on AWS, so our final setup is fully managed and replicable in other regions if needed. That matters because we are likely to operate in a few different regions to keep user data as close as possible to the end user.
The pricing model compared to the cost of a Hive or Spark-based solution is very (very!) favorable for a company of our size. We pay a few dollars a month for all of that, no upfront cost and we know that the price will scale naturally with our growth.
At our size and for the number of events that we process daily, we could have gone with a simpler approach (Postgres, for instance) and kept Athena/Redash as our querying layer. That being said, I really wanted to give Firehose a try and I already had a good amount of experience with this type of event management, so it did not seem too over-engineered and I knew that it would scale easily and reliably.
That's our story. The full infrastructure seems somewhat simple, although if you think about what each component is doing and the scale and depth behind each of them, you will realize the amount of work that it would have been for us to build that without AWS services. It amazes me how much the industry and AWS have grown. I have been an AWS client for 7+ years now at various companies: when I started using them, they were barely doing anything other than EC2 and RDS and, while that was very powerful already, it was "just" virtual servers.
How easy it has become to build, orchestrate and automate a reliable, distributed and scalable infrastructure for real-time APIs, analytics, AI and so much more nowadays is crazy. It also helps startups like Didomi build infrastructure that we wouldn't have been able to afford back then and fosters innovation more than anything else.
PS: I am obviously biased because my experience with other cloud vendors is somewhat limited but this is not an "AWS against the others" blog post.