The build vs buy decision is one that many product and engineering teams have to consider when they’re evaluating how to deliver new features to their customers. What is core? What adds value? What is complicated to the point of distraction? And most importantly, what is easily offloaded to a specialist?
At Cronofy we’ve focused on becoming the experts in calendar synchronization and scheduling. This has enabled us to build deep expertise in, and experience of, running the kind of system that can deliver the required functionality reliably. Building a similar system isn’t beyond the wit of a similarly motivated and technically strong team. This isn’t rocket science, but the problems are thorny, and operational experience and capability count for a lot.
This article explains the technology and decisions we’ve made whilst building Cronofy. We’ve shared this level of detail so that you and your team can make an informed decision about whether using a service provider like us will deliver value, primarily by saving time and allowing your team to focus on other, higher value, deliverables.
If you’re making the buy vs build decision then you very likely don’t have to deal with different companies’ applications accessing end-user calendar data, or with tracking usage, billing and so on. We won’t cover that functionality in this piece.
At the heart of Cronofy is a data cache. On one side we synchronize end-user calendar data, on the other we expose data from the cache via our API. This affords us many benefits but does add a level of complexity to our architecture.
Calendars are inherently distributed and asynchronous.
So the natural model of calendar data lends itself to a caching approach as calendar software is already operating from a cache.
One of the challenges with architecting distributed systems is coping with failure as a normal state of affairs.
Exchange servers will be slow and behave in peculiar ways. There were 127 versions of Microsoft Exchange in the wild the last time we counted. That leaves a lot of room for differing behaviour and quirks in how different types of events are processed.
How we keep the cache up to date varies by calendar service provider. Most have some kind of push notification system that informs you of changes as they happen, so you don’t waste compute cycles polling when there are no updates.
CalDAV servers such as Apple’s don’t provide push notifications so we have to poll them on a scheduled cycle.
Unfortunately push notifications aren’t 100% reliable either, so even where they are available we still need to poll. We do this on a regular basis to ensure any changes that did not result in a notification are captured.
Worse still is when the entire calendar service goes down (e.g. Google Calendar Downtime). Again this is when the cache comes into its own. We can still provide the last known version of the calendar data without disrupting the applications consuming it.
We operate a store of events (database on Amazon Aurora) and jobs that are queued (Amazon SQS) in response to the push notifications from the calendar services. We also have cron jobs that push broad sync jobs into the queue to catch anything that has been missed. Retries are a must but the strategy varies according to the type of error received.
One of the key challenges we’ve had to face is nondescript failures. For example, does a 500 error back from a server mean it’s having a temporary problem or that the credentials are invalid? In our experience both can be correct.
This means we’ve had to introduce the concept of quarantining the credentials of a synchronized account (a Profile in Cronofy parlance) to back off from retries but still preserve the credentials. This allows us to reinstate the account when the server becomes available without requiring the user to do anything.
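To make the quarantine idea concrete, here is a minimal sketch in Python. It is not Cronofy’s implementation: the `Profile` fields, the `classify` function, the failure threshold and the treatment of a 401 as an unambiguous credential failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"            # looks transient: try again soon
    QUARANTINE = "quarantine"  # back off, but keep the credentials
    REVOKE = "revoke"          # credentials are definitely invalid


@dataclass
class Profile:
    """Hypothetical stand-in for a synchronized account."""
    consecutive_failures: int = 0
    quarantined: bool = False


# Assumed threshold, chosen only for illustration.
QUARANTINE_AFTER = 5


def classify(profile: Profile, status: int) -> Action:
    """Decide what to do after a failed sync attempt."""
    if status == 401:
        # Assumption: an explicit authentication failure is unambiguous.
        return Action.REVOKE
    profile.consecutive_failures += 1
    if profile.consecutive_failures >= QUARANTINE_AFTER:
        # A recurring 5xx is ambiguous, so stop hammering the server
        # but preserve the credentials so sync can resume later.
        profile.quarantined = True
        return Action.QUARANTINE
    return Action.RETRY


def record_success(profile: Profile) -> None:
    """A successful sync lifts the quarantine without user involvement."""
    profile.consecutive_failures = 0
    profile.quarantined = False
```

The design point is that quarantine is a reversible state: nothing is deleted, so a later successful probe is enough to resume normal syncing.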
We run an autoscaling pool of synchronization workers (running on Amazon EC2 nodes using Amazon EKS to manage the Kubernetes clusters) because the sync load isn’t even. 9am on Monday morning across time zones is the most active time of the week for calendar updates, which in turn correlates to maximum traffic for our sync workers. But there are also times when calendar services are just a bit slower than normal, forcing us to eat up compute cycles waiting for a response from another system. The autoscaling group allows us to process as many of these updates in parallel as we can, ensuring our cache is as up to date as possible.
As described above, push notifications from calendar services are unreliable and, in some cases, not available at all. This means we can’t rely on them to drive the change notifications we send out ourselves. Instead, another set of jobs monitors for changes in the event store and decides when to notify any interested system of a change. The system sends a notification when something changes in the cache rather than relying on third party systems to tell it.
We use this to deal with some of the noisier updates and flatten them into a single update. When multiple fields on an event are updated, the calendar server will sometimes send a separate push notification for each field. We add a small delay to any notification to capture any additional changes before notifying downstream systems. This also makes recovery from failure far more robust and less prone to bursts of changes for downstream applications.
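The flattening of noisy updates can be sketched roughly as follows. This is an illustrative Python example rather than our production code: the `ChangeCoalescer` class, its method names and the choice to measure the delay from the first change in a burst are assumptions.

```python
class ChangeCoalescer:
    """Collapses a burst of changes per calendar into one notification."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay = delay_seconds
        self._due: dict[str, float] = {}  # calendar_id -> time to notify

    def record_change(self, calendar_id: str, now: float) -> None:
        # Only the first change in a burst schedules a notification;
        # later changes ride along, so the delay stays bounded.
        self._due.setdefault(calendar_id, now + self.delay)

    def pop_due(self, now: float) -> list[str]:
        """Return calendars whose delay has elapsed and clear them."""
        ready = sorted(c for c, due in self._due.items() if due <= now)
        for calendar_id in ready:
            del self._due[calendar_id]
        return ready
```

Three field-level changes arriving within the delay produce a single entry from `pop_due`, which is the one notification a downstream application would see.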
One of the data decisions we took early on was to differentiate between events that we source from a user’s calendar and the events that applications create. This, importantly, allows us to have different permissions on both classes of events.
A key aspect of working with calendars that terrified us from the start is that the standard calendar permissions give you the opportunity to wipe someone’s calendar. So, we built a layer of protection that prevents applications from doing this.
The vast majority of use cases need what we refer to as `free_busy_write` permission. If your application is scheduling a meeting, it only needs to know free/busy information about what’s in someone’s calendar, but it also needs the ability to write the event. By treating events differently according to their source, we’re able to prevent unnecessary data access yet still support the use case.
Better still, applications that use this model are safe from ever deleting or modifying events they shouldn’t.
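A simplified sketch of how splitting events by source enables this kind of permission is shown below. The `StoredEvent` type, its field names and both functions are hypothetical and exist only to illustrate the idea; they are not part of the Cronofy API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEvent:
    event_id: str
    summary: str
    start: str
    end: str
    source: str  # "calendar" (synced from the user) or "application"


def view_with_free_busy_write(event: StoredEvent) -> dict:
    """What an application holding free_busy_write may see of an event."""
    if event.source == "application":
        # The application created this event, so it sees it in full.
        return {"event_id": event.event_id, "summary": event.summary,
                "start": event.start, "end": event.end}
    # The user's own events are reduced to an anonymous busy block.
    return {"start": event.start, "end": event.end,
            "free_busy_status": "busy"}


def can_modify(event: StoredEvent) -> bool:
    """Applications may only change or delete events they created."""
    return event.source == "application"
```

Because the check is on the event’s source rather than on anything the application sends, there is no request an application could craft that would touch a user’s own events.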
So far we’ve focused on how we keep our cache synchronized with a user’s calendar service. The other aspect of running a service is how to manage creating and updating events from applications. We had to make sure that the model provided by the API was consistent with the distributed nature of calendar services while, as much as possible, hiding the complexities from client applications.
The model we’ve used is an upsert approach. When you create an event via the Cronofy API it is not immediately created in the end-user’s calendar. Instead a job is queued for later processing. This is possible because we require your application to provide the `event_id` and is why the response to a create or update event operation is an HTTP status code of `202 Accepted`. The event is then pushed to our ‘partner’ event store and then to the end-user’s calendar.
By not directly inserting the event into the calendar in-line, we can accept this very quickly and not block application calls while the user’s calendar service is processing the event. This is especially powerful when something prevents us from immediately updating the end-user’s calendar.
Whether this is a temporary loss of credentials or an outage on the calendar server, we have implemented a queue and retry protocol. This automatically processes all outstanding jobs when the connection to the calendar service is available. This eventual consistency model is key to robustness in failure-prone environments while dramatically simplifying the implementation in the application.
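The accept-now, apply-later flow can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not how Cronofy is built: `UpsertQueue`, `drain` and the `calendar_write` callable are hypothetical names, and a real system would use a durable queue.

```python
from collections import deque


class UpsertQueue:
    """Accepts writes immediately and applies them to the calendar later."""

    def __init__(self, calendar_write) -> None:
        # calendar_write(event_id, payload) is a hypothetical callable
        # that raises ConnectionError while the calendar is unreachable.
        self._calendar_write = calendar_write
        self._pending = {}     # event_id -> latest payload
        self._order = deque()  # event_ids awaiting processing

    def upsert(self, event_id: str, payload: dict) -> int:
        # The caller supplies event_id, so create and update are the
        # same operation and a later write simply replaces the payload.
        if event_id not in self._pending:
            self._order.append(event_id)
        self._pending[event_id] = payload
        return 202  # Accepted: queued, not yet in the user's calendar

    def drain(self) -> int:
        """Apply queued jobs; on failure keep the rest for a later retry."""
        applied = 0
        while self._order:
            event_id = self._order[0]
            try:
                self._calendar_write(event_id, self._pending[event_id])
            except ConnectionError:
                break
            self._order.popleft()
            del self._pending[event_id]
            applied += 1
        return applied
```

Calling `drain` again once the calendar service is reachable applies everything that was left outstanding, which is the eventual consistency behaviour described above.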
The other way we’ve optimized these updates and simplified implementation is to only push changes when something has actually changed on an event. We use a pattern called event sourcing to record the changes that are made to calendar events, and indeed to any entity in our application. This allows us to decide at the point of applying the update whether this will actually result in a change to the downstream calendar. If not, we don’t attempt an update.
This, again, simplifies the implementation for applications whilst making the calendar sync more robust. Your application just needs to submit the latest version of the event to Cronofy and our infrastructure only makes updates if necessary. With naive update models it is very easy to hit rate limits on the calendar service APIs quickly. Optimizing these interactions is therefore critical to ensure continuity of service and prevent temporary outages.
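As a rough illustration of skipping no-op updates, the sketch below compares a content hash of the submitted event with the hash of what was last pushed. Note that this is a simpler stand-in for the event sourcing approach described above, and `fingerprint` and `ChangeGate` are hypothetical names.

```python
import hashlib
import json


def fingerprint(event: dict) -> str:
    """Stable hash of an event's content, independent of key order."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ChangeGate:
    """Remembers what was last pushed for each event_id."""

    def __init__(self) -> None:
        self._last_pushed: dict[str, str] = {}

    def should_push(self, event_id: str, event: dict) -> bool:
        digest = fingerprint(event)
        if self._last_pushed.get(event_id) == digest:
            return False  # nothing changed, so spare the rate limit
        self._last_pushed[event_id] = digest
        return True
```

An application can then resubmit the same event as often as it likes; only submissions that differ from the last pushed version would reach the calendar service.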
The same approach is taken in the opposite direction when processing updates from the calendar. When our service processes an update notification for an account, it becomes trivial to only trigger a push notification when data has actually changed. This in turn saves client applications from processing phantom changes.
The majority of use cases requiring calendar sync are to drive some kind of scheduling operation. Delivering scheduling functionality to thousands of developers has forced us to design our scheduling API endpoints so that the simplest scheduling operation is easy to deliver and also easy to extend. What starts as a simple free time lookup for one person quickly escalates in complexity as users reveal all of their implicit expectations on how a scheduling service should help them.
People will want buffer times between events. They’ll want their availability driven by the contents of more than one calendar. Perhaps their available/working hours need to change depending on what is being scheduled. Maybe they need a meeting room to be available, or perhaps one of three possible colleagues needs to be involved. We’ve been forced to confront these challenges up front and thus these features are trivial to include as users reveal the complexity of their needs.
Performance is of course another key concern. Trivial, one-person availability lookups that are quick enough with a few hundred people in your database can quickly become too slow when the population becomes thousands. Add complexities like buffer times around events, different working hours and event sequencing and you quickly get into N+1 and N-squared complexity problems.
Combined with our event sourcing model, we use a CQRS (Command Query Responsibility Segregation) approach. Using the event stream, we create and maintain denormalised views of event data focused on enabling high speed lookups and the set intersection math required to find free times.
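To give a feel for the interval math involved, here is a naive in-memory sketch in Python. It is not the denormalised-view approach described above; the function names, the use of minutes since midnight and the way buffers are applied are all assumptions. Finding times when everyone is free is treated as the complement of the union of everyone’s busy time, which is equivalent to intersecting each person’s free time.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in minutes since midnight


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of intervals: overlapping or touching spans are combined."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def free_slots(busy_by_person: List[List[Interval]], window: Interval,
               duration: int, buffer: int = 0) -> List[Interval]:
    """Gaps of at least `duration` minutes in which nobody is busy."""
    lo, hi = window
    padded: List[Interval] = []
    for busy in busy_by_person:
        for start, end in busy:
            # Pad each event with the buffer, then clip to the window.
            s, e = max(start - buffer, lo), min(end + buffer, hi)
            if s < e:
                padded.append((s, e))
    slots: List[Interval] = []
    cursor = lo
    for start, end in merge(padded):
        if start - cursor >= duration:
            slots.append((cursor, start))
        cursor = end
    if hi - cursor >= duration:
        slots.append((cursor, hi))
    return slots
```

For example, with a 09:00 to 17:00 window (540 to 1020), one person busy 10:00 to 11:00 and 13:00 to 14:00, and another busy 10:30 to 12:00, a 60 minute meeting fits in `[(540, 600), (720, 780), (840, 1020)]`. Adding a 15 minute buffer leaves only `[(855, 1020)]`, which shows how quickly an innocuous requirement reshapes the result.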
This is a constant process of iteration and improvement as we support new scheduling use cases and add refinements to existing ones. Working with thousands of developers gives us a unique insight into how people can, and do, manage their time. Distilling that into a reusable product is why we exist.
Our UI Elements are a great example of where we’ve worked with our customers to build reusable scheduling UI. This encapsulates best practices from across multiple sectors to save you development time and give your users a tried and tested experience.
The goal of this article is to give you a clearer idea of some of the problems we’ve had to solve in order to deliver reliable, high performance calendar sync and scheduling infrastructure. Should you decide to build this yourself, hopefully this will have given you some valuable insight to help focus and properly scope your project.
If you’d like to explore how Cronofy’s years of building and running our service can help you deliver scheduling features in your application, we’d love to talk.