24 Jun 2021
Improving app resilience
Over the last year, many of us have seen families attending school and/or working from home; we’ve seen internet providers struggle with the unexpected increase in traffic, leading to bad connections, lag, and outages. Many times, we’ve had to turn off video in a conference call to save bandwidth, or drop WiFi and use the mobile internet connection. This isn’t much different from using your cell phone while you are commuting, at a store, or anywhere else the connection drops or is unstable.
So, it’s important to design your applications and services with the expectation that things may degrade or fail, and you may need to design a strategy for what to do when that happens.
This article aims to summarize key concepts to consider during the development and operation of your application. It does not aim to be a complete guide, as there is too much to cover. Please be sure to check the "Learning more" section and keep researching.
Key areas
Let's start with the main areas to pay attention to while designing your application's architecture and planning your operating environment.
- Alerts: when something breaks, the application can alert/notify the proper response team.
- Tests: the application can self-diagnose problems by testing user workflows at a reasonable frequency.
- Metrics: enough metrics to track and monitor the application's health, such as memory, disk, error rate, etc.
- Logs: output from the code's execution with meaningful information.
Remember that things can go bad at any time, so the more you automate, the better. Some areas can also be delegated to built-in features of the platform you're running on, such as managed services. For instance, instead of hosting your own MongoDB instance, which requires maintenance, monitoring, backups, etc., consider a managed MongoDB-compatible service such as Amazon DocumentDB or Azure Cosmos DB, or a managed MongoDB offering on Google Cloud.
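To make the metrics and alerts areas a bit more concrete, here is a minimal sketch of an error-rate monitor. The class name, the window size, the threshold, and the `notify` hook are all assumptions made for illustration; in practice a monitoring service from your platform would likely do this for you.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests and flags a high error rate."""

    def __init__(self, window_size=100, threshold=0.2, notify=print):
        self.outcomes = deque(maxlen=window_size)  # True means the request failed
        self.threshold = threshold
        self.notify = notify  # hypothetical alert hook (e-mail, pager, chat, ...)

    def record(self, failed):
        self.outcomes.append(bool(failed))
        rate = self.error_rate()
        # Only alert once the window is full, to avoid noise right after startup.
        if len(self.outcomes) == self.outcomes.maxlen and rate > self.threshold:
            self.notify(f"error rate {rate:.0%} exceeds {self.threshold:.0%}")

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)
```

Calling `record(True)` or `record(False)` after each request is enough to feed it; the alert fires on every failing sample while the rate stays above the threshold, so a real implementation would probably add some de-duplication.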
Sample app
For the rest of this article, let's consider a simple Forge-powered application that shows 3D models. The app has some code to perform authentication, handle upload and translation for files; static content (HTML & JavaScript files) that runs on the browser, which is dependent on server-side (authentication) and on Forge (viewable information).
High-level architecture
Some early decisions can help the future of your application, so it's important to at least consider how your application will scale over time.
A typical basic web application would host both code & static pages on a single server. The downside is that if the code fails, the static pages will not be served, and the entire site is down. Having separate servers could prevent that, as the static pages will continue to be served to the client and could show a better error message or even provide basic offline features. In any case, a load balancer, together with scaling rules, would define how many servers are required based on defined policies (CPU usage, traffic, etc.), or which server to restart (based on error rate, latency, etc.).
A modern architecture could run the server code as serverless functions and serve the static pages via a CDN (content delivery network), which can scale up and down much faster. All major cloud infrastructure providers offer that solution:
- AWS Lambda & CloudFront
- Azure Functions & CDN
- Google Cloud Functions & Cloud CDN
Our sample could start on a single server (cheaper) and eventually migrate to multiple servers (code & static) or to a serverless approach (scalable). This is possible, or at least easier, as long as the server code and the static pages are developed to operate independently: for .NET or Node.js, the server code implements only endpoints, while static pages are developed separately and do not depend on the server code.
Client code implementation
Now that your architecture is ready, the static pages of our sample app will be served to the client. The JavaScript code running in the browser needs to keep working, at least partially, even if your server code is not. There are many different frameworks (React, Angular, Vue, etc.), each with its own retry mechanism. Let's focus on the Forge Viewer: if loading fails unexpectedly, retry loading the model a few times, a few seconds apart, or consider having a cached version, as explained in this article: Disconnected workflows.
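The retry-then-fallback idea can be sketched as follows. In the browser this logic would be written in JavaScript; Python is used here only to illustrate the flow. The function name and the `load` and `cached` callables are hypothetical placeholders, not part of the Forge Viewer API.

```python
import time


def load_with_fallback(load, cached, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Try `load()` a few times; if every attempt fails, return `cached()` instead."""
    for attempt in range(1, attempts + 1):
        try:
            return load()
        except Exception:
            if attempt < attempts:
                sleep(delay_seconds)  # wait a little before the next attempt
    # All attempts failed: fall back to whatever was stored earlier.
    return cached()
```

Passing `sleep` as a parameter is only a convenience for testing; the default waits for real.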
Server code level implementation
The code should be ready to handle problems that could occur with the API providers it is using, from connectivity issues to degraded or unavailable services. There are several libraries and tactics for implementing that in different languages.
The most basic feature is to retry a call that fails with a 5XX error. For most 4XX errors, the input data should change before retrying; for example, a 401 or 403 likely requires a new authentication. The exception is 429 (rate-limit), but that's a specific case discussed later.
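As a rough illustration of that rule, the sketch below retries only on 5XX responses and returns any other status untouched. The function names are made up for this example, `send` stands for whatever performs the HTTP call and returns its status code, and the doubling delay between attempts is an assumption rather than a requirement.

```python
import time


def is_retryable(status):
    """5XX responses are worth retrying as-is; 4XX responses are not."""
    return 500 <= status <= 599


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it stops returning a 5XX or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status
```

Libraries such as the ones listed further down offer the same behavior with more options, so a hand-written loop like this is mostly useful for understanding what they do.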
If you expect peak traffic, it is also a good idea to have a queueing system. For our sample app, we may have a maximum amount of data to be transferred at a given time (due to memory consumption or throughput). The app can enqueue each job and handle up to X files at a time. A side benefit of a queue: it can retry if the process fails, and it offers a way to monitor your application or notify your user that the process is taking a bit longer than expected.
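The queueing idea can be sketched with an in-memory queue and a fixed number of worker threads. This is only an illustration under simplifying assumptions: `run_jobs` and its parameters are hypothetical, and a production queue would normally persist jobs so they survive a restart.

```python
import queue
import threading


def run_jobs(jobs, handler, max_concurrent=2, max_attempts=3):
    """Process `jobs` with at most `max_concurrent` running at once.

    A job whose handler raises is put back on the queue until it has been
    tried `max_attempts` times. Returns (succeeded, failed) lists.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put((job, 1))  # (job, attempt number)
    succeeded, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job, attempt = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                handler(job)
                with lock:
                    succeeded.append(job)
            except Exception:
                if attempt < max_attempts:
                    pending.put((job, attempt + 1))  # retry later
                else:
                    with lock:
                        failed.append(job)

    threads = [threading.Thread(target=worker) for _ in range(max_concurrent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return succeeded, failed
```

Here `max_concurrent` plays the role of the "X files at a time" limit, and the `failed` list is a natural place to hook in monitoring or a user notification.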
Here are a few language-specific libraries:
- .NET: Polly offers ways to retry, wait & retry, circuit breaker, and fallback. Hangfire queueing helps manage incoming calls.
- Node.js: typical projects using node-fetch or Axios can use node-fetch-retry and axios-retry. Bee-Queue helps manage incoming calls.
Forge Service
When connecting to Forge, here are a few characteristics to consider:
Queueing: when you request a Model Derivative job, a Design Automation job or a Reality Capture scene, the respective API will enqueue your request and process it as soon as a server is available. This may take from a few seconds to several minutes, depending on how busy the service is at that time and on how complex your files are, although we aim to decrease queue times as much as possible. Your application should account for that waiting time in addition to the time that it takes to process your request.
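One way to account for queue and processing time is to poll with an overall deadline instead of waiting indefinitely. The sketch below assumes a hypothetical `get_status` callable that returns a status string; the status values, the timeout, and the polling interval are placeholders, not values taken from the Forge documentation.

```python
import time


def wait_for_job(get_status, timeout_seconds=600.0, poll_seconds=10.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status()` until the job finishes or the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("success", "failed"):  # hypothetical terminal states
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(poll_seconds)
```

The timeout should cover both the time spent in the service's queue and the processing itself, which is why the default here is measured in minutes rather than seconds.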
Rate-limit: all services, not only Forge, have a rate-limit that varies per endpoint. Whenever your application gets a 429 response, it should wait and retry after X number of seconds (as per "retry-after" response header). Your application should be prepared to not hang or freeze while waiting. Please review Rate Limit documentation for each Forge service.
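A minimal sketch of honoring the header on a 429 follows. The helper names and the 5-second default are assumptions; `send` is a placeholder for the HTTP call and is expected to return a `(status, headers)` pair. The sketch blocks while waiting for simplicity, so in a real server the wait would more likely be scheduled asynchronously or pushed to a queue to keep the application responsive.

```python
import time


def retry_after_seconds(headers, default=5.0):
    """Read the Retry-After header as a number of seconds; fall back to `default`."""
    # Header names are case-insensitive, so normalize before the lookup.
    value = {k.lower(): v for k, v in headers.items()}.get("retry-after")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date, which this sketch does not parse.
        return default


def call_respecting_rate_limit(send, max_attempts=3, sleep=time.sleep):
    """On a 429, wait as instructed by the service and try again."""
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status
        sleep(retry_after_seconds(headers))
```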
Retry: assuming that your code was working before, and there was no significant change in the input data, there is a chance the API request failed due to random issues, and a typical response is a 5XX. In those cases, it's reasonable for your application to retry after a few seconds, up to a few times. One exception to this rule is the 504 error code, since this means that the service did not respond in time. Consider a scenario where you called POST to start processing a job; the service did not respond in 60s and returned 504, but it did process the call in 62 seconds. The backend is now processing your job, and if you retry with another POST, then another job will be queued. Instead of calling POST to retry after a 504, call a GET first to see if the job did get queued, e.g., GET Manifest for Model Derivative API.
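The 504 case can be sketched like this. Both callables are hypothetical stand-ins: `post_job` returns the HTTP status of the POST, and `get_job_status` represents the follow-up GET (for example a manifest lookup), returning `None` when the service does not know about the job. The return values are labels invented for this example.

```python
import time


def start_job_safely(post_job, get_job_status, wait_seconds=5.0, max_posts=3,
                     sleep=time.sleep):
    """Start a job without queuing duplicates when the POST times out.

    Returns "started", "already_queued" or "failed".
    """
    for attempt in range(1, max_posts + 1):
        status = post_job()
        if 200 <= status < 300:
            return "started"
        if status == 504:
            # The request may have been accepted after the gateway gave up,
            # so look for the job before sending another POST.
            sleep(wait_seconds)
            if get_job_status() is not None:
                return "already_queued"
        elif 500 <= status <= 599:
            if attempt < max_posts:
                sleep(wait_seconds)  # other 5XX: plain wait and retry
        else:
            return "failed"  # 4XX: the input has to change before retrying
    return "failed"
```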
Webhooks: Model Derivative jobs and Design Automation workitems offer the ability to call back your application when the execution is done, whether it succeeded or failed. That way your application can allow users to keep interacting with it while the Forge service is executing the process. It's important that once Forge calls back your application, the application returns 200 immediately and performs any follow-up actions later. This is where a queueing system is important. The callback will go to your server, which needs to relay it to the client, probably with websockets.
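The "acknowledge first, work later" pattern might look like the sketch below. The web framework is left out on purpose: `handle_callback` stands for the body of your callback endpoint, `notify_client` for the relay to the browser (for example a websocket message), and the in-memory queue for a real job queue. All of these names are assumptions.

```python
import queue
import threading

events = queue.Queue()  # in-memory stand-in for a real job queue


def handle_callback(payload):
    """Webhook endpoint body: store the event and acknowledge right away."""
    events.put(payload)
    return 200  # no slow work happens before this response


def process_events(notify_client, stop):
    """Background worker: drains the queue and relays each event to the client."""
    while not stop.is_set() or not events.empty():
        try:
            payload = events.get(timeout=0.1)
        except queue.Empty:
            continue
        notify_client(payload)  # hypothetical relay, e.g. a websocket message
```

The worker would run in its own thread, started with something like `threading.Thread(target=process_events, args=(notify, stop_event))`, so the endpoint never waits for the relay to finish.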
Viewer: make sure to use version 7.39 or newer, which has a retry policy.
For the sample application, review the Model Derivative rate-limit. It's reasonable to have a retry policy for POST Job, and to retry if the translation fails with status:failure, assuming that the failure was not caused by the input data (e.g., user error). A webhook can notify your application when a translation is done; your application would then enqueue the message and use websockets (.NET SignalR or Node.js Socket.IO) to notify the client to load the model.
Learning more
- Patterns for scalable and resilient apps by Google
- Cloud-native resiliency by Microsoft
- Achieving 99.99% uptime - a tale of Observability by Autodesk
- Tips and Tricks for Building and Testing Successful Cloud Applications and Services by Autodesk
- Autodesk Forge on AWS, by Autodesk