Foreword:

This post on microservices started innocently as a single post. They always do, don't they?

Before I knew it, it was 10k words with no sign of stopping, and I was advised to split it up into a series. "Nah," I said, "it'll be fine."

I was then advised again to split it at least into two, the first few thousand in one and the rest in the other. I relented this time.

Perhaps it is just the nature of software, or perhaps it is 'scope-creep' (we've all been there right!) but, I've only gone and written a monolithic microservices post!

I see the irony here, so I've decided to break it apart into 'micro-posts'; thanks @adamralph :)

Each 'micro-topic' will become a 'micro-post' of its own in our journey; so my apologies to those who wanted the full meal, your dinner will now be served as Tapas!


ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

Framing the problem

In a monolith world with all the great modern tooling available to you, it is easy to load a project up in your IDE, set some breakpoints, and hit 'Debug'.

You can step through your code line-by-line and inspect its state.

Or you can write integration tests to assess the state of a system and to verify that the components of the system are interacting correctly with each other.

The single-threaded process has rock-solid reliability and low latency for its in-process communication.

If you call a method on a class, you aren't concerned about whether the call will ever be reached, or whether it will take an unacceptable amount of time to get there.

Even at the boundaries of a process, things are often binary. If the database backing the application or a file resource is unavailable, the application will crash or display an error.

You can check the application logs, the system event log, a core dump or WinDbg-style listeners to help you find, reproduce and fix the problem.

It's all in one place.

Next comes the multi-threaded process, to which nearly all of the above also applies.

How many of you have experienced race-conditions in one of your multi-threaded applications?

I know I have.

Even with the great tooling available, they are often subtle and hard to pin down. Introduce state mutation into the equation and you now need to worry about concurrency. There are locks and semaphores to deal with.

Is that class I used even 'thread-safe'?

Time and the order of execution between threads is no longer reliable, and the side-effects of this increase the level of complexity, of both your code and of your ability to reason about run-time state.

But they still have reliable, fast in-process communication.
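For a concrete (and entirely contrived) example of the kind of bug I mean, here is a shared counter that loses updates when incremented from several threads at once, and the lock that fixes it; this is just an illustration, not anything from the series:

using System;
using System.Threading.Tasks;

public class HitCounter
{
    private readonly object _sync = new object();
    private int _count;

    // Not thread-safe: _count++ is a read-modify-write that can interleave
    // across threads, so concurrent increments get lost.
    public void IncrementUnsafe() { _count++; }

    // Thread-safe: the lock serialises access to the shared state.
    public void IncrementSafe() { lock (_sync) { _count++; } }

    public int Count { get { return _count; } }
}

public static class Program
{
    public static void Main()
    {
        var counter = new HitCounter();
        var tasks = new Task[8];
        for (var i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                for (var j = 0; j < 100000; j++)
                    counter.IncrementUnsafe(); // swap for IncrementSafe()
            });
        }

        Task.WaitAll(tasks);

        // Frequently prints less than 800000 with IncrementUnsafe,
        // and always 800000 with IncrementSafe.
        Console.WriteLine(counter.Count);
    }
}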

Now let's introduce asynchronous processing. Single-threaded and multi-threaded applications can be synchronous or asynchronous.

Now, within even a single thread, the order of execution is no longer reliable.

There is an excellent overview of the differences, but the take-away is that introducing multi-threading or asynchronous programming to your application can significantly increase the difficulty of reasoning about your system state, as well as of finding, reproducing and fixing bugs.

Yet they still have reliable, fast in-process communication.

Now enter the distributed system. You probably see where this is going, don't you?

You are now in the world of remote procedure calls (RPCs) across processes and networks. Each remote call is an order of magnitude slower than an in-process call, and your application's performance suffers accordingly.

In our case, with ServiceStack, the RPCs use a request/response message-passing style.

Just as before, RPCs can be made either synchronously or asynchronously, but given the cost of remote calls, asynchronous processing is not really a choice; it's practically a must-have.
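To make that concrete, here is a minimal sketch of a typed, asynchronous request/response call using a ServiceStack JsonServiceClient. The GetOrderStatus DTOs and the base URL are made-up stand-ins for illustration, not part of anything we build in this series:

using System.Threading.Tasks;
using ServiceStack;

// Hypothetical request/response DTOs, for illustration only
[Route("/orders/{OrderId}/status")]
public class GetOrderStatus : IReturn<GetOrderStatusResponse>
{
    public int OrderId { get; set; }
}

public class GetOrderStatusResponse
{
    public string Status { get; set; }
}

public class OrderStatusChecker
{
    public async Task<string> CheckAsync(int orderId)
    {
        // Every call crosses a process/network boundary, so await it
        // rather than blocking the calling thread on the round-trip
        var client = new JsonServiceClient("http://orders.example.local");
        var response = await client.GetAsync(new GetOrderStatus { OrderId = orderId });
        return response.Status;
    }
}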

Now we have to ask ourselves - will calls be inexplicably duplicated? Will they even be invoked at all?

We have just lost cabin pressure. Communication is no longer reliable.

Down the rabbit hole we go, this stuff is hard and it gets weird, really fast.

Debugging this kind of system, which is executed piecemeal across processes and networks, just cannot be done with an IDE and access to a single machine or its application logs.

Trying to reconstruct it in a test environment is an exercise in futility.

In the distributed system, debugging must be done in production.

This realisation forces you to approach the design of the system differently from the very beginning; if you want to avoid creating the distributed equivalent of the Titanic, that is!

So, first up: logs have to be centralised so that you can reason about your system state, find errors and trace the flow of events across each node.

Logging

Now, let's get specific shall we?

For ServiceStack, I've created a plugin on GitHub that builds upon the Request Logger to log to Seq.

ServiceStack.Seq.RequestLogsFeature

ServiceStack is fortunate to have some great people in its community and the plugin was quickly improved by a fellow member, thank you Richard.

Seq is an installable self-hosted service, with an HTTP API that is designed for log aggregation.

There is a readme on the project that covers setting up and using the plugin in more detail, which I won't duplicate here, but here is the basic code required to use it in your ServiceStack AppHost:

public override void Configure(Container container)
{
    // Define your Seq server location and add the plugin
    var settings = new SeqRequestLogsSettings("http://localhost:5341");
    Plugins.Add(new SeqRequestLogsFeature(settings));
}

That's it!

See everything!

The plugin now captures every incoming request to ServiceStack in Seq and is capable of logging every detail, not just the path but the headers, the request and response DTOs, execution timing, service exceptions and errors.

In addition, the logging detail can be modified at runtime, so when you need to debug in production, you can ramp up the level of detail logged. I'll come back to this again a few times in later posts.

The first thing to note is that you are not storing plain text; you are storing structured data, and that makes all the difference in the world.

With Seq that data is now easy to search, filter and aggregate using Seq's powerful query language and UI.

Logging in action

Having used Seq for a while now in other applications, I know how quickly it can help you identify and fix issues in your production systems. It's very easy to use and it comes with a free single-user licence. I highly recommend you try it out for yourself.

Tradeoffs:

Unless your logging receiver is highly available (which Seq, at the time of writing, is not), we have just created our first piece of critical infrastructure as a single point of failure (SPF).

When any SPF goes down, bad things can happen, so throughout the series, I'll point these out.

Seq does, however, have a forwarder. It listens on a local loopback address and buffers requests before forwarding them on to your log server.

This helps with network unreliability by taking the remote call, and the performance penalty associated with it, off your service's request path.
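In code terms the only difference is where the plugin points. A sketch, assuming you have a forwarder instance listening locally (substitute whatever address and port your forwarder is actually configured to listen on):

public override void Configure(Container container)
{
    // Point the plugin at the local Seq forwarder rather than the remote
    // Seq server; the forwarder buffers events and relays them on.
    // The port here is a placeholder for your forwarder's listen address.
    var settings = new SeqRequestLogsSettings("http://localhost:15341");
    Plugins.Add(new SeqRequestLogsFeature(settings));
}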

An alternative is to use a UDP broadcast style of logging, like statsd. It may serve your circumstances better.

Our logging uses an async, 'fire-and-forget' HTTP style, so the network performance cost is reduced; but if your service, the network or Seq fails, and you do not persist ServiceStack logging locally to disk or use a forwarder, you lose potentially valuable log data.
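Stripped down to its essentials, the fire-and-forget idea looks something like the sketch below. This is an illustration of the pattern and its trade-off, not the plugin's actual implementation; the ingestion URL and JSON payload are placeholders:

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public class FireAndForgetLogger
{
    private static readonly HttpClient Http =
        new HttpClient { Timeout = TimeSpan.FromSeconds(2) };

    private readonly string ingestUrl;

    public FireAndForgetLogger(string ingestUrl)
    {
        this.ingestUrl = ingestUrl;
    }

    public void Log(string jsonEvent)
    {
        // Post the structured event without awaiting the result, so the
        // request path never blocks on the log server...
        Task.Run(async () =>
        {
            try
            {
                var content = new StringContent(jsonEvent, Encoding.UTF8, "application/json");
                await Http.PostAsync(ingestUrl, content);
            }
            catch (Exception)
            {
                // ...which also means a failed send is silently dropped.
                // That is the trade-off: no back-pressure, but potential loss.
            }
        });
    }
}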

We could improve this plugin in the future (PRs welcome!) to provide more resilient, guaranteed delivery that survives network outages, but for now we are not too concerned about this.

Our rationale will become more apparent as our design is revealed.

Debugging

The second part, distributed debugging, is a much trickier problem.

In distributed systems, with so many moving parts involved, the ability to reconstruct a timeline of events and state mutations across your services is essential to debugging effectively and reasoning about the system's state and overall health.

Enter the correlation identifier, our next plugin on the road to microservices:

ServiceStack.Request.Correlation

Again, this captures every incoming ServiceStack request and adds an id to the request headers. The service gateway's internal and external calls pass this identifier on, so that you can trace service-to-service calls back to their point of origin in your logs.

Correlation
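Conceptually, the plugin boils down to something like the following global request filter. This is a simplified sketch of the idea, not the plugin's source, and the 'X-Correlation-Id' header name is a placeholder rather than the plugin's actual default:

public override void Configure(Container container)
{
    GlobalRequestFilters.Add((req, res, requestDto) =>
    {
        // Reuse the caller's correlation id if one was supplied,
        // otherwise this service is the point of origin: generate one.
        var correlationId = req.Headers["X-Correlation-Id"];
        if (string.IsNullOrEmpty(correlationId))
            correlationId = Guid.NewGuid().ToString("N");

        // Stash it on the request so that request logging and any
        // downstream service gateway calls can pick it up and pass it on.
        req.Items["CorrelationId"] = correlationId;
        res.AddHeader("X-Correlation-Id", correlationId);
    });
}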

It is very early days for this plugin, though. We need to refine it so that a full map of service-to-service calls can be reconstructed at each node, but for now that is on the road-map.

Being able to map out the calls is important for a couple of reasons that are worth mentioning at this stage, though.

re'Curse' of the infinite loop

As your services become distributed, it is easy to create recursive calls, recursive calls, recursive calls, recu... :(

Recursive Call

Having a good timeout policy on all remote calls helps, but adding self-referencing checks to the correlation plugin would let it cancel such requests outright.

Am I already in the call-map of the thing calling me?

Yes, byeee
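Here is a sketch of how that self-referencing check might look as a request filter. The 'X-Call-Chain' header and the service name are hypothetical, and in practice this belongs inside the correlation plugin itself:

public override void Configure(Container container)
{
    const string CallChainHeader = "X-Call-Chain";  // hypothetical header
    const string ServiceName = "orders-service";    // this service's identity

    GlobalRequestFilters.Add((req, res, requestDto) =>
    {
        var chain = (req.Headers[CallChainHeader] ?? string.Empty)
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);

        // Already in the call-map of the thing calling us? Then bail out
        // rather than looping forever.
        if (Array.IndexOf(chain, ServiceName) >= 0)
        {
            res.StatusCode = 508; // Loop Detected
            res.EndRequest();
        }
    });
}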

the > never > ending > chain > of > calls > . > . >

Making remote calls easy and transparent to your services is really powerful, but it is also just as easy to abuse that power.

Long call chain

With each network hop you increase the likelihood of timeouts, and the responsiveness of your APIs, and of their consumers, suffers.

Setting limits on the length of call-chains can force distributed teams to be judicious in their use of dependencies (see left-pad!) and instead foster collaboration between them, scaling horizontally and keeping the call stack thin.
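Enforcing a hop limit can live in the same place. Again this is just a sketch; the header name and the limit of three hops are illustrative choices, not values from any of our plugins:

public override void Configure(Container container)
{
    const string CallDepthHeader = "X-Call-Depth";  // hypothetical header
    const int MaxCallDepth = 3;                     // illustrative limit

    GlobalRequestFilters.Add((req, res, requestDto) =>
    {
        int depth;
        int.TryParse(req.Headers[CallDepthHeader], out depth);

        // Reject requests that have already crossed too many service hops,
        // forcing overly long chains to be flattened or redesigned.
        if (depth >= MaxCallDepth)
        {
            res.StatusCode = 400;
            res.StatusDescription = "Maximum service call depth exceeded";
            res.EndRequest();
        }
    });
}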

Further reading

There is some great research material if you are interested in reading more on this topic.

Google has a whitepaper on Dapper, a large-scale distributed systems tracing infrastructure, and there are a few implementations out in the wild to be found.

I am also looking at Vector Clocks, of which you can find a C# implementation by the brilliant @kellabyte, but the fixed-length limitations of this algorithm have led me towards interval tree clocks as another possibility.
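To give a flavour of what a vector clock involves, here is a minimal sketch (my own illustration, not @kellabyte's implementation): each node keeps a counter per node id, increments its own counter on local events, and merges incoming clocks by taking the element-wise maximum. One entry per participating node is exactly the size concern that makes interval tree clocks interesting.

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal vector clock sketch, for illustration only
public class VectorClock
{
    private readonly Dictionary<string, long> counters =
        new Dictionary<string, long>();

    // Record a local event on the given node
    public void Increment(string nodeId)
    {
        long current;
        counters.TryGetValue(nodeId, out current);
        counters[nodeId] = current + 1;
    }

    // Merge with a clock received from another node: element-wise max
    public void Merge(VectorClock other)
    {
        foreach (var entry in other.counters)
        {
            long current;
            counters.TryGetValue(entry.Key, out current);
            counters[entry.Key] = Math.Max(current, entry.Value);
        }
    }

    // True if every one of our counters is <= the other clock's counter,
    // i.e. this clock happened before (or is equal to) the other
    public bool HappenedBefore(VectorClock other)
    {
        return counters.All(entry =>
        {
            long otherValue;
            other.counters.TryGetValue(entry.Key, out otherValue);
            return entry.Value <= otherValue;
        });
    }
}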

Another task for the future is to capture requests at a lower level, such as at a proxy, so that tracing can extend beyond ServiceStack calls to include data stores, files and external infrastructure resources.

Finally, there are also very interesting possibilities emerging from Joyent for run-time inspection and tracing using DTrace and containers, which are worth keeping an eye on too.


That concludes part II. If I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with our plugins on GitHub, so don't be shy.

:)

OK, enough of that for now, NEXT!

Let's do microservices!

next up: Service discovery, load balancing and routing