00011 : Service discovery, load balancing and routing

ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

In the previous post, I covered the challenges associated with debugging and logging RPC calls across distributed systems. Now let's turn our attention to how those RPC calls work in your services.

This boils down to the fact that services using RPC calls rely on services that are in another process.

As a system grows, and services are added or removed, keeping track of what services are available and where they are becomes an issue.

You could hard-code in each service the locations of the services it depends on, but that tends to break down once you have two services and need to add the third!

Once you need to run multiple instances of the same service, or use containers and elastic scaling, suddenly DNS propagates too slowly and you don't know where everything is.

Having to re-deploy your service every time a service it depends on is updated or moved is a clear sign that you must decouple these types of dependencies between services.

It quickly becomes apparent that you need a more dynamic solution.

You need service discovery.

O Services, Services, wherefore art thou Services?

There are a number of tried-and-tested methods for discovery to be found in DHCP, Bonjour, UPnP, SSDP and DNS-SD. For web-based services, UDDI and WS-Discovery have come and - for the most part - gone.

Newer solutions like ZooKeeper, etcd and Consul have emerged to offer service discovery.

Gateways like NGINX also provide routing options which can be used for decoupling service-to-service calls.

Enterprise Service Bus systems like NServiceBus and MassTransit can also be used in a pub/sub messaging pattern to decouple service-to-service calls.

I've mentioned just a few but there are many more. You have a lot of options here, so how do you choose?

Let's first briefly cover some different patterns before I cover what we have chosen to use and why.

Centralised Registry vs. Self-Discovery

There are two common patterns that you find in solutions for Service Discovery.

The first is the service registry, a centralised database that stores the location of a service.

The second is self- or auto-discovery, where there is no central database; it is often found in zero-configuration networking. Instead, clients use a variety of approaches to broadcast packets across a network to request a remote service and wait for the required service to respond with its location.

The service registry is another single point of failure (SPF) in your infrastructure but can provide more operational control. When used with server-side discovery, which is often found in gateways, it can completely decouple any discovery logic from the services.

Zero-configuration networking can be generous on security within networks to permit devices to 'just work' but can be more challenging to secure as systems span networks. It is often more suitable for smaller networks (UPnP, Bonjour etc.).


There are four common types of service-to-service communication.


  1. Point-to-point : services talk directly to each other.

  2. Gateway : acts as the middleman, handling the routing of requests and responses between services.

  3. Gateway Request : the responding service replies directly to the calling service rather than returning through the gateway.

  4. Message Queue: services publish messages to a queue, the responding service subscribes to the messages published and in turn publishes its response to the queue for the original service to subscribe to.

Point-to-point involves the shortest route so is often the quickest but requires each end-point to take a dependency on your discovery mechanism.

The gateway can decouple many concerns from your services, handling not just routing, but caching, front-end to back-end bridging with HTTPS termination, transport conversions like HTTP to TCP/IP, formats, aggregation and load-balancing, to name just a few.

The message-queue pub/sub model is slower and is more suited for longer running processes.


For service registries only, the registration can be handled by each client directly or by the server.

As with server-side discovery, server-side registration completely decouples registration from your services.

Further reading

I've only really scratched the surface on the above keeping the explanations as brief as possible, as I want to get on to some specifics, but you can find a much better, more detailed overview of Service Discovery in Chris Richardson's excellent post as part of his series on Microservices.

Chris also has many video talks and articles available online and speaks very eloquently on all matters relating to distributed design which I have greatly enjoyed during my own research. I highly recommend checking them out.

It's make your mind up time.

So this is the first critical point where we had a variety of choices to make in our design.

Do we want smart versus dumb pipes? How about decentralised control with auto-discovery? How does our communication behave? Who controls registration? Is one single approach for all scenarios even practical?

For us they are opinionated and deliberate choices.
Our approach that follows is not inherently better or worse, but each choice has consequences for many of the subsequent design decisions. In many cases, they can actually remove choice.

We will come back to reference these choices in the rest of this series.

It is also worth pointing out that I couldn't try out everything available, so our choice is not a reflection on other solutions out there; it is just the one I felt best fit ServiceStack and suited our needs.

And the winner is

Consul. Let's cover the basics of Consul before we tackle how it fits in with ServiceStack.

Consul is a single binary executable that can run on Windows, Linux or macOS. It can run as a Server, as an Agent, or be used to send Commands to other Consul instances.

We use it as a service registry with client-side self-registration and client-side discovery, which enables point-to-point service RPCs.

Consul Datacenter

Consul, like all service registry patterns, is a potential SPF, but it is designed with High Availability in mind.

In production, you run an odd number of Server nodes which form a DataCenter (DC), typically three or five. You can scale Consul to connect multiple datacenters.

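The odd number is because Consul implements a consensus protocol based on Raft, which holds leadership elections that are won by a majority of the servers. An even number of servers adds no extra fault tolerance, so an odd number gives you the deciding vote needed to elect a leader.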

For the best possible resiliency, server nodes can be spread across physical hardware, network locations and operating systems. Running three instances allows a single node to fail while running five can tolerate two node failures.
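To make the arithmetic behind those numbers concrete, here is a minimal sketch. It assumes the usual majority rule for Raft-style consensus (a quorum is more than half of the servers). The `RaftQuorum` class and its method names are mine, purely for illustration, and are not part of Consul.

```csharp
using System;

public static class RaftQuorum
{
    // Smallest number of servers that forms a majority of the cluster.
    public static int Size(int servers) => servers / 2 + 1;

    // Number of servers that can fail while a majority is still reachable.
    public static int FailuresTolerated(int servers) => servers - Size(servers);

    // Prints quorum size and failure tolerance for clusters of 1 to 7 servers.
    public static void PrintTable()
    {
        for (var n = 1; n <= 7; n++)
        {
            Console.WriteLine(
                $"{n} servers: quorum {Size(n)}, tolerates {FailuresTolerated(n)} failure(s)");
        }
    }
}
```

Three servers need a quorum of two and tolerate one failure; five need three and tolerate two. Four servers need a quorum of three, so they still only tolerate one failure, which is why the even number buys you nothing.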

Consul is actually a hybrid model of server and client-side, something also found in Netflix's Eureka. This approach avoids one typical drawback of client-side discovery and self-registration systems i.e. network availability and latency.

It avoids this by using local agents on a loopback address.

Consul DataCenter and Agent

Each service has access to an agent co-located on the same physical hardware. Consul uses a gossip protocol, Serf, for managing membership, failure detection and message broadcasting, and Raft logs to keep each agent's list of services synchronised.

This means lookups and registrations are local and fast with no network hops.

ServiceStack [enters stage left]

This is my discovery solution, there are many just like it, but this one is mine.

So now we've made our first design choices, let me introduce our next plugin.


There is a detailed readme on the project which, as in previous posts, I won't cover here, but the minimum code to configure discovery in your ServiceStack AppHost is as follows:

public override void Configure(Container container)
{
    SetConfig(new HostConfig
    {
        // the external url:port that other services will use to access this one
        WebHostUrl = ""
    });

    // Register the plugin, that's it!
    Plugins.Add(new ConsulFeature());
}

Your ServiceStack instances can now communicate with each other, requiring nothing more than a copy of the DTO POCO. This is where ServiceStack and its DTO message-driven style really shine.

You interact with local and remote services solely through simple DTO POCO message contracts.

For most service discovery solutions, you have to know first which service you want to call. Not so for our plugin.

The difference in calling a local or remote service is indistinguishable in your code.

public class MyService : Service
{
    public void Any(RequestDTO dto)
    {
        // The gateway will automatically use the DTO type to find the correct service
        var internalResponse = Gateway.Send(new InternalDTO { ... });
        var externalResponse = Gateway.Send(new ExternalDTO { ... });
    }
}

This makes it easy to develop all your services in a single instance. You can then split them out as you need to scale, but your calling code remains exactly the same.

There are no references and no uris.

Just look at the code and let that all sink in for a second.... it's more ServiceStack magic and it's so simple, it has caused a few WTFs!

Behind the curtain, the wizard is revealed

So how does it work?


When the AppHost starts up, it registers itself with Consul. In doing so it passes a list of all the DTOs it is able to process.

Combined with ServiceStack's ability to export its DTOs and its native pre-defined routes, this makes it easy to move service methods between projects.

To call a remote method, the calling service only needs to have a copy of the DTO (the contract) with the same name and structure as the one in the remote service.

The gateway will recognise any DTO it cannot process itself and instead look up the correct service from Consul.

This allows our plugin, with Consul's help, to provide automatic and completely transparent DTO routing.
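The register-then-resolve flow described above can be sketched roughly as follows. This is not the plugin's actual code: an in-memory dictionary stands in for Consul, and `DtoRegistry`, `Register`, `Resolve` and the two example DTOs are hypothetical names I have made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the service registry (an assumption for this sketch).
public class DtoRegistry
{
    // Maps a DTO type name to the base URLs of the services that can process it.
    private readonly Dictionary<string, List<string>> services =
        new Dictionary<string, List<string>>();

    // Called at start-up: a service announces its URL and the DTOs it handles.
    public void Register(string serviceUrl, params Type[] dtoTypes)
    {
        foreach (var dto in dtoTypes)
        {
            if (!services.TryGetValue(dto.Name, out var urls))
                services[dto.Name] = urls = new List<string>();
            urls.Add(serviceUrl);
        }
    }

    // Called for a DTO that cannot be processed locally.
    // Returns null when no registered service handles the DTO.
    public string Resolve(Type dtoType) =>
        services.TryGetValue(dtoType.Name, out var urls) ? urls.FirstOrDefault() : null;
}

// Hypothetical request DTOs used only to exercise the sketch.
public class GetOrders { }
public class GetInvoices { }
```

Because the lookup in this sketch is keyed on nothing but the DTO's name, two different DTOs sharing a name would be routed to the same place, which hints at why unique DTO names matter so much to this style of routing.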

This also avoids the overheads of message-bus and gateway-style discovery by allowing point-to-point communication between services.

The verbiage on verbs

It is worth expanding slightly to cover how HTTP Verbs work in ServiceStack.

By default, ServiceClient.Send(), Gateway.Send() and Gateway.SendAsync() use the POST verb.

There are two methods by which you can control this behaviour.

The first is to use the verb specific methods available on the ServiceClient:

var externalDto = new ExternalDTO();
var client = new JsonServiceClient("http://myservice");

// HTTP PUT Async call
var response = await client.PutAsync(externalDto);

The second method, which is only available to the Gateway, and the one we therefore have to use, is to put the IVerb interface markers on the DTOs.

public class ExternalDTO : IGet, IReturn<ExternalDTOResponse> { }

// Gateway.Send() + IGet is an alias for Gateway.Get()
Gateway.Send(new ExternalDTO());  

The approach also helps decouple the HTTP verb specifics of any external calls from your call site and instead makes the DTO responsible for defining how it is sent.
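If marker interfaces are new to you, the idea can be sketched in a few lines. The interfaces and the `VerbResolver` below are hypothetical stand-ins that I have named differently on purpose; they only illustrate the pattern and are not ServiceStack's own types or implementation.

```csharp
using System;

// Hypothetical marker interfaces: they carry no members, only intent.
public interface IGetMarker { }
public interface IPutMarker { }
public interface IDeleteMarker { }

public static class VerbResolver
{
    // Picks the HTTP verb from the markers on the DTO; POST is the default.
    public static string For(object requestDto)
    {
        switch (requestDto)
        {
            case IGetMarker _: return "GET";
            case IPutMarker _: return "PUT";
            case IDeleteMarker _: return "DELETE";
            default: return "POST";
        }
    }
}

// Example DTOs: one marked as a GET, one left unmarked.
public class FindCustomer : IGetMarker { }
public class CreateCustomer { }
```

Here `VerbResolver.For(new FindCustomer())` gives "GET" while the unmarked `CreateCustomer` falls back to "POST", so the call site never mentions a verb at all.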

But wait, there's more...

In addition, Consul provides another piece of the infrastructure jigsaw which our plugin handles for you - service health which we will cover in our next topic.

The gateway will also select the correct format for retrieving the DTO. If your remote service only communicates in XML, it will transparently call it using XML but return you a POCO.

It will also automatically cache responses from a GET request according to the remote service's cache settings. In some cases, it will not even issue an RPC, instead returning you the DTO response straight from the cache.

Our future roadmap also includes configurable time-out, retry and cache fall-back policies.

Let's get down to brass-tacks, how much for the API..?

We think the simplicity and low-ceremony approach above is really compelling, but it doesn't come for free. There are opinionated choices we've made to allow it to work this way.

So this is where we cover the consequences of those decisions and the first one is a whopper.

We've thrown RESTful routing under a bus

Oh my!

Hiding from RESTafarians

Now we have reasons for this, which I cover next in routing. It may be possible to make this work with Consul, but I don't yet see a way to make it either robust or elegant.

DTOs MUST be globally unique.

This one is actually part of the ServiceStack guidelines anyway so we don't feel bad about this at all.

The third is another whopper which I have a whole topic devoted to later on so for now, I won't clarify further but instead lob this like a grenade into the fire-pit.

You cannot EVER make a breaking-change to a DTO

Run Away!!! <Runs away>


Instead of REST and all the great custom and fallback routing options in ServiceStack, we have chosen to use only ServiceStack's pre-defined-routes.

Together with our second consequence of globally unique DTOs, this allows the RPC routing to just work with Consul.

So let me try and explain why we've not only ignored RESTful routing, but will actively seek to prevent it being used directly in our Services.

There are a few reasons behind this but first it might help to clarify that we plan to use services internally at first, but later on expose them externally using a Gateway to be built on top of Consul.

Internally, with ServiceStack's ServiceClient and the DTOs, you already have fully typed end-to-end API calls, so you never really need to see a URI, let alone care what it is. For internal callers, this isn't so bad.

We expect that most of the internal calls will use this typed approach.

You can use custom routes, and the service-to-service calls will even use them. This is not really the problem area though.

Any non-ServiceStack client that wants to consume the services would have to go via Consul to find the right service, and Consul doesn't know a thing about your custom routes.

This affects the few internal apps or services that do not use the ServiceStack client and probably the MOST important group, the external clients.

Friends don't let friends break contracts

Hey Bob,

thank you for being a loyal customer, you mean the world to us.

Because we love you so much Bob, we're superduper excited to announce our brand new [feature] and tell you how it will change your life.

You'll literally forget your own name, that's how amazing it is!

Here is our super-secret incrementing beta code, just for our most special customers, like you Bob.

Code: 37,027,491

Thanks again Bob, you're so amazing!!!

p.s. [Feature] requires you re-write all existing integration before launch at 3pm EST tomorrow :)

$#c*$%g WHAT?!

In accessing any external resource, the last thing you want as a consumer is for that contract to change.


It's painful, it involves additional work you can't plan for, work you don't have time to do.

In HTTP, these are contracts:

// Fragile, things which could change are both 'ordered' and 'embedded'

// Fragile, change requires running multiple endpoints and causes 'churn' for clients

// predefined route *never* changes, DTO is the contract and *will not change*  

In code, these are contracts:

// Fragile, change to signature or return type, breaks clients (see WCF, WebAPI)
public string GetAccountOrders(int id, bool includeCompleted) { ... }

// message contract, any change to DTO, does not *have* to break clients
public AccountOrdersResponse Get(AccountOrders request) { ... }  

Contract stability is of paramount importance, but addenda to contracts are OK.

So clumsily put, if we ensure our DTOs are backward-compatible, we have far more stability in our contracts. Contracts that can tolerate change. Contracts that instil confidence and the trust of consumers.
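Here is a small sketch of what an addendum looks like in practice. It reuses the `AccountOrders` name from the snippet above, but the properties are my own guesses, and it uses `System.Text.Json` as a stand-in serializer rather than ServiceStack's own, so treat it as an illustration of the principle only.

```csharp
using System;
using System.Text.Json;

// Version 2 of the DTO: IncludeCompleted was added after clients had shipped.
public class AccountOrders
{
    public int Id { get; set; }

    // New, optional property; payloads that omit it get the default (false).
    public bool IncludeCompleted { get; set; }
}

public static class ContractDemo
{
    // Deserializes a JSON payload into the current version of the DTO.
    public static AccountOrders Parse(string json) =>
        JsonSerializer.Deserialize<AccountOrders>(json);
}
```

An old client that still sends `{"Id":42}` keeps working and simply gets the default for the new property, and a payload carrying a property this version doesn't know about is ignored rather than rejected. Removing or renaming `Id`, on the other hand, would be a breaking change.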

Another reason for avoiding custom routing in ServiceStack is the complexity of making it work correctly.

In what order do I add this service's routes to the routing table?

Will a fall-back or over-generous catchall route suddenly grab all other services requests?

Will the new dev/team remember to respect the guidelines?

As I mentioned previously, adding an external gateway is part of our future plans and we expect it to handle things like load-balancing, traffic shaping and SSL termination, all in one place, rather than in each service.

If in that future, we must have RESTful routing, it will be as a decoupled, globally managed concern in that gateway, carefully managing the mapping of routes to services. Even this though, by its nature, is static and prone to 'churn' in such a dynamic environment. (see schema changes in ORMs)

We are currently looking at a few options for Gateways so I'll simply mention one that stands out so far, Fabio.

It looks to have great integration with Consul and avoids the need for more complex Consul-template solutions. Another one for the roadmap.


Finally, for this (not so micro)-post we come to load-balancing, or the ability to distribute requests between multiple instances of a service.

Definitely our weakest area of the three right now, we have some plans and ideas but they are still in their infancy.

Consul provides service-to-service calls with a not-really load-balancing version of load-balancing.

It keeps track of round trip times (RTT) for its agents using network co-ordinates.

If you have multiple instances of a service available to process a DTO, our plugin will sort these by the agent RTT, giving you the most responsive.

This isn't really load-balancing, more QoS, but it is useful nonetheless and worth mentioning.
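The sorting step itself is about as simple as it sounds. A rough sketch, assuming each candidate comes back with an RTT estimate attached; the `ServiceInstance` shape and `InstanceSelector` are hypothetical and not the plugin's real types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A service instance plus the estimated round trip time to its agent.
public class ServiceInstance
{
    public string Url { get; set; }
    public TimeSpan RoundTripTime { get; set; }
}

public static class InstanceSelector
{
    // Orders candidates by ascending RTT so the most responsive comes first.
    public static List<ServiceInstance> ByResponsiveness(
        IEnumerable<ServiceInstance> candidates) =>
        candidates.OrderBy(i => i.RoundTripTime).ToList();
}
```

Note that every caller with the same view of the network will pick the same 'nearest' instance, which is exactly why this is closer to QoS than to spreading load.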

Another thing Consul gives us is a separate service catalog per datacenter. Using this ability, we could locate datacenters and their services in different geographic regions to even out global traffic loads.

For true load-balancing though, we have to look for other solutions and they lie outside of each service.

A gateway is the most obvious candidate for this and Fabio allows you to split traffic between services based on rules, useful for things like canary deployments as well as more traditional load-balancing.

In the world of microservices however, we actually have all the ingredients we need to make something ourselves if we need to.

Having a service registry in Consul with RTT, Health and performance metrics information from logging for every service end-point opens up interesting possibilities for using that data. Combined with a good automated deployment pipeline, there are possibilities for elastic scaling. I'll explore this in more detail in the deployment topic.

Wrap it up, chuck!

There is a constant tension between how much 'smarts' you put into each service and how much is centrally managed. We are trying to find a good balance.

The service discovery and registry is a fundamental part of our overall design though. I think it allows us to decouple a lot of the other parts we will need on our journey to microservices.

Parts that can be independent, composable, infrastructure-centric microservices of their own because of this design.

So at last we come to the end of part III.

There was a lot to cover here and there are parts I feel I haven't explained as well as I could, and parts I have skimmed over or left out entirely.

Definitely a couple of things to divide opinions.

If I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with the plugins on Github so don't be shy.


so without further ado...NEXT!

Let's do microservices!

next up: Service health, metrics and performance [coming soon]

00010 : Logging and Debugging


This post on microservices started innocently as a single post. They always do, don't they?

Before I knew it, it was 10k words and showing no sign of stopping, and I was advised to split it up into a series. "Nah," I said, "it'll be fine."

I was then advised again to split it at least into two, the first few thousand in one and the rest in the other. I relented this time.

Perhaps it is just the nature of software, or perhaps it is 'scope-creep' (we've all been there right!) but, I've only gone and written a monolithic microservices post!

I see the irony here, so I've decided to break it apart into 'micro-posts'; thanks @adamralph :)

Each 'micro-topic' will become a 'micro-post' of its own in our journey; so my apologies to those who wanted the full meal, your dinner will now be served as Tapas!

ServiceStack, a journey into the madness of microservices
  1. Context: the what and the why?
  2. Distributed debugging and logging
  3. Service discovery, load balancing and routing
  4. Service health, metrics and performance
  5. Configuration
  6. Documentation
  7. Versioning
  8. Security and access control
  9. Idempotency
  10. Fault-tolerance, Cascading failures
  11. Eventual consistency
  12. Caching
  13. Rate-limiting
  14. Deployment, provisioning and scaling
  15. Backups and Disaster Recovery
  16. Services Design
  17. Epilogue

Framing the problem

In a monolith world with all the great modern tooling available to you, it is easy to load a project up in your IDE, set some breakpoints, and hit 'Debug'.

You can step through your code line-by-line and inspect its state.

Or you can write integration tests to assess the state of a system and to verify that the components of the system are interacting correctly with each other.

The single-threaded process has rock-solid reliability and low latency for in-process communication.

If you call a method on a class, you aren't concerned about whether the call will reach that method or whether it will take an unacceptable amount of time to get there.

Even at the boundaries of a process, things are often binary. If the database that runs the application or a file resource is unavailable, the application will crash or display an error.

You can check application logs, the system eventlog, the coredump or WinDbg type listeners to help you find, reproduce and fix the problem.

It's all in one place.

Next comes the multi-threaded process, to which nearly all of the above also applies.

How many of you have experienced race-conditions in one of your multi-threaded applications?

I know I have.

Even with the great tooling available, they are often subtle and hard to pin down. Introduce state mutation into the equation and you now need to worry about concurrency. There are locks and semaphores to deal with.

Is that class I used even 'thread-safe'?

Time and the order of execution between threads are no longer reliable, and the side-effects of this increase the level of complexity, of both your code and of your ability to reason about run-time state.

But they still have reliable, fast in-process communication.

Now let's introduce asynchronous processing. Single-threaded and multi-threaded applications can be synchronous or asynchronous.

Now, within even a single thread, the order of execution is no longer reliable.

There is an excellent overview on the differences, but the take-away is that introducing multi-threading or asynchronous programming to your application can significantly increase the difficulty of reasoning about your system state as well as finding, reproducing and fixing bugs.

Yet they still have reliable, fast in-process communication.

Now enter the distributed system, you probably see where this is going, don't you?

You are now in the world of remote procedure calls (RPC) across processes or networks. Each remote call is at least an order of magnitude slower than a local call, and your application's performance suffers accordingly.

In our case, with ServiceStack, the RPCs use a request/response message-passing style.

Just as before, RPCs can be done either synchronously or asynchronously, but due to the performance of remote calls, using asynchronous processing is not really a choice, it's practically a must-have.

Now we have to ask ourselves - will calls be inexplicably duplicated? Will they even be invoked at all?

We have just lost cabin pressure. Communication is no longer reliable.

Down the rabbit hole we go, this stuff is hard and it gets weird, really fast.

Debugging this kind of system, which is executed piecemeal across processes and networks, just cannot be done with an IDE and access to a machine or application logs.

Trying to reconstruct it in a test environment is an exercise in futility.

In the distributed system, debugging must be done in production.

This realisation forces you to approach the design of the system differently from the very beginning; if you want to avoid creating the distributed equivalent of the Titanic, that is!

So first up, logs have to be centralised to be able to reason about your system state, find errors and be able to trace the flow of events across each node.


Now, let's get specific shall we?

For ServiceStack, I've created a plugin on Github that builds upon the Request Logger to log to Seq.


ServiceStack is fortunate to have some great people in its community and the plugin was quickly improved by a fellow member, thank you Richard.

Seq is an installable self-hosted service, with an HTTP API that is designed for log aggregation.

There is a readme available on the project, which I won't duplicate here, that covers setting up the plugin and using it in more detail, but I'll cover the basic code required to use it in your ServiceStack AppHost.

public override void Configure(Container container)
{
    // Define your seq location, add the plugin
    var settings = new SeqRequestLogsSettings("http://localhost:5341");
    Plugins.Add(new SeqRequestLogsFeature(settings));
}

That's it!

See everything!

The plugin now captures every incoming request to ServiceStack in Seq and is capable of logging every detail, not just the path but the headers, the request and response DTOs, execution timing, service exceptions and errors.

In addition, the logging detail can be modified at runtime, so when you need to debug in production, you can ramp up the level of detail logged. I'll come back to this again a few times in later posts.

So the first thing to note is that you are not storing plain text, you are storing structured data and it makes all the difference in the world.

With Seq that data is now easy to search, filter and aggregate using Seq's powerful query language and UI.

Logging in action

Having used Seq for a while now in other applications, I know how quickly it can help you identify and fix issues in your production systems. It's very easy to use and it comes with a free single-user licence. I highly recommend you try it out for yourself.


Unless your logging receiver has high-availability (which Seq at this time does not have), we have just created our first piece of critical infrastructure as a single point of failure (SPF).

When any SPF goes down, bad things can happen, so throughout the series, I'll point these out.

Seq does however have a forwarder. This works against a local loopback address to buffer requests for forwarding onto your log server.

This helps with network unreliability by eliminating remote calls and the performance penalties that are associated with them.

An alternative is to use a UDP broadcast style of logging, like statsd. It may serve your circumstances better.

Our logging uses an HTTP async 'fire-and-forget' style, so the network performance cost is reduced, but if your service, network or Seq fails and you do not persist ServiceStack logging locally to disk or use a forwarder, you lose potentially valuable log data.

We could improve this plugin in future (PRs welcome!) to include a more resilient, guaranteed delivery to survive network outages; but for now we are not too concerned about this.

Our rationale will become more apparent as our design is revealed.


The second part, distributed debugging, is a much trickier problem.

In distributed systems where there are many moving parts involved, the ability to reconstruct a timeline of events and state-mutations across your services is essential to being able to effectively debug and reason about the system's state and overall health.

Enter the correlation identifier, our next plugin on the road to microservices


Again this captures every incoming ServiceStack request and adds an Id to the request header. The service gateway's internal and external calls will pass this identifier on so that you can identify service to service calls from their point of origin in your logs.
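The reuse-or-create behaviour can be sketched like this. It is a simplified illustration rather than the plugin's code: headers are modelled as a plain dictionary, and the header name and the `Correlation` helper are assumptions of mine.

```csharp
using System;
using System.Collections.Generic;

public static class Correlation
{
    // The header name is an assumption made for this sketch.
    public const string HeaderName = "X-Correlation-Id";

    // Reuses the caller's id when present, otherwise starts a new trace
    // and stamps the new id onto the incoming request's headers.
    public static string GetOrCreate(IDictionary<string, string> requestHeaders)
    {
        if (requestHeaders.TryGetValue(HeaderName, out var id) && !string.IsNullOrEmpty(id))
            return id;

        id = Guid.NewGuid().ToString("N");
        requestHeaders[HeaderName] = id;
        return id;
    }

    // Copies the id onto an outgoing call so downstream logs share it.
    public static void Propagate(string id, IDictionary<string, string> outgoingHeaders) =>
        outgoingHeaders[HeaderName] = id;
}
```

The important property is that only the first service in a chain ever generates an id; everyone downstream passes along what they were given, so a single search in your logs pulls back the whole journey.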


It is very early days for this plugin though. We need to refine it to be able to reconstruct a full map of service-to-service calls at each node, but for now, that is on our future road-map.

Being able to map out the calls is important for a couple of reasons which are worth mentioning at this stage though.

re'Curse' of the infinite loop

As your services become distributed, it is easy to create recursive calls, recursive calls, recursive calls, recu... :(

Recursive Call

Having a good timeout policy on all remote calls can help with this, but adding self-referencing checks to the correlation plugin would go further by cancelling such requests outright.

Am I already in the call-map of the thing calling me?

Yes, byeee
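That check might look something like the sketch below. It assumes the chain of callers travels with the request as a comma-separated list of service names; the format and the `CallMap` helper are hypothetical, since the plugin does not do this yet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CallMap
{
    // Parses a comma-separated chain of service names carried with the request.
    public static List<string> Parse(string header) =>
        string.IsNullOrWhiteSpace(header)
            ? new List<string>()
            : header.Split(',').Select(s => s.Trim()).Where(s => s.Length > 0).ToList();

    // True when this service already appears in the chain of callers.
    public static bool IsRecursive(string header, string thisService) =>
        Parse(header).Contains(thisService, StringComparer.OrdinalIgnoreCase);

    // Appends this service to the chain before making an outgoing call.
    public static string Append(string header, string thisService) =>
        string.Join(",", Parse(header).Append(thisService));
}
```

A service would call `IsRecursive` on the way in and refuse the request if it finds itself in the chain, then call `Append` before making any outgoing call of its own.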

the > never > ending > chain > of > calls > . > . >

Making remote calls easy and transparent to your services is really powerful, but it is also just as easy to abuse that power.

Long call chain

With each network hop, you increase the likelihood of timeouts, and the responsiveness of your APIs and their consumers suffers.

Setting limits on the length of call-chains can force distributed teams to be judicious in their use of dependencies (see left-pad!) and instead foster collaboration between them to scale horizontally and keep the stack thin.

Further reading

There is some great research material if you are interested in reading more on this topic.

Google has a whitepaper on Dapper, a large-scale distributed systems tracing infrastructure, and there are a few implementations out in the wild to be found.

I am also looking at Vector Clocks of which you can find a C# implementation by the brilliant @kellabyte, but the fixed length limitations of this algorithm have led me towards Interval tree clocks as another possibility.

Another task for the future, is to capture requests at a lower level, like a proxy, so that tracing can work beyond ServiceStack calls to include data stores, files and external infrastructure resources.

Finally, there are also very interesting possibilities emerging from Joyent for run-time inspection and tracing using DTrace and containers, which are worth keeping an eye on too.

That concludes part II, if I've missed anything, or you have your own great ideas or projects, let me know in the comments.

Also, we'd love others in the community to get involved with the plugins in Github so don't be shy.


OK, enough of that for now, NEXT!

Let's do microservices!

next up: Service discovery, load balancing and routing