Hierachical token bucket

Martin Devera aka devik (devik@cdi.cz)

Abstract

This paper discusses another implementation of the CBQ packet scheduler [1]. It focuses mainly on scheduling precision and ease of implementation. I'll show how to generalize the "bounded" parameter, replace the rate estimator by a token bucket one and introduce a new algorithm which efectively implements formal sharing (as described in [1] section 3). I assume that the reader is familiar with CBQ theory.

Formal link sharing guidelines revisited

The goals described in [1] remain the same. Let's refresh them:

Each interior or leaf class should recieve roughly its allocated link-sharing bandwidth over appropriate time intervals, given sufficient demand.
If all classes with sufficient demand have recieved at least their allocated link-sharing bandwidth, the distribution of any excess bandwidth should not be arbitrary but should follow some set of reasonable guidelines.

As in the original paper the second goal is not discussed here. This behavior is affected by used scheduler algorithm. Most CBQ implementations use deficit round-robin (DRR) [4] in which excess bandwidth is distributed according to the weight parameters specified. I will address related issues later in this paper.
In [1], formal guidelines are described which implement the goals above. They are repeated here for your convenience:

A class can continue unregulated if one of the following conditions hold:

The class is not overlimit, OR
The class has not-overlimit ancestor at level i, and there are no unsatisfied classes in the link-sharing structure at levels lower than i.

(see [1] for terms' declarations).

The second rule ensures that we don't borrow bandwidth from parent class while there is child which itself can send without borrowing.

Iterative approach

I argue that following algorithm conforms to the formal guidelines. Let's call it iterative approach. First we set L to 1 and use this rule to decide whether the class can send:

The leaf can continue unregulated only if there exists no overlimit ancestor class at level L.

In the case where all leaves are overlimit (step 1 didn't yield a packet) we increment L and try again until we either get packet or L is bigger than the level of root class.
It is not hard to prove correctness of algorithm above. For L=1 the algorithm behaves exactly as first rule of formal guidelines. To agree with the second rule we have to assure that we don't allow a leaf to borrow from class at level N if other leaf exists which has demand and can borrow from class at lower level. In iterative approach we always test lower levels before higher ones thus we satisfy both formal sharing rules.

The algorithm can be implemented in faster way. During the first step (with L=1) we can remember the lowest level we can send at and use it directly in the second pass (if the first one didn't yield us a packet). I call this two phase approach.

Comparison between top-level and iterative approaches

The iterative approach implements formal sharing rules exactly while top-level (proposed in [1]) uses heuristic to approximate them. Hence the iterative approach quarantees you the best possible precision.
At first sight it is clear that the iterative approach is more expensive. Top-level can determine class status immediately while two phase variant can need up to one full scan over all active leaves. The complexity increases with amount of borrowing done. However my tests showed that difference in the real case is not so critical.

Precise top-level

While the algorithm described in the previous paragraphs is simple to implement it has still the worst case complexity O(N) where N is number of active leaves. Even as complex algorithm as WF2Q [5] is faster (O(log N)).
Top level heuristic as used in [1] has complexity O(1) but it is inacurate especially if more flows are active.
Here I'll outline how to change top level heuristics to comply with the formal guidelines as precisely as possible and how to do it with the worst case complexity O(log N).

The key idea is to maintain one sorted list of classes per level. Each list contains only active classes from single level and these are sorted by time which they become underlimit at. When we need to know top level then we can simply look heads of the sorted lists up starting from lowest level. The first level we find underlimit class at becomes new top level.
The ceiling will make it a bit more complicated but still it is not hard to prove correctness of the idea above.

Note that this idea was not tested and should be treated as such.

Estimators

In [1] the authors state that overlimit status of class is determined by estimator and they then discuss idle time based estimator used in prototype implementation. The same estimator is used in all CBQ implementation I'm aware of.
I see two problems with idle based estimator. At first it is relatively hard to implement without having precise clock or without knowing hardware speed of network device (for example variable rate modems or wireless adapters). The second "problem" (minor one) is that it is hard to interpret idle estimator parameters (except for rate of course).

I replaced the estimator by a classical token bucket one [2]. It is simple to setup as the most of network administrators have knowledge about it. It uses only two parameters (rate and burst) of which both have unambiguous physical meaning.
In my feeling the major contribution is fact that token bucket algorithm is simple to implement on systems with finite nonzero granularity of timer and it doesn't depent on hardware rate of network device.

The token bucket algorithm has to remember not only positive tokens (burst) but also negative one at least for non leaf classes. It is because parent class can be overlimit while one of its children is underlimit. Sending from the child causes overlimit parent to be yet more overlimit thus forcing tokens to be negative.
To assure that the overal goal of scheduler is still satisfied, inner classes must allow for negative tokens up to sum of bursts of all descendant classes. This limit can however be set to arbitrary large value, say 2 minutes because it is unlikely that user will use bursts bigger than several seconds. Bursts are in this case specified in the time domain, whereas they are usually expressed in bytes. These two views can be converted bidirectionaly using the token bucket's rate).

Rate ceiling

The original paper distinguished at least between bounded and not bounded class. Bounded class can't borrow bandwidth from its parent. I generalized this a bit and blended the border between them. Each class has a second estimator attached to it which I call "ceil". Each class can borrow as long as it is under ceil. Setting ceil and rate estimators to the same value is equivalent to a bounded class. Setting ceil to "infinity" (or other high value, like the link's speed) is equal to a non-bounded class.

Probably you see the advantages of this approach. One can guarantee 1Mbit to an agency and allow it to use up to 2Mbit (ceil) if the link is idle. From the implementation side it is trivial. Just another estimator test in underlimit parent lookup loop.

General scheduler problems

There is one problem with using DRR (used in [1]) as general scheduler. When class oscilates between being and not being regulated it is often skipped in DRR loop and it loses it's quantum. It then leads into inadequate sharing of excess bandwidth. I believe that all CBQ implementations suffer from this problem but it is not so visible because idle estimator uses relatively large time constant. In HTB is it more visible because its token bucket resolution is very close to the DRR's round time.

The simple solution is to change DRR to use separate deficit and active class pointer for each borrowing level. Because active pointer is private for given borrowing level then class borrowing from other level (and possibly forcing us to be regulated) can't mess our DRR position.
Tests in my implementation shows rate ratios improvement ranging from 5 to 50% when implementing this simple solution.

Implementation

I tested all parts described here in Linux based implementation [3]. There are no graph or other results in this paper because they corresponds to those found in [1].

Future directions

There is possibility to cache last scheduler decision. We could cache it as long as we have enough DRR tokens. It makes possible to skip whole CBQ algorithm for several packets and account them to estimators when first non cached packet is to be dequeued. It could speed things up by large factor.

Use of clock based scheduler like WF2Q with link sharing could be helpful to better isolate interactive traffic and provide better delay bounds.

Acknowledments

I'd like to say thanks to numerous people incuding Jamal Hadi Salim for helpful comments, Bert Hubert who helped me with this paper and many others.

[1] Link-sharing and Resource Management models for Packet Networks, Sally Floyd and Van Jacobson, 1995
[2] Token Bucket ???? - can't remember the paper :(
[3] HTB for Linux, http://luxik.cdi.cz/~devik/qos/htb
[4] Efficient Fair Queuing using Deficit Round Robin, M. Shreedhar and G. Varghese
[5] WF2Q: Worst-case Fair Weighted Fair Queuing, J.C.R. Bennet and Hui Zhang