They represent an image as a tree where leafs are pixels and nodes represent clusters of those pixels. They train by regressing for some possible segmented region $r$ on the following function for every segmentation example and ground truth:
$$S(r)=\frac{\\#(g) - \\#(r)}{\max(\\#(r), \\#(g)))}$$
Here $\\#(g)$ is the number of pixels in the ground truth and $\\#(r)$ is the number of pixels in the example segmentation. What is not explained here is what other information is used because it cannot simple be pixel counts. This function is used to rank the nodes in every path from the root to the leafs in Figure (a).
The idea for the segmentation is that there is some set of nodes such that you can draw a line shown in Figure (b) which is equivalent to selecting a segmentation. The paper goes on to compute this using a dynamic programming solution based on the fact that the same pixel segmentations will be considered multiple times.
![](http://i.imgur.com/FEky9dK.png)
I think the idea is great but the initial idea for the regression is unclear.