This paper presents some striking results. The goal is to learn to count by end-to-end training: the network input is an image and the output is a count of the objects inside it. The authors do not train directly on the locations of the objects in the image.
The reason for avoiding direct training is that location-labeled data is expensive. A surrogate objective, such as the count of items in the image, is much cheaper to annotate, and it is also the quantity the system is ultimately supposed to produce. This paper claims that training this way is possible. They discuss experiments on two datasets: one of MNIST digits placed in an image, and one using the UCSD Pedestrian Database.
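The surrogate-objective idea amounts to regressing the count directly. A minimal sketch, assuming a squared-error loss on the count (the excerpts quoted in this review do not state the paper's exact loss) and made-up network outputs:

```python
import numpy as np

# Hypothetical sketch: the training signal comes from counts only,
# never from object locations. `predicted` would come from the CNN's
# final layer; here the values are made up.
true_counts = np.array([3.0, 0.0, 5.0, 2.0])   # one count label per image
predicted   = np.array([2.5, 0.4, 5.2, 1.8])   # network outputs

# Squared-error surrogate loss on the count -- far cheaper to label
# than per-object locations or bounding boxes.
loss = np.mean((predicted - true_counts) ** 2)
```

The point of the sketch is only that the label per image is a single scalar, so annotation cost does not grow with the number of objects.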
The network description seems general, and the authors report no special constraints on the design: `"We consider networks of two or more convolutional layers followed by one or more fully connected layers. Each convolutional layer consist of several elements: a set of convolutional filters, ReLU non-linearities, max pooling layers and normalization layers."` and `"We use a five layers architecture CNN with two convolutional layers followed by three fully connected layers"`. They provide these two tables for their designs (I read each Conv entry as number of filters x kernel height x kernel width, and each FC entry as a unit count):
$$\begin{array}{c|c|c|c}
Conv1 & Conv2 & FC1 & FC2 \\ \hline
10\text{x}15\text{x}15 & 10\text{x}3\text{x}3 & 32 & 6 \\
\text{x2 pool} & \text{x2 pool} & & \\ \hline
\end{array}\\
\text{CNN architecture for digit counting}$$
$$
\begin{array}{c|c|c|c|c}
Conv1 & Conv2 & FC1 & FC2 & FC3 \\ \hline
8\text{x}9\text{x}9 & 8\text{x}5\text{x}5 & 128 & 128 & 25 \\
\text{x2 pool} & \text{x2 pool} & & & \\ \hline
\end{array}\\
\text{CNN architecture for pedestrian counting}$$
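To make the tables concrete, here is a small sketch that traces the spatial dimensions through the convolutional stack, assuming valid (unpadded) convolutions with stride 1 and non-overlapping 2x2 pooling. The 100x100 input size is my assumption; the excerpts above do not state it:

```python
def trace_shapes(input_size, conv_filters):
    """Return per-layer (channels, height, width) for a stack of valid
    convolutions, each followed by a 2x2 max pool.

    conv_filters: list of (num_filters, kernel_size) per conv layer.
    """
    sizes = []
    h = input_size
    for n_filters, k in conv_filters:
        h = h - k + 1      # valid convolution shrinks by k - 1
        h = h // 2         # 2x2 non-overlapping max pool halves the size
        sizes.append((n_filters, h, h))
    return sizes

# Digit-counting CNN: Conv1 = 10 filters of 15x15, Conv2 = 10 of 3x3
digits = trace_shapes(100, [(10, 15), (10, 3)])
# Pedestrian-counting CNN: Conv1 = 8 filters of 9x9, Conv2 = 8 of 5x5
people = trace_shapes(100, [(8, 9), (8, 5)])
```

Under these assumptions the digit network's last conv output is 10x20x20 (4000 values flattened into FC1's 32 units), and the pedestrian network's is 8x21x21.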
They state that they use a method based on hypercolumns \cite{1411.5752}, but the description is not clear at all: `"Starting with the hypercolumn representation on the last layer we cluster the resulting hypercolumns into a set of prototypes using an online k-means algorithm. Then, a MIL approach with positive and negative instances with the concept of interest is used."`
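Only the "online k-means" part of that quote is standard enough to pin down: each incoming vector pulls its nearest prototype toward it with a step of 1/count, i.e. each prototype is a running mean of the vectors assigned to it. A generic sketch of that update (not the paper's code; the toy 2-D vectors and fixed initial prototypes stand in for actual hypercolumn features):

```python
import numpy as np

def online_kmeans(points, init_centroids):
    """Minimal online (sequential) k-means: each point nudges its nearest
    prototype toward it by 1/count, so each prototype tracks the running
    mean of the points assigned to it so far."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    counts = np.zeros(len(centroids))
    for p in points:
        j = np.argmin(np.linalg.norm(centroids - p, axis=1))  # nearest prototype
        counts[j] += 1
        centroids[j] += (p - centroids[j]) / counts[j]        # running-mean step
    return centroids

# Toy stand-in for hypercolumn feature vectors: two well-separated blobs.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                 rng.normal(5.0, 0.1, (50, 2))])
protos = online_kmeans(pts, init_centroids=[[0.0, 0.0], [6.0, 6.0]])
```

How the resulting prototypes feed the MIL step with "positive and negative instances" is exactly the part the paper leaves unexplained.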
![](https://i.imgur.com/x2q3E9Y.png)
Interesting work, but I wish it were a longer paper with more details. As it stands, it does not give enough information to reproduce the results.