This paper presents improvements to a system for large-scale learning known as the "parameter server". The parameter server is designed to perform reliable distributed machine learning in large-scale industrial systems (thousands of nodes). The architecture is a bipartite graph composed of "servers" and "workers": workers compute gradients on subsets of the training instances, while servers aggregate the workers' gradients, update the shared parameter vector, and redistribute it to the workers for the next iteration. Communication is asynchronous, and a flexible consistency model allows trading off convergence speed against accuracy. The optimization problem is solved with a modified proximal gradient method in which only a block of coordinates is updated at each iteration.

Results are reported on an ad-click prediction dataset with $O(10^{11})$ instances as well as features, both in terms of the algorithm's convergence time and the average time spent per worker. Both are roughly half of the values obtained with the previous generation of the parameter server (referred to as "B" in the paper). Convergence takes roughly one hour on 1000 machines (800 workers and 200 servers), each with 16 cores, 192 GB of RAM, and a 10 Gb Ethernet connection, with other jobs running concurrently on the cluster. The authors claim that a comparison against other algorithms was not possible because no other open-source solution operates at this scale. In the supplementary material they do compare their system with Shotgun \cite{conf/icml/BradleyKBG11}, obtaining roughly 4x faster convergence and a similar value of the objective function at convergence.
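For reference, a generic block proximal gradient step, written here for an $\ell_1$ regularizer (a natural choice for sparse ad-click models; the paper's actual update includes further refinements not summarized here), takes the form
\[
w_b \;\leftarrow\; \operatorname{prox}_{\gamma\lambda\|\cdot\|_1}\!\big(w_b - \gamma\,[\nabla \ell(w)]_b\big),
\qquad
\operatorname{prox}_{\eta\|\cdot\|_1}(u)_j = \operatorname{sign}(u_j)\,\max\!\big(|u_j|-\eta,\,0\big),
\]
where $b$ is the coordinate block selected at the current iteration, $\gamma$ is the step size, $\ell$ is the smooth loss aggregated over all workers' data, and $\lambda$ is the regularization weight.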
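The worker/server data flow and the bounded-delay consistency can be sketched in a few dozen lines. The following single-process simulation is my own minimal sketch, not the paper's implementation: the class names (Server, Worker), the refresh interval TAU, the fixed step size, and the synthetic logistic-regression data are all illustrative assumptions; the real system runs workers and servers on separate machines with asynchronous communication and fault tolerance.

\begin{verbatim}
import numpy as np

def soft_threshold(u, eta):
    # Proximal operator of eta * ||.||_1 (soft-thresholding).
    return np.sign(u) * np.maximum(np.abs(u) - eta, 0.0)

class Server:
    # Holds the shared weight vector and applies block proximal updates.
    def __init__(self, dim, step=0.1, lam=1e-3):
        self.w, self.step, self.lam = np.zeros(dim), step, lam

    def update_block(self, block, grad_block):
        z = self.w[block] - self.step * grad_block
        self.w[block] = soft_threshold(z, self.step * self.lam)

class Worker:
    # Owns one shard of the data; computes logistic-loss gradients on a block.
    def __init__(self, X, y):
        self.X, self.y = X, y

    def gradient_block(self, w, block):
        margin = self.y * (self.X @ w)
        coef = -self.y / (1.0 + np.exp(margin))   # d(loss)/d(margin)
        return self.X[:, block].T @ coef / len(self.y)

rng = np.random.default_rng(0)
n, d, n_workers, block_size, TAU = 2000, 100, 4, 20, 2
X = rng.normal(size=(n, d))
w_true = np.where(rng.random(d) < 0.1, rng.normal(size=d), 0.0)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

server = Server(d)
workers = [Worker(X[k::n_workers], y[k::n_workers]) for k in range(n_workers)]
pulled = {k: server.w.copy() for k in range(n_workers)}  # workers' local copies

for t in range(200):
    block = rng.choice(d, size=block_size, replace=False)  # coordinate block
    grads = []
    for k, wk in enumerate(workers):
        # Bounded-delay consistency: a worker refreshes ("pulls") the shared
        # parameters only every TAU iterations, so its gradients may be
        # computed on slightly stale values.
        if t % TAU == 0:
            pulled[k] = server.w.copy()
        grads.append(wk.gradient_block(pulled[k], block))
    server.update_block(block, np.mean(grads, axis=0))  # aggregate and update

print("nonzero coordinates:", int(np.count_nonzero(server.w)), "of", d)
\end{verbatim}

Setting TAU = 1 mimics fully synchronous updates, while larger values of TAU emulate the looser consistency that trades per-iteration accuracy for reduced communication, which is the trade-off the review refers to.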