The paper addresses the problem of finding a policy with a high expected return and a bounded variance, considering both the discounted and the average reward settings. The authors propose to formulate this problem as a constrained optimization problem and to estimate the gradient of the Lagrangian dual function from samples. This gradient is composed of the gradient of the expected return and the gradient of the expected squared return, and both gradients need to be estimated in every state. The authors use linear function approximation to generalize the gradient estimates to states that were not encountered in the samples. To evaluate the gradients at particular states, they use a stochastic perturbation approach, sampling two trajectories, one with policy parameters theta and another with perturbed parameters theta+beta, where beta is a random perturbation. The policy parameters are then updated in an actor-critic scheme. The authors prove that the proposed optimization method converges to a local optimum. Numerical experiments on a traffic light control problem show that the proposed technique finds a policy whose expected return is slightly worse than that of the risk-neutral optimal solution, but whose variance is significantly lower.
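
For context, here is a minimal sketch of such a mean-variance formulation and its Lagrangian; the notation (R^theta for the return, alpha for the variance bound, lambda for the multiplier) is mine and may not match the authors' exact setup:

    max_theta  E[R^theta]    subject to    Var(R^theta) <= alpha,
    L(theta, lambda) = -E[R^theta] + lambda * (Var(R^theta) - alpha).

Since Var(R^theta) = E[(R^theta)^2] - (E[R^theta])^2, the gradient of L with respect to theta decomposes into the gradient of the expected return and the gradient of the expected squared return, which is why both quantities have to be estimated.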