A single bad item can reduce the perceived quality of a recommendation list. In some cases this is particularly undesirable, for example when horror movies are recommended to children. The authors argue that this happens when missing-not-at-random (MNAR) data is handled improperly, so that separate groups of users and items overlap during dimensionality reduction and the computation of embeddings. Folding is a metric that measures the severity of this effect in a recommendation model.
To calculate folding, we first introduce the notion of relatedness between user $i$ and item $j$, which captures the likelihood of an interaction between $i$ and $j$ regardless of the rating. In a way, this is a form of smoothing of the interaction matrix. There are different ways to calculate relatedness; the authors propose either solving a matrix factorization task using weighted alternating least squares (WALS) with a high weight for missing interactions, or using SVD.
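As a rough illustration of the SVD option, the sketch below binarizes the interaction matrix (so the rating value is ignored) and takes a low-rank reconstruction as the relatedness matrix. This is a minimal sketch rather than the authors' implementation: the function name `relatedness_svd`, the `rank` parameter, and the convention that a missing interaction is stored as `0` are all assumptions made for the example.

```python
import numpy as np


def relatedness_svd(interactions: np.ndarray, rank: int) -> np.ndarray:
    """Low-rank smoothing of a binary interaction matrix (hypothetical helper).

    Assumes `interactions` is a dense m x n array in which 0 means
    "no observed interaction" and any non-zero value is a rating.
    """
    # Keep only the fact that an interaction happened, not its rating.
    binary = (interactions != 0).astype(float)

    # Thin SVD; NumPy returns singular values in descending order.
    u, s, vt = np.linalg.svd(binary, full_matrices=False)

    # Keep the top `rank` components and rebuild the smoothed matrix.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]
```

With two disjoint user-item groups and `rank=2`, an unobserved pair inside a group receives a positive relatedness value, while pairs that cross groups stay close to zero.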
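Given the predicted score matrix $S$ and the relatedness matrix $R$, with $S, R \in \mathbb{R}^{m \times n}$, folding is calculated as the average, over all $mn$ user-item pairs, of the amount by which the predicted score exceeds the relatedness: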
$$\text{Folding} = \frac{1}{mn} \sum_{i,j} \max(0, s_{ij} - r_{ij})$$
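The formula translates directly into a few lines of NumPy. The sketch below is illustrative only: the function name `folding` is hypothetical, and it assumes that the score and relatedness matrices are dense arrays of the same shape whose values are already on a comparable scale.

```python
import numpy as np


def folding(scores: np.ndarray, relatedness: np.ndarray) -> float:
    """Mean positive excess of predicted score over relatedness (hypothetical helper)."""
    if scores.shape != relatedness.shape:
        raise ValueError("scores and relatedness must have the same shape")

    # max(0, s_ij - r_ij) for every user-item pair, then the 1/(mn) average.
    excess = np.maximum(0.0, scores - relatedness)
    return float(excess.mean())


# Toy example with 2 users and 2 items.
S = np.array([[1.0, 0.5], [0.75, 0.25]])
R = np.array([[1.0, 0.0], [0.25, 0.5]])
print(folding(S, R))  # 0.25
```

In the toy example only two pairs have a score above their relatedness, each by 0.5, so the sum is 1.0 and the average over the four pairs is 0.25. Pairs where relatedness exceeds the score contribute nothing.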