Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer
arXiv e-Print archive - 2021 via Local arXiv
First published: 2023/06/10
Abstract: While the field of style transfer (ST) has been growing rapidly, it has been
hampered by a lack of standardized practices for automatic evaluation. In this
paper, we evaluate leading ST automatic metrics on the oft-researched task of
formality style transfer. Unlike previous evaluations, which focus solely on
English, we expand our focus to Brazilian-Portuguese, French, and Italian,
making this work the first multilingual evaluation of metrics in ST. We outline
best practices for automatic evaluation in (formality) style transfer and
identify several models that correlate well with human judgments and are robust
across languages. We hope that this work will help accelerate development in
ST, where human evaluation is often challenging to collect.
Evaluation of evaluation metrics for style transfer in multiple languages.
@ebriakou, @swetaagrawal20, @Tetreault_NLP, @MarineCarpuat
They end up with the following best practices:
Capture formality - XLM-R with regression, not classification
Preservation - chrF, not BLEU
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Cross-lingual transfer - rely on zero-shot transfer, not machine translation
Why chrF and not chrF++, I wonder?
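For context on that question: chrF scores character n-gram overlap between a system output and a reference, and chrF++ additionally averages in word unigrams and bigrams. Below is a minimal sketch of that idea, not the implementation used in the paper or in any particular library. The function names (`ngrams`, `overlap`, `chrf`) and the choice to average precision and recall over n-gram orders before combining them are assumptions made for illustration, so scores may differ from those of standard tools.

```python
from collections import Counter


def ngrams(items, n):
    """Count the n-grams of a sequence (a string or a list of tokens)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def overlap(hyp_items, ref_items, n):
    """Return (precision, recall) for one n-gram order, or None if undefined."""
    hyp, ref = ngrams(hyp_items, n), ngrams(ref_items, n)
    if not hyp or not ref:
        return None  # one side is too short to have n-grams of this order
    matched = sum((hyp & ref).values())  # clipped n-gram matches
    return matched / sum(hyp.values()), matched / sum(ref.values())


def chrf(hypothesis, reference, char_order=6, word_order=0, beta=2.0):
    """Simplified chrF; word_order=2 gives a chrF++-style score (illustrative)."""
    # Character n-grams are taken over the text with whitespace removed.
    hyp_chars = "".join(hypothesis.split())
    ref_chars = "".join(reference.split())
    stats = [overlap(hyp_chars, ref_chars, n) for n in range(1, char_order + 1)]
    # Optional word n-grams, the extra ingredient of chrF++.
    hyp_words, ref_words = hypothesis.split(), reference.split()
    stats += [overlap(hyp_words, ref_words, n) for n in range(1, word_order + 1)]
    stats = [s for s in stats if s is not None]
    if not stats:
        return 0.0
    precision = sum(p for p, _ in stats) / len(stats)
    recall = sum(r for _, r in stats) / len(stats)
    if precision + recall == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


if __name__ == "__main__":
    hyp = "could you help me"
    ref = "could you please help me"
    print("chrF-style score:  ", round(chrf(hyp, ref), 3))
    print("chrF++-style score:", round(chrf(hyp, ref, word_order=2), 3))
```

With `word_order=0` only character n-grams count, which keeps the score less dependent on tokenization. Setting `word_order=2` adds sensitivity to whole-word matches and local word order.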