First published: 2024/10/13
Abstract: We present ComSum, a data set of 7 million commit messages for text
summarization. When documenting commits, software code changes, both a message
and its summary are posted. We gather and filter those to curate developers'
work summarization data set. Along with its growing size, practicality and
challenging language domain, the data set benefits from the living field of
empirical software engineering. As commits follow a typology, we propose to not
only evaluate outputs by Rouge, but by their meaning preservation.
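One way to read "meaning preservation" is: if a commit message carries some label (say, it describes a bug fix), a good summary should carry the same label. The sketch below measures that with a toy keyword labeler. The regular expression, the function names `is_bug_fix` and `label_preservation`, and the example texts are all assumptions made for illustration; they are not the classifiers or the code used by ComSum.

```python
import re

# Hypothetical stand-in for a commit-type classifier (NOT the one used by ComSum):
# a message counts as a "bug fix" if it contains one of a few keywords.
_BUG_FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bugs?|errors?|crash(es)?)\b", re.IGNORECASE)


def is_bug_fix(text: str) -> bool:
    """Return True if the text looks like it describes a bug fix."""
    return bool(_BUG_FIX_PATTERN.search(text))


def label_preservation(messages, summaries, labeler=is_bug_fix):
    """Among messages that carry the label, return the fraction whose summary carries it too."""
    kept = 0
    total = 0
    for message, summary in zip(messages, summaries):
        if labeler(message):
            total += 1
            if labeler(summary):
                kept += 1
    return kept / total if total else float("nan")


# Made-up example data: full commit messages and candidate summaries.
messages = [
    "Fix crash when the config file is empty. The parser assumed at least one key.",
    "Add a dark theme option to the settings page.",
    "Fixed bug in date parsing for leap years.",
]
summaries = [
    "Fix crash on empty config",
    "Add dark theme",
    "Update date parsing",
]

print(label_preservation(messages, summaries))  # 0.5
```

Two of the three messages are labeled as bug fixes, and only one of their summaries keeps that label, hence 0.5. A score like this can be reported next to ROUGE; any labeler with the same signature can be swapped in.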
Huge commit summarization dataset
The dataset is filtered from a large pool of open source projects, keeping only those with high-quality committing habits
(e.g. large, active projects whose commit messages are of significant length).
We present ways to evaluate whether the meaning was preserved in summarization, so you can go beyond ROUGE.
We provide a strict split that keeps about a thousand repositories entirely out of the training set, so you can evaluate in-domain and out-of-domain performance, or simply be sure your results are clean.
If you ever want an even larger dataset, follow the same procedure with more repositories (we took only those active in 2020; you can add ones that are no longer active or that became active only later).
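If you extend the dataset with more repositories, you may want to rebuild a repository-level split so that no repository appears on both sides. The sketch below is one minimal way to do that by hashing the repository name. The record field `repo`, the function names, and the 10% held-out fraction are assumptions made for illustration; this is not the procedure or the schema that ComSum itself uses.

```python
import hashlib


def repo_bucket(repo_name: str, held_out_fraction: float = 0.1) -> str:
    """Deterministically assign a whole repository to 'train' or 'held_out'."""
    digest = hashlib.sha256(repo_name.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    position = int(digest[:8], 16) / 0x100000000
    return "held_out" if position < held_out_fraction else "train"


def split_by_repo(records, held_out_fraction: float = 0.1):
    """Split commit records so every repository lands entirely in one split."""
    splits = {"train": [], "held_out": []}
    for record in records:
        # 'repo' is a hypothetical field name; adapt it to your own data.
        bucket = repo_bucket(record["repo"], held_out_fraction)
        splits[bucket].append(record)
    return splits


# Made-up example records.
records = [
    {"repo": "org_a/project_1", "message": "Fix typo in README"},
    {"repo": "org_a/project_1", "message": "Add CI configuration"},
    {"repo": "org_b/project_2", "message": "Refactor parser module"},
    {"repo": "org_c/project_3", "message": "Bump dependency versions"},
]

splits = split_by_repo(records)
print({name: len(items) for name, items in splits.items()})
```

Hashing the repository name rather than sampling commits at random keeps the assignment stable as new commits arrive, and all commits from one repository stay together.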
Dataset in https://figshare.com/articles/dataset/CumSum_data_set/14711370
Code is found in https://github.com/evidencebp/comsum
Paper in https://arxiv.org/pdf/2108.10763.pdf