[link]
Model combination / ensembling: Average ensembling is practical - but naive. Combining with each network's strengths in mind works much better! Moreover, let's make the networks diverse so that they have different strengths. Wenjuan Han & Hwee Tou Ng (no twitters?) #enough2skim #NLProc

The basic idea is quite simple: given some models, why would we want the plain average? We want to rely on each one (or group) when it is more likely to be the correct one. This was actually introduced in our previous work (as admitted by the authors): aclanthology.org/W19-4414.pdf

The paper's additions:
1. Given a set of black-box models, we can train at least one of them to be different from the rest with RL.
2. We can use more sophisticated NNs to combine the outputs.
3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).

Results are very strong. Especially nice is that they show that the diversity training indeed helps.

My criticism: the comparisons are always to SoTA, which is meaningless here. The authors propose different parts (the diversity training, the combination network, and the combined models), but it is unclear whether plain ensembling after the diversity training would be preferable to theirs or not. Similarly, they compare to Kantor et al., but Kantor et al. provided a combination method - why not compare on the same models, or combine the diversity-trained models with Kantor's method?

To conclude, I really like the direction; ensembling is a very practical tool that for some reason has not been improved in a long time. |
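A minimal sketch of the general gating idea (relying on each model where it is more likely to be right, rather than averaging). This is not the paper's architecture; the class, layer sizes, and tensor shapes are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedEnsemble(nn.Module):
    """Combine frozen base models with example-dependent weights
    instead of a uniform average (illustrative sketch only)."""

    def __init__(self, base_models, feature_dim):
        super().__init__()
        self.base_models = base_models          # frozen, black-box models
        self.gate = nn.Sequential(              # tiny gating network (assumed sizes)
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, len(base_models)),
        )

    def forward(self, inputs, features):
        # Each base model is assumed to return a probability distribution (B, V).
        with torch.no_grad():
            probs = torch.stack([m(inputs) for m in self.base_models], dim=1)  # (B, M, V)
        weights = torch.softmax(self.gate(features), dim=-1)                   # (B, M)
        # Weighted mixture; uniform weights would recover plain average ensembling.
        return (weights.unsqueeze(-1) * probs).sum(dim=1)                      # (B, V)
```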
[link]
Evaluation of evaluation for style transfer in multiple languages. @ebriakou, @swetaagrawal20, @Tetreault_NLP, @MarineCarpuat arxiv.org/pdf/2110.10668…

They end up with the following best practices:
Capturing formality - XLM-R with regression, not classification
Preservation - chrF, not BLEU
Fluency - XLM-R, but there is room for improvement
System ranking - XLM-R and chrF
Crosslingual transfer - rely on zero-shot, not machine translation

Why chrF and not chrF++, I wonder? |
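For the preservation metric, a quick way to try both variants is sacrebleu's chrF implementation; chrF++ is the same metric with word n-grams added. The sentences below are invented examples, not from the paper.

```python
from sacrebleu.metrics import CHRF

# Hypothetical style-transfer outputs and their references.
hyps = ["could you please send me the report?"]
refs = [["would you kindly send me the report?"]]

chrf = CHRF()                 # character n-grams only (chrF)
chrf_pp = CHRF(word_order=2)  # adds word 1- and 2-grams (chrF++)

print(chrf.corpus_score(hyps, refs))
print(chrf_pp.corpus_score(hyps, refs))
```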
[link]
Are all language orders equally hard? Supposedly, for RNNs yes, for Transformers no. @JenniferCWhite @ryandcotterell aclanthology.org/2021.acl-long.… github.com/rycolab/artificial-languages (currently empty) #NLProc Really cool, with a caveat. Adapted from: https://twitter.com/LChoshen/status/1450809889106931716?s=20

The paper creates synthetic languages (using a PCFG) with various ordering rules, making it possible to compare each order. They also add agreement and a vocabulary, to introduce more of the important features of real language (e.g. long-distance dependencies). Last, the words are taken from words that could have been in English (e.g. daxing).

Their results are that Transformers do have inductive biases, but those are not towards the most probable language orders (e.g. VOS is quite easy for Transformers). Another interesting thing is that RNNs do not show that, despite all that we know about their strong inductive bias regarding forgetting (it is hard for them to relate far-away words / information). Last, it seems that a lot of what makes things easy or hard is related to the agreement, at least from looking at which rules matter together.

I really like this direction (which they seem to discuss a lot, seems like a reviewer was quite annoying) and the results are interesting. However, honestly, I also have a lot of criticism of this work. A lot of the details are missing. For example, how were the words tokenized? If BPE, is the morphology split into tokens perfectly? If there is no subword tokenization, then their sophisticated way of choosing words doesn't matter, because the words are anyway converted into one-hot vectors. How large were the vocabularies? What were the hyperparameters of the trained networks? There are works saying that depth changes inductive biases, and not only the architecture... Because of the large number of networks (~128 * 10 reps), they train on really small amounts of data. It is hard to say to what extent that mattered, and whether with a small vocab it is reasonable. Still, it is hard to generalize from the results because of it.

Last, and this is especially peculiar: "Weights given to each production were chosen manually through experimentation. Some principles for choosing weights for a grammar in this manner are described by Eisner and Smith (2008)". So we can't know what type of sentences they create (balanced trees, long sentences, really low weights on something that makes a phenomenon / switch less interesting), and they don't share the method or even the chosen weights.

To sum up, this is a very interesting approach with a lot of potential, which could benefit from more analysis of the results (e.g. per switch); analysing it myself is also problematic, as a lot is left out of the paper (nor is there an appendix with the info). |
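A minimal sketch of the kind of weighted PCFG sampling such synthetic languages are built from. The grammar, weights, and nonce words below are invented for illustration (the paper's actual grammar weights are exactly what isn't published); only the idea of an order "switch" comes from the paper.

```python
import random

# Toy weighted PCFG with a word-order "switch": the two VP rules put the
# object before or after the verb. All weights here are made up.
GRAMMAR = {
    "S":   [(("NP", "VP"), 1.0)],
    "VP":  [(("V", "NP"), 0.7),       # VO order
            (("NP", "V"), 0.3)],      # OV order
    "NP":  [(("N",), 0.8),
            (("N", "Rel"), 0.2)],     # relative clause -> long-distance dependency
    "Rel": [(("that", "VP"), 1.0)],
    "N":   [(("dax",), 0.5), (("blicket",), 0.5)],   # English-like nonce words
    "V":   [(("daxed",), 0.5), (("blicked",), 0.5)],
}

def sample(symbol="S"):
    """Recursively expand a symbol, sampling productions by their weights."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    rhs_options, weights = zip(*GRAMMAR[symbol])
    rhs = random.choices(rhs_options, weights=weights, k=1)[0]
    return [tok for sym in rhs for tok in sample(sym)]

for _ in range(3):
    print(" ".join(sample()))
```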
[link]
Huge commit summarization dataset.

The dataset filters tons of open-source projects to keep only those with high-quality committing habits (e.g. large, active projects with commits of significant length, etc.).

We present some ways to evaluate whether the meaning was kept while summarizing, so you can go beyond ROUGE.

We provide a strict split that keeps some (roughly a thousand) repositories totally out of the training set, so you can check in-domain and out-of-domain performance, or just be sure results are clean.

If you ever want an even larger dataset, follow the same procedure with more repositories (we took only ones active in 2020; you can pick ones that are no longer active, or that weren't active until now).

Dataset: https://figshare.com/articles/dataset/CumSum_data_set/14711370
Code: https://github.com/evidencebp/comsum
Paper: https://arxiv.org/pdf/2108.10763.pdf |
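A minimal sketch of a repository-level split of the kind described above, where held-out repositories never appear in training. The field names, records, and 20% ratio are assumptions for illustration, not the dataset's actual schema or split.

```python
import random

def repo_level_split(examples, holdout_frac=0.2, seed=0):
    """Split commit-summarization examples so that every repository
    appears in exactly one side - no repo is shared across splits."""
    repos = sorted({ex["repo"] for ex in examples})   # "repo" field is assumed
    random.Random(seed).shuffle(repos)
    n_holdout = int(len(repos) * holdout_frac)
    holdout_repos = set(repos[:n_holdout])
    train = [ex for ex in examples if ex["repo"] not in holdout_repos]
    test = [ex for ex in examples if ex["repo"] in holdout_repos]
    return train, test

# Toy usage with made-up records:
examples = [
    {"repo": "org/projA", "message": "Fix null check in parser", "summary": "bugfix"},
    {"repo": "org/projB", "message": "Add caching layer", "summary": "feature"},
]
train, test = repo_level_split(examples)
```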