This paper takes up the question of whether rhetorical relations can be automatically derived and classified. It focuses, in particular, on discourse markers. These may be ambiguous (e.g. 'since' and 'yet' have multiple uses and are sometimes, but not always, discourse markers), and they may also be missing altogether.
The authors comment that "what is needed is a model which can classify rhetorical relations in the absence of an explicit discourse marker" (p. 4). Previous work (e.g. Marcu & Echihabi 2002) has suggested creating training data for a classifier by labelling examples which contain an unambiguous, lexically marked rhetorical relation, then removing the markers. The main purpose of this paper is to test empirically whether this approach works.
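The labelling-then-removal idea can be sketched roughly as follows. This is an illustration only, not the procedure used in the paper or in Marcu & Echihabi (2002): the marker list, the relation names and the function name make_training_example are all hypothetical, and a real system would extract and pair text spans rather than simply deleting a word.

```python
import re

# Hypothetical marker-to-relation mapping, for illustration only.
# It assumes these markers are unambiguous, which would need checking.
MARKER_TO_RELATION = {
    "because": "EXPLANATION",
    "although": "CONTRAST",
}


def make_training_example(sentence):
    """Label a sentence by the first listed marker found in it, then strip that marker.

    Returns (text_without_marker, relation), or None if no listed marker occurs.
    """
    for marker, relation in MARKER_TO_RELATION.items():
        # Match the marker as a whole word, plus an optional comma and trailing space.
        pattern = re.compile(r"\b" + re.escape(marker) + r"\b,?\s*", re.IGNORECASE)
        if pattern.search(sentence):
            stripped = pattern.sub("", sentence, count=1)
            # Collapse any doubled whitespace left behind by the removal.
            stripped = " ".join(stripped.split())
            return stripped, relation
    return None


if __name__ == "__main__":
    for s in [
        "She stayed home because it was raining.",
        "Although it was late, he kept working.",
        "The sky is blue.",
    ]:
        print(make_training_example(s))
```

A classifier trained on the output sees only the residual context, which is why the approach depends on that context still carrying enough information about the relation.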
The paper also offers an interesting theoretical observation. Two conditions are needed for training on marked examples to work well:
"First, there has to be a certain amount of redundancy between the discourse marker and the general linguistic context, i.e. removing the discourse marker should still leave enough residual information for the classifier to learn how to distinguish different relations."
Second, marked and unmarked examples must be similar enough that a classifier trained on the former can generalize to the latter.
The paper suggests that texts with lexically marked and lexically unmarked rhetorical relations may be inherently different, in that removing a discourse marker may change the meaning of a sentence. It also suggests that classifiers trained on examples from which the markers have been removed work little better than chance.