Thucydides 1.89-118: A Multi-layer Treebank

Citation with persistent identifier:

Mambrini, Francesco. “Thucydides 1.89-118: A Multi-layer Treebank.” CHS Research Bulletin 1, no. 2 (2013).


§1  Digital annotated corpora are nowadays an indispensable resource for linguistic studies. Since the creation of the first “treebanks” (as the collections that embed a word-by-word morphological and syntactical analysis are called), these resources have been fruitfully employed in a broad spectrum of contexts. The applications range from corpus-based studies on linguistic problems, to grammar induction, natural language processing (NLP) and research in language acquisition.[1]

§2  While originally the annotation of the first treebanks was limited to morphology and syntax, corpus linguists soon became interested in enriching the corpora with information on more complex layers of analysis, like semantics and pragmatics (Leech 2004). Most recently, annotated corpora have also extended their range of applications to another area that lies outside that of traditional theoretical or applied linguistics. In a world of global and immediate communication, the goal of information extraction has gained a sudden prominence. Global policy making, for example, requires audiences to face threats that challenge the international community across all kinds of linguistic and cultural borders; in consequence, virtually every citizen of the world should be able to access the relevant information available on special issues.[2] Naturally, the expressiveness of natural language and linguistic diversity are a great challenge to this aim.

§3  In recent years, the first digital annotated texts of Ancient Greek and Latin literature have become available to the public.[3] As was the case with the first treebanks, the annotation that they integrate is currently still limited to syntax and morphology. This paper will present the first attempt to extend the level of linguistic annotation on Ancient Greek texts to semantic and pragmatic phenomena. In doing so we will draw on the experience of one of the most important treebanks available, the Prague Dependency Treebank (PDT) of Czech (Bohmová et al. 2001), and on the theoretical framework on which it is based (Sgall et al. 1986).

§4  In particular, we will focus on the interpretative work that is needed to apply the model of the PDT to Ancient Greek literary texts. Arguably, one of the most crucial requirements in designing an annotated corpus is the composition of a set of guidelines, such that the “metalanguage” used to describe the linguistic phenomena is made clear and that the ambiguities in the concrete work of tagging are reduced to a minimum. Since the project of a multi-layer corpus of Ancient Greek is at its beginnings, we will present the elaboration of annotation guidelines in the making.

§5  The relation between linguistic theory and text corpora has always been complex and sometimes conflictual. In an important paper, Hajičová and Sgall (2006) have shown that the interaction between theory and annotation can be more articulated than the simple opposition between bottom-up and top-down approaches from theory to data. Rather, the interaction can be represented as a (virtuous) circle: established grammars of a language serve as a basis for the design of annotation guidelines; the process of annotation, in turn, provides an occasion to test how theory can account for concrete linguistic data. Finally, this new insight can lead to the revision and, possibly, to the emendation of the starting principles.

§6  In our effort to build a multi-layer treebank, we will follow this direction, although on a much smaller scale. We will start by adopting the annotation guidelines that have been elaborated for Czech and (partially) for English.[4] We will apply this schema to a limited section of an Ancient Greek text: the chapters of the Histories of Thucydides dedicated to the so-called Pentecontaetia of Book I (89-118), the narration of how Athens extended its domination over a large section of the Greek world from the end of the Persian Wars to the beginning of the Peloponnesian War. This preliminary annotated corpus (which counts 178 sentences and 4641 words in total) will then serve as the basis for compiling the first guidelines for the so-called tectogrammatical annotation of Ancient Greek.[5]

§7  This paper will present some cases that illustrate the work. We will start with a summary sketch of the annotation schema that we will use (sec. 2). We will then offer a short series of case studies to present some of the problems that are encountered when the annotation schema is applied to Ancient Greek (sec. 3).

A Four-Level Scenario

Text, Morphology and Syntax

§8  Following the model of “Functional-Generative Description” (Sgall et al. 1986), the PDT distinguishes three strata of linguistic annotation that are added on top of a corpus.

§9  At the first level, all words and punctuation marks attested in the corpus are analyzed in terms of their part of speech and inflectional morphology, and are assigned to a lemma. In Ancient Greek, for example, the word ἦλθον would be indexed under the lemma ἔρχομαι and analyzed as verb, third person plural (or, with the same form, first person singular), indicative aorist. At this stage, the words are still considered in isolation: no relation is assumed between them.

§10  The second level, called the “analytic” layer in the PDT schema, is concerned with syntax. The text is divided into sentences, and the syntactic relations between all words and punctuation marks within each of them are described according to a dependency-based formalism, where the connections are drawn between words, going from a dominant head to the dependent(s); no intermediate constituent (e.g. a noun phrase) is introduced in the representation.[6] According to dependency grammars, for example, a verb dominates its arguments, while nouns govern their attributes. In the analytical layer of the PDT, these relations are described with a set of labels that correspond mainly to the syntactic roles of standard grammar (e.g. SBJ for subject, OBJ for direct and indirect object, etc.).

Figure 1: A sentence from Thucydides in the AGDT format

§11  These two layers of annotation of the PDT provided the model for the first treebank of Greek literary texts of Archaic and Classical age, the Ancient Greek Dependency Treebank (AGDT) promoted by the Perseus Project.[7] Figure 1 represents a sentence from Thucydides, Histories I 89.1 in the format used by the AGDT. As is visible in the structure of the XML file where the annotation is stored, each word is described by a series of attributes. While the attribute “form” reproduces the word as it stands in the text, “lemma” and “postag” record the main morphological analysis. The “head” attribute points to the numeric “id” of the governing word. Finally, “relation” stores a label for the syntactic relation between head and dependent.[8] Figure 2 represents the syntactic annotation described above as a dependency tree.
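The attribute scheme just described can be read with a few lines of code. What follows is a minimal sketch in Python, assuming that each token is stored as a `word` element carrying the attributes named above. The XML fragment is a hand-made illustration covering a few words of the sentence; its `postag` strings and the `ATR` and `PRED` labels are assumptions, and it is not copied from the AGDT file.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment for illustration only: it imitates the attributes
# described above (id, form, lemma, postag, head, relation) but is not
# copied from the AGDT files; postag values and some labels are assumed.
SAMPLE = """
<sentence id="1">
  <word id="1" form="οἱ" lemma="ὁ" postag="l-p---mn-" head="2" relation="ATR"/>
  <word id="2" form="Ἀθηναῖοι" lemma="Ἀθηναῖος" postag="n-p---mn-" head="5" relation="SBJ"/>
  <word id="3" form="τρόπῳ" lemma="τρόπος" postag="n-s---md-" head="5" relation="ADV"/>
  <word id="4" form="τοιῷδε" lemma="τοιόσδε" postag="a-s---md-" head="3" relation="ATR"/>
  <word id="5" form="ἦλθον" lemma="ἔρχομαι" postag="v3paia---" head="0" relation="PRED"/>
</sentence>
"""

def read_words(xml_text):
    """Return a dict mapping each word id to its attribute dict."""
    root = ET.fromstring(xml_text.strip())
    return {w.get("id"): dict(w.attrib) for w in root.iter("word")}

def dependents(words, head_id):
    """Forms of all words whose 'head' attribute points to head_id."""
    return [w["form"] for w in words.values() if w["head"] == head_id]

def path_to_root(words, word_id):
    """Follow 'head' pointers upwards until the artificial root (id 0)."""
    path = []
    while word_id != "0":
        path.append(words[word_id]["form"])
        word_id = words[word_id]["head"]
    return path

words = read_words(SAMPLE)
print(dependents(words, "5"))    # words directly governed by ἦλθον
print(path_to_root(words, "4"))  # chain of heads above τοιῷδε
```

Because every word stores only a pointer to its governor, the whole dependency tree can be rebuilt from these flat records, which is what the tree in figure 2 visualizes.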

Figure 2: Thucydides 1.89.1. Syntactic tree

Surface and Deep Syntax: Tectogrammatical annotation

§12  Although up to twenty labels are used to distinguish between the different relations, the analysis of the syntax that is encoded at this level is rather coarse-grained. The main distinction that involves verbal complements, for example, is that between required arguments and circumstantials (tagged as ADV, like τρόπῳ in figure 2); in the case of these adverbials, however, no difference is made between e.g. modals, temporal expressions, or complements of means.

§13  That distinction is in fact not a matter of pure syntax, but involves a measure of semantic interpretation. In the PDT stratified model, it is therefore left for a level where the different aspects of linguistic meaning are captured. In the framework of Functional-Generative Description, this layer is called “tectogrammatical”.

§14  In brief, the tectogrammatical representation captures the meaning of a sentence by annotating a combination of different factors:

  1. the semantically relevant words and the dependency structure formed by them; a detailed set of semantic-syntactic labels (the so-called “functors”) is used to describe these relations;
  2. the semantic information that is expressed through the morphology (number, gender, aspect etc.), represented as properties of the words (with the so-called “grammatemes”);
  3. coreference resolution;
  4. topic-focus articulation and communicative dynamism.

§15  In a tectogrammatical tree, the sentence structure is represented as a dependency tree, just like in the analytical layer. The nodes of the trees, however, are not all the words attested in the text, but only those that carry an independent lexical meaning (which are therefore represented by the lemmata). The “technical” items that modify other words by introducing special nuances of meaning (like modal verbs, auxiliaries in general, articles, prepositions or conjunctions) are not represented as nodes per se, but rather as properties of the lexical nodes. Therefore, for example, in the Greek participial phrase ἀμύνειν βουλόμενοι Αἰγινήταις (Thucydides, Histories I 105.3: lit. “wanting to help the Aeginetans”) the participle is not reproduced in the tectogrammatical tree; rather, the verb ἀμύνειν acquires the deontic modality that represents events as “wanted/intended”. Along with deontic modality, other “grammatemes” record the meaning expressed by the morphological categories, such as number, gender, tense and mood.[9]
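The way a function word turns into a property of a lexical node can be sketched as follows. This is only a toy illustration in Python: the class `TNode`, the lookup table `MODAL_VERBS`, the grammateme key `deontmod` and its value `volitive` are names chosen here for the example and should not be read as the PDT's own inventory.

```python
from dataclasses import dataclass, field

@dataclass
class TNode:
    """A toy tectogrammatical node: a lemma, its grammatemes, its dependents."""
    lemma: str
    grammatemes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

# Hypothetical table: verbs treated as modal, with the modality they contribute.
MODAL_VERBS = {"βούλομαι": "volitive"}

def collapse_modal(modal_lemma, lexical_verb):
    """Drop the modal verb as a node and record its meaning on the lexical verb."""
    lexical_verb.grammatemes["deontmod"] = MODAL_VERBS[modal_lemma]
    return lexical_verb

# ἀμύνειν βουλόμενοι Αἰγινήταις: the participle does not become a node.
help_verb = TNode("ἀμύνω", children=[TNode("Αἰγινήτης")])
node = collapse_modal("βούλομαι", help_verb)
print(node.lemma, node.grammatemes)
```

The resulting structure has two nodes instead of three: the wish expressed by βουλόμενοι survives only as an attribute of ἀμύνω.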

§16  A fifth component that can be added to the list quoted above is ellipsis resolution. Very often in natural language, some information is left unexpressed for the sake of brevity: this happens very frequently with coordinated structures.[10] In such cases, the omitted nodes are always reconstructed in the tectogrammatical representation.

§17  The most important type of reconstructed nodes is represented by valency arguments, which are very often omitted and inferred from the context in ordinary communication. In a pro-drop language like Greek the ellipsis of the subject is extremely frequent.

§18  In tectogrammatical sentence representation, the concept of valency does indeed play a crucial role. Since the seminal work of Tesnière (1959:102), the valency of a word is defined as the list of obligatory participants that are required to fill each of the distinct meanings of a lexical unit. In our section of Thucydides, for example, the verb συντίθημι (“set/put together”) is used with two different senses: “narrate” (Histories I 97.2; see LSJ II.3) and (in the middle voice) “conclude an agreement on” (Histories I 115.3; see LSJ B.II.1). The first meaning requires two arguments: a narrator and the content that is narrated. The second case is more complex, but by carefully examining the passages where the verb is used in this particular sense we can reconstruct a frame with the two parties that reach the agreement, the deal that is concluded (e.g. τὴν ξυμμαχίαν), and optionally the terms of the agreement. These sets of complements constitute two distinct valency frames for the same verb. Valency arguments can be omitted whenever they can be easily reconstructed from the context or whenever they are left intentionally unexpressed, but are semantically always required; they are therefore always integrated in the tectogrammatical trees, and the newly introduced nodes can take the conventional label of “personal pronoun” (whenever it is sufficient to integrate a pronoun such as “she”, “he” or “it” into the reconstruction of the phrase), “general”, or “unspecified” (when the argument is intentionally left unexpressed).
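A valency lexicon of this kind, together with the reconstruction of omitted arguments, can be sketched in Python as follows. The slot names (`narrator`, `party_1`, and so on) are informal glosses of the description above rather than PDT functors, and the lexicon `VALENCY` and the function `complete_frame` are hypothetical names introduced only for the illustration.

```python
# Hypothetical valency lexicon: (lemma, sense) -> (obligatory slots, optional slots).
# Slot names are informal glosses, not labels taken from the PDT.
VALENCY = {
    ("συντίθημι", "narrate"): (["narrator", "content"], []),
    ("συντίθημι", "conclude an agreement on"): (
        ["party_1", "party_2", "deal"],
        ["terms"],
    ),
}

# The three conventional labels for reconstructed nodes mentioned in the text.
RECONSTRUCTED = {"personal pronoun", "general", "unspecified"}

def complete_frame(lemma, sense, expressed, label="personal pronoun"):
    """Add a placeholder node for every obligatory slot missing from the text."""
    if label not in RECONSTRUCTED:
        raise ValueError(f"unknown label: {label}")
    obligatory, _optional = VALENCY[(lemma, sense)]
    frame = dict(expressed)
    for slot in obligatory:
        # Optional slots are never reconstructed; obligatory ones always are.
        frame.setdefault(slot, f"<{label}>")
    return frame

frame = complete_frame(
    "συντίθημι", "conclude an agreement on", {"deal": "ξυμμαχία"}
)
print(frame)
```

In a real annotation the choice among the three labels is made slot by slot by the annotator; the single `label` parameter is a simplification.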

§19  Coreference resolution is the task of linking together all the mentions of the entities that have the same referent. It is a crucial task for text interpretation that all the readers of Ancient Greek texts are (even unconsciously) familiar with,[11] since it allows readers to disambiguate the reference of pronouns and anaphoric/cataphoric adjectives. One important advantage of tectogrammatical interpretation is that even the newly introduced nodes are linked to their co-referents: the implied subjects for instance are connected to the word that is inferable from the context. Three different types of textual coreference are distinguished in the tectogrammatical annotation: reference to precisely identifiable nodes in the preceding/following text, “segment” coreference for words that resume or anticipate whole paragraphs or sections, and deictic reference to extra-discursive entities.

§20  Contextual boundness and topic-focus articulation also play a decisive role in Ancient Greek. Their importance for understanding the order of constituents in the sentence has recently been stressed with decisive arguments.[12] The notions of topic and focus, however, may be very different according to the theoretical framework that is adopted for the analysis.

§21  In the PDT, every node of the tectogrammatical tree is annotated with one of three different labels according to its contextual boundness and communicative dynamism.[13] The prototypical contextually bound items are the words and concepts that have been already mentioned in the discourse. Contextual boundness, however, is (somewhat misleadingly) defined in terms not only of referential “givenness”, but also of “aboutness”. Elements that were never mentioned in the preceding context can nevertheless be present to the memory of the parties involved in the communication; they can be selected by the speaker as “given” elements about which the focus part is predicated.[14]

§22  Communicative dynamism, on the other hand, corresponds to the extent to which a linguistic element contributes towards the further development of communication.[15] A qualifying adjective, for instance, is considered more dynamic than its head, since it narrows the reference of a noun or brings a further qualification to it; for that reason, the adjective is annotated as a “contextually non-bound” node, regardless of the “givenness” of its head.

§23  It is nowadays generally known that topic can carry a contrastive meaning, whenever topic elements implicitly or explicitly suggest a set of alternatives on which different foci hold.[16] Hajičová et al. (1998:150) introduced the notion of contrastive topic in the theory in order to make room for the fact that typical focusing operators (like English “even”) are found also within the topic part. On the other hand, focus is intrinsically understandable as a choice between a set of possibilities. Whenever a sentence has one contrastive element, it is likely that this element occupies the position of focus; when the contrastive items are more than one, they are distributed between topic and focus (Hajičová and Sgall 2004). For that reason, a special label for contrastive contextually bound elements has been introduced in the annotation system. However, the guidelines for distinguishing non-contrastive and contrastive topics are still being defined; as a result, the distinction is left to the sensitivity of the annotators, as is reflected by the significant degree of inter-annotator disagreement registered in the process of manual annotation (Veselá et al. 2004).

§23  Once the nodes are annotated for contextual boundness, they are then assigned to either the topic or focus part of the sentence.[17]

§24  The stratified structure of the PDT has been called “a three-level scenario”. In our application of the model to a corpus of Ancient Greek texts we prefer however to make reference to a four-level structure. In the domain of classical philology, the text itself, on which the annotation is built, is far from being a definitive acquisition that serves as a basis for annotation. Rather, it should be considered part of the reconstruction work that goes side by side with the linguistic interpretation, even as it rests on the specific methodologies of textual criticism. We will see an example of this interplay between treebank and texts in one of the following sections (3.2).

Figure 3: Thucydides 1.106.2: A four-level scenario

§25  Figure 3 illustrates the model of our “four-level scenario”, marked by the progressive stratification of the layers of interpretation, with one example from the Pentecontaetia of Thucydides.

The art of linguistic annotation

§26  In this section, we will see the process of annotation at the tectogrammatical layer in action. We will discuss two examples from Thucydides, Histories I 89-118 that touch on some of the areas of tectogrammatical sentence representation, while at the same time involving as many layers of the four-level scenario as possible. In the first example, we will concentrate especially on topic-focus articulation and coreference resolution, while mentioning also a problem related to the “tense” grammateme. The second example will also involve the level of “surface” syntax, valency, and textual reconstruction.

§27  The discussion of the problems posed by a multi-layer linguistic annotation could have also been organized according to the different phenomena: constitutio textus, valency of verbs, coreference, information-structure and so on. We believe however that the model of a textual commentary can be more useful for a number of different reasons.

§28  Firstly, on account of the novelty of the task, we believe that it is important to provide scholars and students who want to engage in annotation with concrete cases, problems and possible solutions they can be faced with. A textual example is certainly a better illustration of the methodology than a general discussion.

§29  Secondly, the decisions that are taken case by case on each linguistic phenomenon are not independent. As the examples below will show, conclusions on valency, temporal relations, or coreference are highly influenced by the general interpretation of each context. It is hardly possible, and ultimately not useful, to separate aspects that, as is typical of the tectogrammatical layer of annotation, are different facets of the same problem of meaning.

§30  Finally, discussing how the meaning of a sentence changes according to the different solutions offered by the tectogrammatical annotation schema for each problem will hopefully clarify how expressive and fine-grained linguistic annotation is. Conversely, the discussion will also help to point out the limits of tectogrammatical annotation, by highlighting areas where the different nuances of the scholarly debate are not captured by the attributes and relations of the corpus tokens.

Thucydides, Histories I 89.1


οἱ γὰρ Ἀθηναῖοι τρόπῳ τοιῷδε ἦλθον ἐπὶ τὰ πράγματα ἐν οἷς ηὐξήθησαν


It was in the following manner that the Athenians attained the condition in which they rose to greatness.[18]

§31  This sentence marks the beginning of the digression known as Pentecontaetia. What follows until chapter 118 is introduced as the history of how Athens grew to such a state of wealth and power that the clash with Sparta was inevitable.

§32  The context needs to retain our attention briefly. The last paragraph of chapter 88 concluded the episode of the meeting of the Peloponnesian allies (started at chapter 67) by stating that the main reason that brought the Spartans to vote for the war was not the arguments brought by the allies, but the fear that Athens might become even more powerful than it was already. Now Thucydides turns to justify those fears: the rise of the Athenian empire is told in full.

§33  Both in terms of reference and of “aboutness”, οἱ Ἀθηναῖοι is thus contextually bound and serves as the shifting topic for the Pentecontaetia: from the Peloponnesian side we move to Attica. In the bipolar world of the Peloponnesian War, the shift to the other side of the conflict can be thought to imply contrast. As we have seen, criteria for distinguishing contrastive contextually bound elements have not been completely defined in the PDT annotation practice; it is worth exploring, therefore, the implications of the choice between contrastive and non-contrastive boundness in full.

§34  In this context, the interpretation of οἱ Ἀθηναῖοι as “contrastive topic” would arguably be wrong. The fact that an element is a choice from a well-defined set of alternatives, as the two cities and their allies in the Histories are, is not sufficient. In prototypical cases, the set of alternatives is involved in the meaning of a sentence, either because the alternatives are explicitly realized in the text or because they are implied by the context.[19] In the model of Büring (2003), contrastive topics presuppose a strategy and, if we resort to the question-test, a superordinate question that dominates the particular question at hand. In this specific case, the introductory statement of the Pentecontaetia is not part of a general strategy meant to argue “in which way y did the city x become powerful”. In other words, this sentence does not presuppose the fact that other states came to power in other ways (while Athens did it in this way, which the author proceeds to describe).

§35  Nor, on the other hand, does the sentence fit into a more general contrastive strategy, such as the one evoked by the superordinate question “what happened to the cities in conflict” (the Peloponnesian League debated, while Athens built the empire). The section that follows is in continuity with what precedes: it illustrates how the situation was determined (note the opening γάρ), even if we change the point of view. Therefore, Ἀθηναῖοι should be marked as a non-contrastive contextually bound expression.

§36  On the other hand, this sentence has a perceptible main focus, which is represented by τρόπῳ τοιῷδε. The deictic pronoun, which typically belongs to the topic part, is here used in cataphora, to introduce the narrative that follows and thus some new information. The same strategy of introducing episodes or descriptive sections with cataphoric pronouns/adjectives is also found several times in Herodotus.[20]

§37  The status of the verb and its second argument (ἦλθον ἐπὶ τὰ πράγματα ἐν οἷς ηὐξήθησαν) is less obvious. In prototypical cases, the main predicate expresses the central event of a sentence and thus belongs to the focus part. In this example, however, the main idea that Athens rose to power (which is expressed with a lexically vague expression such as “came to the situations”, ἦλθον ἐπὶ τὰ πράγματα) has been already stated clearly in chapter 88. The “question test” would yield the result that this sentence is more suitable as an answer to the hypothetical question n.1 than n.2:

  1. How did the Athenians come to power?
  2. What happened to the Athenians?

§38  We are therefore led to conclude that the verb ἔρχομαι and its complement πράγματα should be annotated as contextually-bound words that belong to the topic part.
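The outcome of this analysis can be written down compactly. The sketch below, in Python, records the labels argued for above and splits the nodes into a topic and a focus part. It is deliberately simplified: the short codes `t`, `c` and `f` are assumed abbreviations for the three labels, only the nodes discussed in the text are included, and the division is made on the labels alone, whereas an actual assignment would also take the position of each node in the tree into account.

```python
# Assumed short codes for the three labels:
#   "t" = contextually bound (non-contrastive)
#   "c" = contrastive contextually bound
#   "f" = contextually non-bound
BOUND = {"t", "c"}

# Labels argued for in the discussion of I 89.1 (lemma, label).
annotation = [
    ("Ἀθηναῖος", "t"),  # bound, and explicitly non-contrastive
    ("ἔρχομαι", "t"),   # the rise to power was already stated in ch. 88
    ("πρᾶγμα", "t"),
    ("τρόπος", "f"),    # main focus: τρόπῳ τοιῷδε
    ("τοιόσδε", "f"),   # cataphoric, introduces new information
]

def split_topic_focus(nodes):
    """Naive split: bound nodes go to the topic, non-bound nodes to the focus."""
    topic = [lemma for lemma, label in nodes if label in BOUND]
    focus = [lemma for lemma, label in nodes if label not in BOUND]
    return topic, focus

topic, focus = split_topic_focus(annotation)
print("topic:", topic)
print("focus:", focus)
```

Had οἱ Ἀθηναῖοι been judged contrastive, only its label would change (from `t` to `c`); it would still belong to the topic part.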

§39  Figure 4 shows a tectogrammatical representation of the sentence, with the nodes reordered according to the principles of “communicative dynamism” as the PDT model presupposes.[21]

Figure 4: 1.89.1: tectogrammatical tree (re-ordered)

§40  The cataphoric adjective τοιόσδε points to the long list of sentences that will illustrate the opening statement. As we saw, elements that do not refer to a well-defined node in the tree, but rather embrace a long section of text, are annotated with the special “segment” coreference. In the annotation, therefore, only the fact that the demonstrative encompasses an unspecified segment of text is recorded.

§41  Yet, we may be legitimately encouraged to ask how far the scope of this adjective extends. The question has in fact been debated by some scholars and, ultimately, it touches the more general problem of the unity of the Pentecontaetia. The same problem has also been approached from another linguistic angle, that of the exact temporal relation between the main predicate (ἦλθον) and the verb of the relative (ηὐξήθησαν).[22]

§42  To begin with, we may note that this last problem is not reflected in the tectogrammatical representation. In a tectogrammatical node, information on the verb tense is indeed recorded in a grammateme that has possible values of simultaneity, anteriority and posteriority. But these values are recorded only insofar as they are expressed by the morphology. The aorist indicative may be used for “past in the past” (Rijksbaron 2002:20). But the temporal relation of anteriority between the two states of affairs referred to in the aorist is determined by the context, or possibly by the lexical meaning of a verb[23]: there is nothing in the aorist tense itself that carries this sense.[24]

§43  According to some critics, two moments are distinguished in I, 89.1: the Athenians reach a favorable position of hegemony, and then they transform this primacy into an empire (that marks the apex of their prosperity). The first sentence of I, 89 covers only the former part; the point of transition between the two phases is generally recognized at chapter 97, where the narration is interrupted by a paragraph that summarizes the evolution of the Athenian empire and a justification for Thucydides’ digression (Maddalena, 1952; Classen and Steup, 1919; Poppo and Stahl, 1885).

§44  Yet the interpretation that sees a reference to two historically separate phases in the “double preface” of chapters 89 and 97 is only an apparent solution, which conceals a bigger problem of the excursus. The second introduction at chapter 97, with the mention of Thucydides’ “predecessor” Hellanikos, is puzzling and the narrative that follows certainly looks different and more unadorned than the first part (Hornblower 1991:149). Furthermore, Thucydides consistently provides readers with only one time frame for his excursus on the rise of the Athenian hegemony, which goes from the retreat of Xerxes and the siege of Sestos (told in 89.2) to the outbreak of the Peloponnesian War in a period of “roughly fifty years” (118.2).[25] The impression is that the presence of a second introduction has more to do with problems of composition of the Pentecontaetia than with an author who presupposes an inner periodization of the fifty years of Athenian splendor.[26]

§45  Although we started with a problem of coreference resolution (what part of the text falls within the scope of the cataphoric τοιόσδε?), none of the aforementioned debate is captured in the tectogrammatical annotation.

Thucydides, Histories I 91.5

Figure 5: A tentative syntactic interpretation of the transmitted text


τήν τε γὰρ πόλιν ὅτε ἐδόκει ἐκλιπεῖν ἄμεινον εἶναι καὶ ἐς τὰς ναῦς ἐσβῆναι, ἄνευ ἐκείνων ἔφασαν γνόντες τολμῆσαι, καὶ ὅσα αὖ μετ᾽ ἐκείνων βουλεύεσθαι, οὐδενὸς ὕστεροι γνώμῃ φανῆναι


For when it seemed best to abandon the city and embark on the ships, [they said?] they had resolved and taken this bold step without the Lacedaemonians, and again in all matters in which they took counsel with them [they said?] they had shown themselves inferior to none in judgment.

§46  This is the sentence of the paragraph as transmitted by the manuscripts and printed by Jones and Powell (1942).[27]

§47  In itself, the sentence seems unobjectionable and very easy to construe: ἔφασαν is the main verb, which governs the two clauses (whose heads are τολμῆσαι and φανῆναι) that report the content of what is said. Syntactically, it would also seem natural at first sight to attach the nominative participle γνόντες to the main verb, with which it agrees in number. This prima facie interpretation is diagrammed in the tree of figure 5.

§48  Yet, when we turn to the valency of φημί and we try to integrate the Actor (syntactically, the missing subject) by looking to the context, it is easy to see that this construction rests on an impossible interpretation.

§49  The only (plural) candidate for the role of Actor is the group of the three Athenian ambassadors, Themistocles and his colleagues Abronicos and Aristeides, who are mentioned at I, 91.3. But the three ξυμπρέσβεις are never said to act together. It is Themistocles who steps forth (ἐπελθών, I, 93.4) and breaks the news to the Spartans that the walls of Athens are now completed. The sequence of sentences that report the speech is, as many editors have seen, utterly coherent in isolating the leader of the embassy from the colleagues: the first part of the indirect speech is introduced by ὁ Θεμιστοκλῆς…εἶπεν, followed by our sentence; then another indirect sentence is reported in the infinitive, without governing verb (δοκεῖν οὖν σφίσι…), and finally again the singular ἔφη, scilicet Themistocles, introduces the last section.

§50  If we bracket ἔφασαν with Krueger and most of the editors, we are left with a much clearer sequence, where the two verba dicendi, whose subject is Themistocles, frame two indirect sentences without main predicate.[28]

§51  Indeed, it seems virtually certain that ἔφασαν is a gloss, most likely inserted to provide an easier construction for the nominative plural γνόντες (thus Maddalena 1952). The gloss, however, falls short even with respect to that task. For in the context of the sentence, γιγνώσκειν, as γνώμη often does, indicates the deliberation process that precedes the action (Poppo and Stahl 1885, ad I, 70, Dover 1965 ad VII, 48): it cannot but be referring to the subjects of τολμῆσαι, i.e. the citizens of Athens. But if we retain ἔφασαν, then it would be more natural to attribute the participle to the ambassadors, where it would be completely pointless.

§52  On the contrary, not only does the participle become perfectly intelligible once the main predicate is removed; even the syntax of the sentence is considerably improved and much more in line with the use of Thucydides. Although the construction seems irregular, a plural participle is attracted to the nominative in the indirect speech of a singular speaker also in Thucydides, Histories VI 25 and VII 48; in both passages, the verbum dicendi is omitted. According to Kühner and Gerth (1904:29n3), the attraction is motivated by the fact that the speaker (Nikias on both occasions) is represented as the spokesman of the group, an explanation that is perfectly at home in our context too.

Figure 6: Thucydides 1.91.5. Tectogrammatical tree

§53  A treebank with analytical and tectogrammatical annotation would be extremely helpful in this case. It would be very easy to query the corpus in order to extract other possible examples of nominative participles depending on infinitives and to test all kinds of linguistic hypotheses on the attested cases.
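A query of this kind could look like the following sketch in Python. It assumes the `word` attributes described for the AGDT format and a nine-position `postag` in which the fifth character encodes mood (`p` for participle, `n` for infinitive) and the eighth encodes case (`n` for nominative); this layout is an assumption of the example and should be checked against the actual tagset. The XML fragment is a hand-made toy sentence that combines words from the passage for testing purposes; it is not an annotation of the transmitted text.

```python
import xml.etree.ElementTree as ET

# Assumed postag layout (9 positions): part of speech, person, number,
# tense, mood, voice, gender, case, degree. Indices below are 0-based.
MOOD, CASE = 4, 7

# Toy data, invented for this example (labels and tags are assumptions).
SAMPLE = """
<sentence id="1">
  <word id="1" form="Θεμιστοκλῆς" lemma="Θεμιστοκλῆς" postag="n-s---mn-" head="3" relation="SBJ"/>
  <word id="2" form="ἐπελθών" lemma="ἐπέρχομαι" postag="v-sapamn-" head="3" relation="ADV"/>
  <word id="3" form="εἶπεν" lemma="εἶπον" postag="v3saia---" head="0" relation="PRED"/>
  <word id="4" form="γνόντες" lemma="γιγνώσκω" postag="v-papamn-" head="5" relation="ADV"/>
  <word id="5" form="τολμῆσαι" lemma="τολμάω" postag="v--ana---" head="3" relation="OBJ"/>
</sentence>
"""

def nominative_participles_under_infinitives(xml_text):
    """Return (participle, infinitive) form pairs linked by a head pointer."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for sentence in root.iter("sentence"):
        words = {w.get("id"): w for w in sentence.iter("word")}
        for word in words.values():
            head = words.get(word.get("head"))
            if head is None:  # the word depends on the artificial root
                continue
            tag, head_tag = word.get("postag"), head.get("postag")
            is_nom_participle = tag[MOOD] == "p" and tag[CASE] == "n"
            if is_nom_participle and head_tag[MOOD] == "n":
                hits.append((word.get("form"), head.get("form")))
    return hits

hits = nominative_participles_under_infinitives(SAMPLE)
print(hits)
```

The participle ἐπελθών is skipped because its head is a finite verb, while γνόντες is returned because it hangs from an infinitive; run over a whole treebank, the same filter would collect the parallels needed to evaluate the construction.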

§54  In addition, as we have seen and as is visible from Figure 6, it is important to note that the tectogrammatical representation of the sentence integrates both the main verb from the preceding context (εἶπε) and a reference to the implied agent (Themistocles). Not only is tectogrammatical annotation an elegant way to restore Themistocles as responsible for the course of actions, as Thucydides clearly intended to portray him, but, more importantly, this information is also stored and recoverable from the treebank. In other words, one of the possible applications of our multi-layer corpus would be to extract all actions that, in the narration of an historian like Thucydides, are assigned to different agents, collective as well as individual. This type of query can allow for many important content-related studies of narrative texts.

§55  The tectogrammatical tree of the sentence is reproduced in figure 6.


Abeillé, A., ed. 2003. Treebanks. Building and Using Parsed Corpora. Boston.

Apresjan, J., ed. 2012. Meaning, Text, and other Exciting Things. A Festschrift to Commemorate the 80th Anniversary of Professor Igor Alexandrovic Mel’cuk. Moscow.

Bakker, E. 2007. “Time, tense, and Thucydides.” The Classical World 100:113–122.

Bamman, D., and G. Crane. 2009. Guidelines for the syntactic annotation of Ancient Greek treebanks. The Perseus Project, Tufts University.

Bamman, D., F. Mambrini, and G. Crane. 2009. “An ownership model of annotation: The ancient Greek Dependency Treebank.” Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories, 5–15. Milan.

Bohmová, A., J. Hajič, E. Hajičová, and B. Hladká. 2001. “The Prague Dependency Treebank: A three-level annotation scenario.” In Abeillé 2003:103–127.

Büring, D. 2003. “On d-trees, beans, and b-accents.” Linguistics and Philosophy 26:511–545.

Classen, J., and J. Steup, eds. 1919. Thukydides. Erklärt von J. Classen. Bearbeitet von J. Steup. Volume 1. Berlin.

Crocker, M. W., and J. Siekmann, eds. 2010. Resource-Adaptive Cognitive Processes. Berlin and Heidelberg.

Debusmann, R., and M. Kuhlmann. 2010. “Dependency grammar: Classification and exploration.” In Crocker and Siekmann 2010:365–388.

Dik, H. 1995. Word Order in Ancient Greek: A pragmatic account of word order variation in Herodotus. Amsterdam.

———. 2007. Word Order in Greek Tragic Dialogue. Oxford.

Dover, K. J., ed. 1965. Thucydides. Book VII. Oxford.

Firbas, J. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge.

Gomme, A. W. 1945. A historical commentary on Thucydides. Vol. I: Introduction and Commentary on Book I. Oxford.

Gundel, J., and T. Fretheim. 2004. “Topic and focus.” In Horn and Ward 2004:175–196.

Hajičová, E. 2012. “Topic-focus revisited (through the eye of the Prague Dependency Treebank).” In Apresjan 2012:218–232.

Hajičová, E., B. Partee, and P. Sgall. 1998. Topic-Focus Articulation, Tripartite Structures, and Semantic Content. Dordrecht.

Hajičová, E., and P. Sgall. 2004. “Degrees of contrast and the topic-focus articulation.” In Steube 2004:1–13.

———. 2006. “Corpus annotation as a test of a linguistic theory.” In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), 879–884. Genoa.

Haug, D. T. T., and M. L. Jøhndal. 2008. “Creating a parallel treebank of the old Indo-European Bible translations.” In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 27–34. Marrakech.

Horn, L. R., and G. Ward, eds. 2004. The Handbook of Pragmatics. Oxford.

Hornblower, S. 1991. A commentary on Thucydides. Vol. I. Book I-III. Oxford.

Jones, H., and J. Powell, eds. 1942. Thucydides. Historiae. Oxford.

Jowett, B. 1881. Thucydides, Volume 1. Oxford.

Kühner, R., and B. Gerth. 1904. Ausführliche Grammatik der griechischen Sprache. Zweiter Teil: Satzlehre. Zweiter Band. Hannover.

Leech, G. 2004. “Adding linguistic annotation.” In Wynne 2004:17–29.

Maddalena, A., ed. 1952. Thucydidis Historiarum Liber Primus, Volume 2. Firenze.

Matić, D. 2003. “Topic, focus, and discourse structure: Ancient Greek word order.” Studies in Language 27:573–633.

Poppo, E. F., and J. M. Stahl, eds. 1885. Thucydidis De Bello Peloponnesiaco Libri Octo. Explanavit E.F. Poppo. Editio Tertia, quam auxit et emendavit I.M. Stahl (3 ed.), Volume 1. Leipzig.

Powell, J. E. 1938. A lexicon to Herodotus. Cambridge.

Rijksbaron, A. 2002. The Syntax and Semantics of the Verb in Classical Greek: An Introduction (3rd ed.). Amsterdam.

Sgall, P., E. Hajičová, and J. Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Dordrecht.

Smith, C. F., ed. 1919. Thucydides I. History of the Peloponnesian War. Books 1 and 2. Cambridge.

Steube, A., ed. 2004. Information structure: theoretical and empirical aspects. Berlin.

Tesnière, L. 1959. Éléments de syntaxe structurale. Paris.

Veselá, K., J. Havelka, and E. Hajičová. 2004. “Annotators’ Agreement: The Case of Topic-Focus Articulation.” In Proceedings of the 4th International Conference on Language Resources and Evaluation, 2191–2194. Lisbon.

Veselá, K., N. Peterek, and E. Hajičová. 2003. “Topic-focus articulation in PDT: prosodic characteristics of contrastive topic.” The Prague Bulletin of Mathematical Linguistics 79-80:5–22.

Vossen, P., E. Agirre, N. Calzolari, C. Fellbaum, S.-K. Hsieh, S.-K. Huang, H. Isahara, K. Kanzaki, A. Marchetti, M. Monachini, F. Neri, R. Raffaelli, G. Rigau, M. Tescon, and J. Van Gent. 2008. “KYOTO: a system for mining, structuring, and distributing knowledge across languages and cultures.” In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech.

Wynne, M., ed. 2004. Developing Linguistic Corpora: A Guide to Good Practice. Oxford.

[1] The essays collected in Abeillé (2003) provide an excellent (though now rather outdated) introduction to treebanks.

[2] Global warming and climate change are good examples of such challenges. Projects in this domain have prompted a good amount of work in NLP technologies for extracting content from texts. See e.g. Vossen et al. (2008).

[3] Two treebanks of Ancient Greek texts are currently being developed: for the Ancient Greek Dependency Treebank see below, section 2.1; for the PROIEL corpus, see Haug and Jøhndal (2008).

[4] The guidelines for the annotation of Czech are available at:; for the tectogrammatical annotation of English see:

[5] “Tectogrammatical” annotation is discussed and explained below, in sec. 2.2.

[6] In a dependency tree, relations must be “acyclic” or non-recursive, which means that a word cannot depend on itself, not even transitively. On the formal constraints of dependency trees see Debusmann and Kuhlmann (2010).

[7] AGDT: For an introduction, see Bamman et al. (2009).

[8] For a complete list of the twenty labels used see Bamman and Crane (2009).

[9] Note that it is the semantic, not the morphological, value that is recorded in the grammatemes. To give an example, in the case of pluralia tantum, the grammateme “number” would register “singular” if only one item is intended by the author, regardless of the (grammatically) plural noun that is used in the text. We will mention some of the problems of defining grammatemes for Greek below, in section 3.

[10] See for example the sentence: Mary likes lilies and Anna [likes] roses, where the second verb is most often left out.

[11] A remark in the commentaries such as “λαμβάνει: τὸ τεῖχος nämlich” (Classen and Steup 1919) on a passage like Thucydides, Histories I 91.1 (κατηγορούντων ὅτι τειχίζεταί τε καὶ ἤδη ὕψος λαμβάνει) is indeed an example of discursive coreference resolution for an implied subject: the note not only signals that a subject of λαμβάνει must be integrated, but it also suggests the word from the context that must fill this role. In a tectogrammatical tree, the same operation is performed in two phases: 1. ellipsis reconstruction of a valency element (the subject), where a new node is introduced and labelled as a fictitious “personal pronoun”; 2. the newly generated node is then linked to the word it refers to, namely τεῖχος (coreference resolution).

[12] See especially Dik (1995); Dik (2007) and Matić (2003).

[13] One exception to this rule is represented by the structural heads of coordinate clauses, which are not annotated for information structure.

[14] On the distinction between “referential given” and “relational given” see Gundel and Fretheim (2004).

[15] The work of J. Firbas (see e.g. 1992) is central for the notion of “communicative dynamism”.

[16] Contrastive topic is usually marked by a secondary intonation stress: see Büring (2003) and Veselá et al. (2003) for Czech. For contrastive topic in Greek, see Dik (1995:27) and the example of Herodotus, Histories II 35.3 discussed there.

[17] For the algorithm used to assign contextually bound and non-bound elements to topic or focus see most recently Hajičová (2012).

[18] For the English translations of Thucydides I have consulted especially Jowett (1881) and Smith (1919).

[19] The former is the typical case of contrastive μέν…δέ clauses: see Dik (1995). For the latter, consider the following dialogue: “Q: How is your sister?  A: My younger sister is FINE” (capital letters indicate the words where the main stress lies and thus the focus); the answer implies that the speaker has at least one other (older) sister for whom the content of the utterance does not hold.

[20] Cf., e.g., 1.31.2: τούτοισι γὰρ ἐοῦσι γένος Ἀργείοισι βίος τε ἀρκέων ὑπῆν καὶ πρὸς τούτῳ ῥώμη σώματος τοιήδε· ἀεθλοφόροι τε ἀμφότεροι ὁμοίως ἦσαν, καὶ δὴ καὶ λέγεται ὅδε ὁ λόγος, where the strategy is redoubled. Most notably, see 3.1.1, with the remarks of Dik (1995:56, n. 89) on the peculiar position of the constituent δι’ αἰτίην τοιήνδε at the end of the sentence as “added Focus”. Many more examples of τοιόσδε and ὅδε referring forward can be found in Powell (1938:257-8 and 358 respectively).

[21] It must be noted however that the reordering of the nodes, which is shown here as an example, has not been undertaken yet for the other sentences.

[22] Cf. Gomme (1945:256): “ἦλθον is pluperfect in sense”.

[23] This is the case with Herodotus, Histories I 74.2, quoted by Rijksbaron (2002): [it happened (συνήνεικε) that the day was suddenly turned to night], τὴν δὲ μεταλλαγὴν ταύτην τῆς ἡμέρης Θαλῆς…προηγόρευσε ἔσεσθαι, “Thales had predicted etc.”; the chronological priority of Thales’ prediction is also expressed by the preverb προ-, which yields the meaning “declare beforehand”.

[24] On the aorist in Thucydides’ Histories see also Bakker (2007), who draws attention to the performative meaning of the tense. This and many other complex facets of the meaning and function of Ancient Greek tenses are not easy to accommodate in the schema designed for the Czech language. Clearly, a more fine-grained and suitable system must be crafted. For the present, only the tenses that carry an unequivocal temporal meaning have been annotated, while for the others the grammateme is left empty.

[25] Cf. 97.2: τοσάδε ἐπῆλθον πολέμῳ τε καὶ διαχειρίσει πραγμάτων μεταξὺ τοῦδε τοῦ πολέμου καὶ τοῦ Μηδικοῦ and 118. ἐν ἔτεσι πεντήκοντα μάλιστα μεταξὺ τῆς τε Ξέρξου ἀναχωρήσεως καὶ τῆς ἀρχῆς τοῦδε τοῦ πολέμου.

[26] Cf. Gomme (1945:363 n.1): “it is a not unnatural inference that 89-96 is in fact the beginning of a rewriting of the whole excursus”. The problem is also connected with the (problematic) chronological relation between I, 97.2 and the work of Hellanikos, on which see Hornblower (1991:147-9).

[27] The OCT edition of Jones and Powell is reproduced in the Perseus Digital Library and was therefore used as the basis for the treebank.

[28] Obviously, in the tectogrammatical representation the verbal node of these two trees is reconstructed from the previous sentence, as if εἶπεν were carried over. See Figure 6.