Figure 2: Dependency structure for John said that the cat was on
the table.
Since linguistic descriptions tend to be at a high level of abstrac-
tion, there will be a certain amount of unpredictability in the graph-
ical result. This same tradeoff is seen in computational models of
behavior [21, 12] and natural phenomena. We also acknowledge
up front that it is infeasible to fully capture the semantic content of
language in graphics. But we do believe that a large number of
interesting 3D scenes can be described and generated directly through
language, and likewise that a wide variety of text can be effectively
depicted.
In the remainder of this paper, we describe each of the com-
ponents of WordsEye, starting with an overview of the linguistic
analysis techniques used.
2 Linguistic Analysis
The text is initially tagged and parsed using a part-of-speech tagger
[7] and a statistical parser [9]. The output of this process is a parse
tree that represents the structure of the sentence. Note that since the
parser is statistical, it will attempt to resolve ambiguities, such as
prepositional phrase attachments, according to the statistics of the
corpus on which it was trained (the Penn Treebank [19]). The parse
tree is then converted into a dependency representation (see [16],
inter alia) which is simply a list of the words in the sentence, show-
ing the words that they are dependent on (the heads) and the words
that are dependent on them (the dependents). Figure 2 shows an
example dependency structure, with arrows pointing from heads to
their dependents. The reason for performing this conversion from
parse tree to dependency structure is that the dependency represen-
tation is more convenient for semantic analysis. For example, if we
wish to depict the large naughty black cat we might actually have
no way of depicting naughty, but we still would like to depict large
and black. To do this we need merely to look at cat’s dependents
for depictable adjectives, which is in general simpler than hunting
for depictable modifiers in a tree structure headed by cat.
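This dependent-scanning step can be sketched with a toy example. The dictionary, the adjective whitelist, and the function name below are our own illustrative inventions, not part of WordsEye itself:

```python
# Toy dependency structure: each head word maps to its list of dependents.
# For "the large naughty black cat", cat heads the determiner and adjectives.
DEPENDENTS = {
    "cat": ["the", "large", "naughty", "black"],
}

# Hypothetical whitelist of modifiers the renderer knows how to depict;
# naughty is absent, so it is silently skipped.
DEPICTABLE_ADJECTIVES = {"large", "black", "small", "red"}

def depictable_modifiers(head, dependents=DEPENDENTS):
    """Return the depictable adjectives among a head word's dependents."""
    return [d for d in dependents.get(head, []) if d in DEPICTABLE_ADJECTIVES]

print(depictable_modifiers("cat"))  # ['large', 'black']
```

With a flat list of head-to-dependent links, this check is a single pass over cat's dependents; the equivalent search over a constituency tree would have to walk an arbitrarily nested structure.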
The next phase of the analysis involves converting the depen-
dency structure into a semantic representation. The semantic repre-
sentation is a description of the entities to be depicted in the scene,
and the relationships between them. The semantic representation
for the sentence John said that the cat was on the table is given in
Figure 3. The semantic representation is a list of semantic repre-
sentation fragments, each fragment corresponding to a particular
node of the dependency structure. Consider “node1”, which is the
semantic representation fragment for the action say, deriving from
the node say in the dependency structure. The subject is “node2”
(corresponding to John), and the direct object is the collection of
“node5”, “node4” and “node7”, corresponding to nodes associated
with the subordinate clause that the cat was on the table. Each
of these nodes in turn corresponds to a particular node in the
dependency structure, and will eventually be depicted by a given
3D object: so John will be depicted (in the current system) by a
humanoid figure we call “Mr. Happy”, and table will be depicted
by one of a set of available 3D table objects.
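The fragment list of Figure 3 can be mirrored in ordinary dictionaries. The node names and model identifiers below follow the figure; the dictionary encoding and the resolver function are our own sketch of how an entity node might be bound to one concrete 3D model:

```python
# Dict-based sketch of the semantic fragments in Figure 3.
FRAGMENTS = {
    "node2": {"type": "entity", "objects": ["mr_happy"],
              "lexical_source": "John"},
    "node1": {"type": "action", "verb": "say", "subject": "node2",
              "direct_object": ["node5", "node4", "node7"]},
    "node5": {"type": "entity", "objects": ["cat-vp2842"]},
    "node4": {"type": "stative-relation", "relation": "on",
              "figure": "node5", "ground": "node7"},
    "node7": {"type": "entity", "objects": ["table-vp14364",
                                            "nightstand-vp21374"]},
}

def resolve_3d_object(node_id, fragments=FRAGMENTS):
    """Pick a concrete 3D model for an entity node (here: first candidate)."""
    fragment = fragments[node_id]
    assert fragment["type"] == "entity", "only entity nodes carry 3D objects"
    return fragment["objects"][0]

print(resolve_3d_object("node7"))  # table-vp14364
```

Taking the first candidate is a placeholder policy; the real system chooses among the available models for a word.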
the Mirai 3D animation system from IZware, and uses 3D models from
Viewpoint.
An individual semantic representation fragment as currently used in
WordsEye may seem relatively simple when compared, say, with the PAR
(("node2" (:ENTITY :3D-OBJECTS ("mr_happy")
:LEXICAL-SOURCE "John" :SOURCE SELF))
("node1" (:ACTION "say" :SUBJECT "node2"
:DIRECT-OBJECT ("node5" "node4" "node7")...))
("node5" (:ENTITY :3D-OBJECTS ("cat-vp2842")))
("node4" (:STATIVE-RELATION "on" :FIGURE "node5"
:GROUND "node7"))
("node7" (:ENTITY :3D-OBJECTS
("table-vp14364" "nightstand-vp21374"
"table-vp4098" "pool_table-vp8359" ...))))
Figure 3: Semantic representation for John said that the cat was on
the table.
Semantic representation fragments are derived from the depen-
dency structure by semantic interpretation frames. The appropriate
semantic interpretation frames are found by table lookup, given the
word in question. These frames differ depending upon what kind
of thing the word denotes. For nouns such as cat or table, Words-
Eye uses WordNet [10], which provides various kinds of semantic
relations between words, the particular information of interest here
being the hypernym and hyponym relations. The 3D objects are
keyed into the WordNet database so that a particular model of a cat,
for example, can be referenced as cat, or feline or mammal, etc.
Personal names such as John or Mary are mapped appropriately
to male or female humanoid figures. Spatial prepositions such as
on are handled by semantic functions that look at the left and right
dependents of the preposition and construct a semantic representa-
tion fragment depending upon their properties. Note that there has
been a substantial amount of previous work on the semantics of
spatial prepositions; see, inter alia, [5, 14, 15] and the collections
in [11, 20]; there has also been a great deal of interesting cross-
linguistic work, as exemplified by [22]. There have been only a
small number of implementations of these ideas, however; one
sophisticated instance is [24]. One important conclusion of much of
this research is that the interpretation of spatial relations is often
quite object-dependent, and relates as much to the function of the
object as its geometric properties, a point that ties in well with our
use of spatial tags, introduced below in Section 3.1.
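A semantic function for a spatial preposition can be sketched as follows. The function name is hypothetical; it simply takes the preposition together with its left dependent (the figure) and right dependent (the ground) and emits a stative-relation fragment of the shape seen at node4 in Figure 3:

```python
# Sketch of a semantic function for a spatial preposition: inspect the
# preposition's left dependent (figure) and right dependent (ground)
# and build a stative-relation fragment, as in node4 of Figure 3.
def interpret_spatial_preposition(prep, left_node, right_node):
    return {"type": "stative-relation", "relation": prep,
            "figure": left_node, "ground": right_node}

# "the cat (node5) on the table (node7)"
frag = interpret_spatial_preposition("on", "node5", "node7")
print(frag)
```

In the full system such a function would also consult the properties of the two dependents, since, as noted above, the interpretation of a spatial relation depends on the objects involved.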
Finally, most verbs are handled by semantic frames, which are
informed by recent work on verbal semantics, including [18]. The
semantic entry for say is shown in Figure 4. This semantic entry
contains a set of verb frames, each of which defines the argument
structure of one “sense” of the verb say. For example, the first
verb frame, named the SAY-BELIEVE-THAT-S-FRAME, has as
required arguments a subject and a THAT-S-OBJECT, or in other
words an expression such as that the cat is on the table. Optional
arguments include action location (e.g., John said in the bathroom
that the cat was on the table) and action time (e.g., John said
yesterday that the cat was on the table). Each of these argument
specifications causes a function to be invoked to check the
dependencies of the verb for a dependent with a given property, and assigns
such a dependent to a particular slot in the semantic representation
fragment. WordsEye currently has semantic entries for about 1300
English nouns (corresponding to the 1300 objects described in Sec-
tion 3.1), and about 2300 verbs, in addition to a few depictable ad-
jectives, and most prepositions. The vocabulary is, however, readily
extensible and is limited only by what we are able to depict.
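The slot-filling behavior of a verb frame can be sketched as follows. The frame name echoes the SAY-BELIEVE-THAT-S-FRAME mentioned above, but the property names and the matching logic are our own simplified assumptions:

```python
# Toy verb frame: required and optional argument specifications, each
# mapping a fragment slot to a property sought among the verb's dependents.
SAY_FRAME = {
    "name": "SAY-BELIEVE-THAT-S-FRAME",
    "required": {"subject": "SUBJECT", "direct_object": "THAT-S-OBJECT"},
    "optional": {"location": "ACTION-LOCATION", "time": "ACTION-TIME"},
}

def apply_frame(frame, dependents):
    """Fill frame slots from a {property: node} map; None if a required slot is missing."""
    fragment = {"type": "action"}
    for slot, prop in frame["required"].items():
        if prop not in dependents:
            return None  # this frame does not match; try the next sense
        fragment[slot] = dependents[prop]
    for slot, prop in frame["optional"].items():
        if prop in dependents:
            fragment[slot] = dependents[prop]
    return fragment

# "John said yesterday that the cat was on the table"
deps = {"SUBJECT": "node2", "THAT-S-OBJECT": ["node5", "node4", "node7"],
        "ACTION-TIME": "yesterday"}
print(apply_frame(SAY_FRAME, deps))
```

A verb's semantic entry would hold several such frames, one per sense, with the first frame whose required arguments are all satisfied supplying the semantic representation fragment.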
In addition to semantically interpreting words that denote par-
representation of [3]. But bear in mind that an entire semantic represen-
tation for a whole scene can be a quite complex object, showing relations
between many different individual fragments; further semantic information
is expressed in the depiction rules described below. Also note that part of
the complexity of PAR is due to the fact that that system is geared towards
generating animation rather than static scenes.