Final report

IBM 4: VRML+VoiceXML

Pavel Strnad
Jiří Pokorný
Aleš Friedl

Winter term 2005/2006

1 Project task

We received the following task from IBM:

Investigate existing free VRML browsers. Choose a suitable one and extend it with an existing voice recognizer (supplied with documentation and a precisely defined interface) and with a TTS (Text-to-Speech) engine. The goal is to be able to create simple 3D multimodal applications. Consider different variants of interconnection: a VRML extension, a VXML browser, a connection with stand-alone ASR (Automatic Speech Recognition) and TTS. Describe the advantages of the particular variants and implement a suitable one.

2 Solution analysis

2.1 Purpose and demands delimitation

Our first task was to investigate the possibilities of interconnecting the modalities.

First of all, we wanted our product to be as simple and general as possible. By generality we mean the widest range of usability for multimodal application development.

Further, we required maximum independence of the individual modules (VRML browser, ASR, TTS). Last but not least, we wanted to preserve compatibility with the VRML and VXML standards.

2.2 Resources from submitter

At the start of the project we received the Embedded ViaVoice software package, which provides an API for ASR and TTS applications.

2.3 Evaluation of solution possibilities

We determined a few criteria to evaluate the solutions:

  1. complexity of application development
  2. limitations of possibilities for application development
  3. preservation of the VRML and VXML standards

2.3.1 Embedded ViaVoice + VRML browser

The first possibility we considered was to connect Embedded ViaVoice and the VRML browser with a set of simple functions performing more complex tasks, and to write dedicated scripts ensuring the correct event flow to and from VRML. In other words, it means writing a program which both controls the voice input/output via Embedded ViaVoice and communicates with our VRML browser. Such communication would be possible through our class "Comm" using RMI. How it would work is shown in image no. 1.

Image no. 1
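
For illustration, the remote interface of the "Comm" class might have looked like the following sketch; the method names and signatures are our own illustration, not a final design.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface of the "Comm" class. The application
// controls voice input/output locally through Embedded ViaVoice and
// exchanges events with the VRML browser through calls like these.
public interface Comm extends Remote {
	// send an event into the VRML scene (e.g. set a field of a DEF'd node)
	void sendVrmlEvent(String node, String eventIn, String value) throws RemoteException;

	// register interest in an eventOut of a node
	void subscribeVrmlEvent(String node, String eventOut) throws RemoteException;

	// read the last value produced by a subscribed eventOut
	String readVrmlEvent(String node, String eventOut) throws RemoteException;
}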

This solution was not suitable, because the whole application logic would reside not only in the Application module, but also in VRML (in the form of scripts), which makes the final program less extensible. The advantage of such a solution lies in the possibility of writing almost any conceivable application.

Resolution:
- complicated writing of applications
- fragmentation of the application logic
- poor extensibility
+ many possibilities

Another thing we did not want to do was to include the dialogs directly in the application. The common standard for developing voice applications is VoiceXML. Unfortunately, we did not have any VoiceXML browser, so we considered writing a simple implementation of one using Embedded ViaVoice, but all in all this did not look like a well-chosen solution. During our next consultation with the IBM specialists we found out that a VoiceXML browser already existed, and within a short time it was placed at our disposal. So we decided to use VoiceXML.

2.3.2 VoiceXML Browser + VRML browser

The next task was to find a suitable VRML browser. We decided to communicate with the browser via the EAI (External Authoring Interface), which makes it possible to control the VRML browser from our program. We finally chose the Xj3D browser, which provides this interface in Java.
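
For illustration, connecting to a world through the standard EAI classes (package vrml.eai, which Xj3D implements) looks roughly like this sketch; the file URL and node name are placeholders:

import vrml.eai.Browser;
import vrml.eai.BrowserFactory;
import vrml.eai.Node;
import vrml.eai.VrmlComponent;
import vrml.eai.event.VrmlEvent;
import vrml.eai.event.VrmlEventListener;
import vrml.eai.field.EventOut;

public class EaiDemo {
	public static void main(String[] args) {
		// create an embeddable browser component and obtain its Browser handle
		VrmlComponent component = BrowserFactory.createVrmlComponent(null);
		Browser browser = component.getBrowser();

		// load a world (placeholder URL); a real program should wait
		// until the world has finished loading before touching nodes
		browser.loadURL(new String[] { "file:///c:/gallery.wrl" }, new String[0]);

		// look up a DEF'd node and listen to one of its eventOuts
		Node sensor = browser.getNode("ps_Sphere1");
		EventOut enterTime = sensor.getEventOut("enterTime");
		enterTime.addVrmlEventListener(new VrmlEventListener() {
			public void eventChanged(VrmlEvent event) {
				System.out.println("avatar entered the sensor region");
			}
		});
	}
}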

The VoiceXML browser obtained from IBM was implemented in C++, but we also received a simple Java interface to it. Unfortunately, the interface came without documentation; we only had a very small example of how to work with it, so we had to discover its capabilities mainly by trial and error.

We decided to write our project in Java, mainly because both the VRML and VoiceXML interfaces were accessible from Java.

2.4 Vving

Our application, which we named Vving, talks to both the VoiceXML browser and the VRML browser, so we had to find a good definition of the communication between the two modules.

At first we decided to specify a simple XML language, because it would be the easiest way for users to create their multimodal applications. Our XML specification included some basic functions for manipulating VRML and VXML.

During our consultation at IBM we were advised to first design a Java interface, which would be much easier to adapt to the needs of applications, and to propose a suitable XML format only after the interface was finished. Since we did not have enough time to develop the XML to adequate complexity, our proposal is included in appendix 1; it does not yet describe the whole functionality of our interface.

So what we had to build was a Java library suitable for writing multimodal applications. The fragmentation of the application logic across the application, VRML, and VoiceXML was still painful, so we decided to separate the application logic from the interface logic completely.

Our goal was to provide users with an API for a simple and well-arranged connection between VRML and VXML, so that they do not have to modify the VRML or VXML documents. The core of our goal is the ability to work with events in both modalities and to synchronise them safely.

Image no. 2 - the architecture (component model)

2.5 Locations

Because we wanted to give users a choice of how their applications will look, we decided to keep VRML and VXML as equal as possible within our framework in terms of dominance. Users can then write applications where VoiceXML guides the user through a VRML world, as well as complex VRML worlds where just one avatar speaks.

Non-dominance of either modality means that the final users of an application can choose how to control it (via voice, via the VRML browser, or even both). Let's imagine the following example:

We have an object the user can interact with in two ways: either by saying the right voice command, or by navigating to it with the mouse/keyboard. The moment he reaches the object, we have to start another VoiceXML form, which says what to do next and waits for an answer. If the user arrived via a voice command, we have to keep the application and the user's avatar consistent, which means we must move the user to the object. Similarly, we have to start the voice dialog if the user navigated to the object in VRML.

When the user approaches by voice command, there is a problem: while the user is being moved to the object, the object sensor that maintains consistency is still active and invokes an event originally meant to be invoked by the user navigating in VRML. The action performed by this sensor would move the user to the right place in VXML, but the user is already there. The situation is depicted in the following image:

Image no. 3 - cascaded events

To solve this conflict we introduced the system of locations. The user can be in only one location at a time, and when he leaves this location, special actions are performed which ensure a safe move to another location. In the previous example, being at one object would be represented by one location, being at another object by another location.

A location can be imagined as a set of event handlers which are always active together; basically, a location defines an application state. An important feature of a location is the ability to ensure a safe transition to another location. It therefore includes a mechanism for handling so-called exit events, which are simply the exit points from the location. These event handlers are specified by the user and have one special property: they are executed exclusively. While one exit event is being processed, all other exit event handlers are inactive, so we can move to another location without worrying about conflicts with similar events in the second modality. Processing an exit event should always finish by entering some location, either a different one or the same one we were just leaving.

In the previous example, both event handlers would be part of the location outside the object, and both would be exit events.

Because we decided to use locations as the base of the system, it was useful to also allow event handlers independent of any standard location. Hence we defined a special location, called the superlocation, which hosts these location-independent event handlers. Locations completely encapsulate event handlers and hide the events of the underlying subsystems.
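
The exclusive behaviour of exit events can be illustrated by the following simplified sketch of a location's internals; the real framework is documented in the javadoc, and the names here are only illustrative.

import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of the exit-event mechanism: whichever modality
// fires its exit event first wins, and the remaining exit handlers
// stay inactive until a new location is entered.
class LocationSketch {
	// name of the handler currently processing an exit event, or null
	private final AtomicReference<String> exiting = new AtomicReference<String>(null);

	// called by both VRML and VXML exit event handlers
	void fireExitEvent(String handlerName, Runnable action) {
		// only the first handler to arrive may run; the others are dropped
		if (exiting.compareAndSet(null, handlerName)) {
			action.run(); // the action must end by entering some location
		}
	}

	// entering a location re-arms the exit events
	void enter() {
		exiting.set(null);
	}
}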

3 Implementation

The implementation is completely documented in appendices 1 and 2 and in the documentation of the framework (javadoc format).

4 How to use our framework

4.1 Initialization of components

We can assume that the VRML and VXML files are ready and functional, so we only have to connect them. First, the application should register an object inherited from the class ApplicationState. This listener receives messages about the initialization of both the VRML and VXML subsystems.

VvingFactory.setStateListener(new State());

The next step is loading the appropriate documents.

VvingFactory.getVxmlCommander().loadURL("file:///c:/gallery.vxml");
VvingFactory.getVrmlCommander().loadURL("file:///c:/gallery.WRL");

4.2 Setting up event handlers

Once the documents are loaded and the VRML subsystem is initialized, the OnVRMLInitialized() method of the State object is called. Now the locations should be created, all events we want to catch registered, the first VXML dialog started, and the first location entered via gotoLocation. The application is then initialized and waits for events. The following source code is from "shapes" (on the CD).

class State extends ApplicationState {
	public void OnVRMLInitialized() {
		try {
			ISuperLocation s = VvingFactory.getSuperLocation();
			
			// create a new location
			Location noshape = new Location("noshape");
			
			// create node handles to which we will attach event listeners
			VRMLNode sphere = new VRMLNode("ps_Sphere1");
			VRMLNode box = new VRMLNode("ps_Box1");
			VRMLNode cone = new VRMLNode("ps_Cone1");
			
			VRMLNode sphere2 = new VRMLNode("ps_Sphere1");
			VRMLNode box2 = new VRMLNode("ps_Box1");
			VRMLNode cone2 = new VRMLNode("ps_Cone1");
			
			// add event listeners on fields
			sphere.addVRMLEventListener(new ProximitySensorIn("sphere"), "enterTime");
			box.addVRMLEventListener(new ProximitySensorIn("box"), "enterTime");
			cone.addVRMLEventListener(new ProximitySensorIn("cone"), "enterTime");
			
			// same for VXML
			VXMLForm vxmlForm = new VXMLForm("shapeform");
			vxmlForm.addVXMLFieldListener(new VxmlNoShape(),"color");
			
			// add listeners to location
			noshape.getVRMLExitInterface().addVRMLNode(sphere);
			noshape.getVRMLExitInterface().addVRMLNode(box);
			noshape.getVRMLExitInterface().addVRMLNode(cone);
			noshape.getVXMLExitInterface().addVXMLForm(vxmlForm);
			
			// second location
			Location shape = new Location("shape");

			sphere2.addVRMLEventListener(new ProximitySensorOut(),"exitTime");
			box2.addVRMLEventListener(new ProximitySensorOut(),"exitTime");
			cone2.addVRMLEventListener(new ProximitySensorOut(),"exitTime");                
			VXMLForm vxmlForm2 = new VXMLForm("noshape");
			vxmlForm2.addVXMLFieldListener(new VxmlShape(),"fshape");

			VXMLForm vxmlFormColor = new VXMLForm("shapeform");
			vxmlFormColor.addVXMLFieldListener(new VxmlChangeShape(), "color");

			shape.getVRMLExitInterface().addVRMLNode(sphere2);
			shape.getVRMLExitInterface().addVRMLNode(box2);
			shape.getVRMLExitInterface().addVRMLNode(cone2);                  
			shape.getVXMLExitInterface().addVXMLForm(vxmlForm2);
			shape.getVXMLInterface().addVXMLForm(vxmlFormColor);

			// run VXML dialog
			VvingFactory.getVxmlCommander().runDialog("intro");
		} catch (VvingException e) {}

		// go to an initial location
		VvingFactory.getSuperLocation().gotoLocation("noshape");
		super.OnVRMLInitialized();
	}
}

4.3 Event handlers

An event handler is an object inherited from VRMLEventListener or VXMLFieldListener. Actions in the VRML or VXML subsystem are performed via the VvingFactory.getVxmlCommander() and VvingFactory.getVrmlCommander() interfaces. An event handler should end with a transition to the next location. An example of a handler from "shapes":

class ProximitySensorOut extends VRMLEventListener {
	public void event(VrmlEvent e) {
		VxmlBrowserCommander vxmlCommander = VvingFactory.getVxmlCommander();

		// fill vxml form
		vxmlCommander.setField("color","back");

		// go to new location
		VvingFactory.getSuperLocation().gotoLocation("noshape");
	}           
}
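
The VXML-side handlers from "shapes" (VxmlShape and the others) follow the same pattern. The following sketch assumes an event object analogous to VrmlEvent and a hypothetical gotoViewpoint() commander method; the exact signatures are in the javadoc.

class VxmlShape extends VXMLFieldListener {
	public void event(VxmlEvent e) {
		// move the avatar in VRML to the shape the user named
		// (gotoViewpoint is a hypothetical commander method)
		VvingFactory.getVrmlCommander().gotoViewpoint((String) e.getValue());

		// go to the location active while standing at a shape
		VvingFactory.getSuperLocation().gotoLocation("shape");
	}
}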

5 Use cases

While developing the project, we directed the design so as to cover as many possibilities in user applications as we could foresee.

We knew that we would have to demonstrate our framework in an interactive gallery application.

So we started collecting features that would be useful to a gallery visitor. We developed two modes: the first is free walking through the gallery (walking in VRML, giving commands to VoiceXML, asking for information about the pictures); the second is the exam mode. In both modes we considered using a virtual guide, represented by a 3D robot model. While guiding the user through the gallery, the robot could communicate with him; for example, the user could ask the robot "Where can I see a Picasso painting?". We could not implement this feature because we did not have enough time.

The version we implemented provides the free walking mode and the exam mode, both without the virtual guide.

The exam proceeds this way: when the user says "exam", a random picture number is generated, the user is taken to that painting and is asked "Who is the author of this painting?". The user has three answer choices: a, b, and c. The dictionary of the VoiceXML browser does not include all the words we needed, especially the authors' names. It is possible to extend the dictionary, but only with special software that we did not receive from IBM. IBM offered that we could send them the words we needed, but we did not have enough time to do that. In the end we found we did not need it anyway, because of the difficult pronunciation of the authors' names. After the user's response, the answer is evaluated and the result is spoken. Then a new random picture number is generated and the procedure repeats. The user can finish the test with the words "end" or "results".
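
A simplified sketch of how the "exam" command could be wired with the framework API described in section 4; the picture count, viewpoint names, and dialog names are placeholders:

import java.util.Random;

// Hypothetical handler for the "exam" voice command: pick a random
// picture, move the user to it, and start the question dialog.
class ExamCommand extends VXMLFieldListener {
	private static final int PICTURE_COUNT = 12; // placeholder
	private final Random random = new Random();

	public void event(VxmlEvent e) {
		int picture = random.nextInt(PICTURE_COUNT) + 1;

		// move the avatar to the chosen painting (hypothetical call)
		VvingFactory.getVrmlCommander().gotoViewpoint("picture" + picture);

		// ask "Who is the author of this painting?" via a VXML dialog
		VvingFactory.getVxmlCommander().runDialog("question" + picture);

		// stay in the exam location; the answer (a, b or c) arrives
		// through another handler registered on the question form
		VvingFactory.getSuperLocation().gotoLocation("exam");
	}
}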

Free walking proceeds this way: the user can get in front of a picture in two ways, either by saying the picture's number, or by walking in front of the picture in VRML. In both cases he receives information about the picture. Another way to navigate is with the commands "next" and "previous", which move the user to the next or previous picture.

In both modes the user can ask for help with the command "help".

Another use case for our framework is, for example, a virtual railway station, where the user would use information services to find a connection or to ask where a platform is. Such a use case would need a database connection for searching the connections.

An e-shop is another interesting use case, but it also expects a database connection for easy management of the stock (VRML models).

6 Future

There is still a long way to go before the Vving framework can be used in standard practice, but we hope that our project has taken the right direction.

In chapter 2.4 we mentioned the XML format vvxml, which is in an early form and is not implemented yet. The draft does not yet account for, e.g., ECMA scripts, which would improve vvxml in many ways; scripts would be especially useful for complex applications like an e-shop or a frontend to an information system. For now it is still necessary to code the application logic in Java and to duplicate VoiceXML variables in the application. The duplication problem should be solved in the near future by a modification of the Java interface of the VoiceXML browser.

The next feature to be implemented in the near future is communication with databases, which will improve the usability of the framework for e-shops and information systems. Another task is to make applications runnable as applets for viewing in web browsers. We believe we chose the right technology: Java is well suited to internet application development.

Conclusion

This project was a really good experience for us; we encountered many technologies we had not used before. It was a great opportunity to communicate with people from a big company, IBM, and to discuss problems with them. Communication proceeded regularly and without problems on both sides.

Our goals in this project were achieved within certain limits, given mainly by the VoiceXML browser and its Java interface jplusV. What we missed was a way to stop the processing of a VoiceXML document. This caused problems, e.g., when the user was being moved to an image and the information about the image was played during the movement, with no simple way to wait until he/she arrives at the image. Setting variables in the VoiceXML browser and getting their values back was not easy either, so we actually had to keep them not only in VoiceXML but duplicated in the application as well.

Finally, we would like to thank everybody who participated in this project, mainly our supervisor Ing. Adam Sporka and Ing. Tomáš Macek from IBM.

Appendix 1: vvxml proposal

<!-- VVxml draft 0.5 -->

<!--

<vvxml>    params: ver = "0.5" (required)
           child:  <vxmlform>, <var>, <vrmlobject>, <vblock>, <vxmlevent> (many)
           usage:  Root tag.

*****************
*** TOP-LEVEL ***
*****************


<vblock>   params: id = <vblock id>
           child:  <vrmlgoto>, <vrmljump>, <vrmlsend>, <vrmlrun>, <vxmlrun>, <vxmlabort>, <if>, <vxmlsetfield>, <vxmlgetvar>, <vrmlset>
           usage:  Container for action tags.

<var>      params: name = <name string>
                   value = <value> (optional, default: 0)
           child:  none
           usage:  Declares a variable.

*** TOP-LEVEL / VXML ***

<vxmlevent>  params: id = <vxml tag id>
                     type = {DOMActivate | DOMFocusIn | DOMFocusOut}
             child:  <vblock>
             usage:  Event handler for the vxml subsystem. Id corresponds to a tag id in the vxml document. DOMActivate is fired for forms; DOMFocusIn and DOMFocusOut are fired by form fields.

<vxmlform>   params: id = <vxml form id>
             child:  <vxmlmap>
             usage:  Maps form fields in the vxml subsystem to variables in vvxml.

<vxmlmap>    params: vfield = <vxml field>
                     var = <user's var> (optional; see tag <var>)
             child:  <vblock>
             usage:  Maps a form field to a vvxml variable.

<vxmlgetvar> params: id = <vxml variable id>
                     var = <user's var> (optional, default = id)
             child:  none
             usage:  Copies the value of a variable in the vxml subsystem to a user variable.

*** TOP-LEVEL / VRML ***

<vrmlobject> params: id = <vrml object>
             child:  <vrmlmap>, <vrmlroute>
             usage:  Denotes an object (defined by DEF) whose fields will be mapped to vvxml variables.

<vrmlmap>    params: exposedField = <string>
                     var = <user's var>
             child:  <vblock>
             usage:  Maps an object field to a vvxml variable.

<vrmlroute>  params: event = <vrml event name>
                     type = <vrml type string>
             child:  <vblock>
             usage:  Event handler for the vrml subsystem.

**************
*** VBLOCK ***
**************

<if>       params: cond = <expression>
           child:  <elseif>, <vblock>
<elseif>   params: cond = <condition>
           child:  <vblock>
           usage:  Conditional execution of a vblock.

*** VBLOCK / VRML ***

<vrmlgoto> params: dest = <vrml dest name>
           child:  <checkpoint> (zero or many)
           usage:  Interpolated movement of the user avatar to the specified place. Waypoints may be defined.

<vrmljump> params: dest = <vrml dest name>
           child:  none
           usage:  Moves the user avatar instantly to the specified place.

<vrmlsend> params: dest = <vrml dest>
                   evt = <vrml event name>
                   type = <string>
                   value = <string>
           usage:  Sends an event to the vrml subsystem.

<vrmlrun>  params: src = <vrml source>
           child:  none
           usage:  Loads and runs a vrml document.

<vrmlset>  params: dest = <vrml field>
                   type = <type>
                   value = <value>
           usage:  Sets the value of a field in the vrml subsystem.

*** VBLOCK / VXML ***

<vxmlrun>  params: src = <vxml source>
                   id = <vxml element id>
           child:  none
           usage:  Loads and runs a vxml document.

<vxmlabort>
           usage:  Aborts vxml document execution.

<vxmlsetfield> params: id = <vxml field id>
                       value = <value>
               usage:  Sets a field in a vxml form to the specified value.

****************
*** VRMLGOTO ***
****************

<checkpoint> params: name = <vrml sens checkpoint>
             child:  none
             usage:  Defines a waypoint in an interpolated movement (tag <vrmlgoto>).

-->
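
To illustrate the proposal, here is a short hypothetical vvxml document combining the tags above; the node, form, and viewpoint names are placeholders.

<vvxml ver="0.5">
  <var name="current" value="0"/>

  <!-- when the proximity sensor of picture 1 fires, play its description -->
  <vrmlobject id="ps_Picture1">
    <vrmlroute event="enterTime" type="SFTime">
      <vblock id="picture1entered">
        <vxmlrun src="file:///c:/gallery.vxml" id="picture1info"/>
      </vblock>
    </vrmlroute>
  </vrmlobject>

  <!-- when the user activates the picking form by voice, move the avatar -->
  <vxmlevent id="pickform" type="DOMActivate">
    <vblock id="spoken">
      <vrmlgoto dest="picture1">
        <checkpoint name="corridor"/>
      </vrmlgoto>
    </vblock>
  </vxmlevent>
</vvxml>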

Appendix 2: UML static class diagram

[UML static class diagrams of the packages vving, vving.vxml, and vving.vrml]