November 1, 2004

HCI Comments X

SpeechActs: A Spoken-Language Framework, Paul Martin, Frederick Crabbe, Stuart Adams, Eric Baatz, Nicole Yankelovich, IEEE Computer, July 1996, pp. 33-40

Note: I am not very excited about speech technology in general, which somewhat colors my response to this reading. I am not convinced that the translation problem from natural language (speech) to some machine-interpretable format is in fact one of the major challenges in HCI. Speech is an effective means of communication between a small number of co-located humans in the absence of other communication channels. That does not automatically make it a good candidate for HCI (cf. my stance on imitating human-to-human interaction methods in HCI in general from the last reading response).

SpeechActs' chosen scenario is information delivery for business travelers. The presented example fails to convince me of the usefulness of such a system. Granted, the article was written in 1996 - but where are business travelers faced with a situation that allows for phone calls but not internet access today? In hotel rooms, phones have data ports. In residences, you can unplug a phone and plug in your modem cord. Business-oriented cell phones can be used as modems through their infrared or Bluetooth ports. That leaves pay phones - which are getting harder to find by the day. In addition, some of the tasks performed by SpeechActs do not lend themselves to auditory presentation. The user has to keep her mental model of the SpeechActs application in mind so she knows what she _can_ ask, and at the same time keep track of all the information presented so she knows what she _wants_ to ask. Imagine how long synchronizing schedules will take if there are more than two meeting participants and each participant already has a packed schedule.

The authors' primary goal was to build a speech application toolkit for software developers who do not have expertise in speech or natural language. Constructing the unified grammar seems to be quite a daunting task for such linguistically "naive" developers. On the positive side, the authors were careful to construct a future-proof software system by stressing independence from particular recognizer/TTS implementations and by supporting multiple applications to service voice requests. They also acknowledge that some of the challenges for speech systems are not related to technical implementation, but rather to human expectations. Prior work was not surveyed in enough detail to judge the specific contributions that SpeechActs made.


The Audio Notebook, Lisa Stifelman, Barry Arons, Chris Schmandt, CHI 2001: ACM Conference on Human Factors in Computing Systems, pp. 182-189

Here a more promising application area for voice technology is demonstrated: many situations exist where capturing an original audio stream is quite important because it comes from an authoritative source and will only be produced by that source once. Reviewing that original recording in random-access fashion is complicated and frustrating with the existing technology used by the target audience (students and reporters; tape recorders). The abstract problem the paper addresses is automatic semantic segmentation of time-based media.
The authors provide an intriguing solution by augmenting a familiar interface that most members of the target audience already use - the paper notepad. A range of uses is supported to allow different interaction styles with the audio notebook: users can continue previous note taking activity without having to adjust at all - or change what and how information is written down to further simplify review later on.

The designers chose to employ the audio notebook both as the input device during note taking and as the output device during review. The requirements of these two processes can be quite different, so we should not assume that a single interface will present an optimal solution. My personal preference would be for a central storage server that unites information from multiple input devices. This way one could recall the recording and hand-written notes, but also additional documents like lecture slides and PDF articles, from one device connected to the central information server. The audio scrollbar is a low-bandwidth interface - low in information content and resolution. Also, phrase-snapping and segmentation make the audio scrollbar display non-linear, which makes it harder for users to predict how far in time the audio will jump when they select a different LED. A graphical representation of the audio stream with additional segmentation mechanisms would provide for richer interaction. Allowing other applications to access the recorded voice data on a central storage server would also enable post-processing to improve the fidelity of the audio signal.
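To make the non-linearity point concrete, here is a minimal sketch of how snapping can distort an otherwise linear scrollbar. It is my own illustration, not the paper's implementation: the function names, the LED count, and the rule of snapping backward to the start of the enclosing phrase are all assumptions.

```python
from bisect import bisect_right

def led_to_time(led, num_leds, duration):
    """Linear mapping from an LED index to a nominal time offset (seconds)."""
    return duration * led / num_leds

def snap_to_phrase(t, phrase_starts):
    """Snap t back to the latest phrase start at or before it.

    phrase_starts must be sorted; the backward-snapping rule is an assumption.
    """
    i = bisect_right(phrase_starts, t) - 1
    return phrase_starts[max(i, 0)]

def jump_targets(num_leds, duration, phrase_starts):
    """Playback position actually reached for each LED after snapping."""
    return [
        snap_to_phrase(led_to_time(led, num_leds, duration), phrase_starts)
        for led in range(num_leds)
    ]

# Hypothetical 100-second recording, 5 LEDs, unevenly spaced phrases.
targets = jump_targets(5, 100, [0, 7, 35, 38, 90])
print(targets)
```

With these made-up phrase boundaries, equally spaced LEDs land on unequally spaced playback positions, and several LEDs collapse onto the same phrase start. That is the kind of jump a user cannot anticipate from the LED position alone.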

The authors' solution to the problem of incorrect segmentation by the topic suggestion algorithm is not very satisfying. If the audio will be reviewed multiple times, some direct user intervention to correct the segmentation can be valuable. This would once again be a relatively easy task if other software could access the audio recordings. For music, tools like Steinberg's Recycle or the FruityLoops BeatSlicer perform semi-automatic, user-correctable segmentation.
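As a rough sketch of what user-correctable segmentation could look like once the recording is accessible to other software: automatic boundaries are kept as a baseline, and user edits are layered on top. The function and parameter names are hypothetical and do not come from the paper or from the music tools mentioned above.

```python
def apply_corrections(auto_boundaries, added=(), removed=()):
    """Combine automatic segment boundaries (seconds) with user edits.

    Boundaries listed in `removed` are dropped, boundaries in `added`
    are inserted, and the result is returned in time order.
    """
    kept = set(auto_boundaries) - set(removed)
    return sorted(kept | set(added))

def segments(boundaries, duration):
    """Turn sorted boundaries into (start, end) segments covering the recording."""
    edges = [0.0] + [b for b in boundaries if 0.0 < b < duration] + [duration]
    return list(zip(edges[:-1], edges[1:]))

# Hypothetical example: the algorithm suggested three topic changes;
# the user rejects one and adds one it missed.
corrected = apply_corrections([12.0, 47.5, 80.0], added=[63.0], removed=[47.5])
print(segments(corrected, 100.0))
```

Keeping the automatic output and the user edits separate means the segmentation algorithm can be rerun without discarding the corrections.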

The long duration of the field test resulted in very rich usage data. A discussion of shortcomings and directions for future work was lacking.

(It would be interesting to port this work to a Tablet PC - here we already have all the required hardware, save for a decent microphone, and additional processing capabilities. Has anyone done this?)


A Confederation of Tools for Capturing and Accessing Collaborative Activity, Scott Minneman, Steve Harrison, Bill Janssen, Gordon Kurtenbach, Thomas Moran, Ian Smith, Bill van Melle, MM 1995: ACM International Conference on Multimedia, pp. 523-34

Coral, a suite of tools to deal with time-based media, has three foci: capturing interaction unobtrusively, indexing the recordings, and accessing the recordings. A particular, narrow application domain - supporting casual group interaction - is picked to ground the research in real-world requirements. The authors present a useful taxonomy of indices (segmentation marks): intentional annotations, side-effect indices, derived indices and post hoc indices. Furthermore, capture and access situations are distinguished clearly, with completely different hardware supporting each stage (in contrast to the audio notebook). The loose collection of tools comprising Coral is based around shared communication protocols and interfaces - it is easily extensible. Building access tools is mentioned as the area where most work remains to be done. Bits and pieces for UIs that deal with time-based media exist in hardware and software used in audio and video editing (time lines, jog shuttles, interval-based selection, multi-tracking). Uniting them in a common time-based framework would be worthwhile.
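One way to picture such a common time-based framework is a single index record shared by all tools, tagged with the four kinds from the paper's taxonomy. The sketch below is my own illustration: the class names, fields, and example tools are assumptions and are not Coral's actual protocol or data model.

```python
from dataclasses import dataclass
from enum import Enum

class IndexKind(Enum):
    """The four index types named in the paper's taxonomy."""
    INTENTIONAL = "intentional annotation"
    SIDE_EFFECT = "side-effect index"
    DERIVED = "derived index"
    POST_HOC = "post hoc index"

@dataclass(frozen=True)
class Index:
    time: float       # seconds from the start of the session
    kind: IndexKind
    source: str       # hypothetical name of the tool that produced the index
    label: str

def indices_in_interval(indices, start, end, kinds=None):
    """Return indices with start <= time < end, optionally filtered by kind, in time order."""
    return sorted(
        (ix for ix in indices
         if start <= ix.time < end and (kinds is None or ix.kind in kinds)),
        key=lambda ix: ix.time,
    )

# Hypothetical session data from three different tools.
session = [
    Index(95.0, IndexKind.DERIVED, "speaker-detector", "speaker change"),
    Index(30.0, IndexKind.INTENTIONAL, "note-pad", "action item"),
    Index(62.5, IndexKind.SIDE_EFFECT, "whiteboard", "page flipped"),
    Index(40.0, IndexKind.POST_HOC, "review-tool", "key decision"),
]
for ix in indices_in_interval(session, 0, 90):
    print(ix.time, ix.kind.value, ix.label)
```

Because every index carries only a timestamp and a kind, an access tool such as a timeline or a jog shuttle could query all of them through one interval-based call, regardless of which capture tool produced them.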

Posted by Bjoern Hartmann at November 1, 2004 4:03 AM