
Multimodal Interfaces, Sharon Oviatt, In The Human-Computer Interaction Handbook, Lawrence Erlbaum, 2002, 22 pp.
This article was a let-down for me. Given its publication in the Handbook of HCI, I expected a more balanced, more comprehensive review of multimodal interaction tools. For Oviatt, the term "interface" seems to stand exclusively for "input device". The option of multimodal output is not even mentioned until the closing statement. Furthermore, the article exhibits a strong slant towards architectures that incorporate speech recognition as one of their input channels. I am not sure whether this is because of a lack of research in other modalities or rather a result of the author's own bias. The discussion digresses at times into minute details of speech systems that I felt were misplaced in a survey article. One of the overarching metaphors in the article is that multimodal interfaces afford human-like sensory perception to computers. Human perception, though, is inextricably intertwined with our memory and attention systems. These two important building blocks were completely left out.
While the inserted summary tables help to give the reader an overview of frequently used terms in the article, I think at least one pair of definitions is questionable: the distinction between active and passive input modes. Human gesturing behavior is often quite deliberate and intentional and serves the explicit purpose of communicating certain aspects of the speaker's utterance - especially in deixis. As such it should be classified as "active" (cf. McNeill's "Hand and Mind", which Oviatt cites multiple times).
"High fidelity simulation testing" seems to be a fancy label for Wizard of Oz testing. An important limitation to the applicability of this methodology is the relationship between the response time the wizard needs to select feedback and the expected/acceptable latency of the application for the user/subject. Human reaction time is a good match for speech interfaces, but may not be for other modalities.
On page 12, evidence is presented that multimodal input is complementary rather than redundant. This weakens the previously stated claim that disambiguation is easier in multimodal interfaces. Multiple channels do provide more information, but if this information is about different aspects of user intention, inference across channels is far from trivial.
The silver lining here is the frequent reference to work in cognitive science that can (and must) inform future development in multimodal interfaces.
Interaction Techniques for Ambiguity Resolution in Recognition-Based Interfaces, Jennifer Mankoff, Scott E. Hudson, Gregory D. Abowd, UIST 2000: ACM Symposium on User Interface Software and Technology, pp. 11-20
The authors describe OOPS, a system that encapsulates mediation strategies for recognition-based input devices. More than the particular practical value of the tool, the contribution of the article is its framework of terminology within which one can think about dealing with ambiguous and error-prone input. The discussion of particular function calls in the OOPS framework is a level-of-detail mismatch with the rest of the paper (too detailed). The proposed mediation strategies appear to be context insensitive - they work only on a given atomic level (e.g., a word) without taking the larger structure in which the atom appears (e.g., a sentence) into account. The presented solution for dealing with occlusion may introduce more problems than it solves - in a dense interface or document, moving elements is likely to cause other occlusions. If, on the other hand, the procedure is recursive, it could lead to a lengthy cascade of GUI reorganization that potentially changes the entire visual appearance of the interface for the duration of the dialog display.
Computer Vision for Interactive Computer Graphics, William T. Freeman, Yasunari Miyake, Ken-ichi Tanaka, David B. Anderson, Paul A. Beardsley, Chris N. Dodge, Michal Roth, Craig D. Weissman, William S. Yerazunis, Hiroshi Kage, Kazuo Kyuma, IEEE Computer Graphics and Applications, May 1998, pp. 42-53
The article presents simple, FAST computer vision algorithms for interactive UIs. Computer vision for HCI has different requirements from traditional application areas: results need to be available quickly, but the kind of information sought is often limited (e.g., no complete 3D reconstruction of a scene). Additionally, since a human is in the loop, feedback can be used to allow iteration/adaptation of software and user behavior.
The balance in presentation between mathematical methods and concrete application examples is quite effective as an "appetizer" - some links to textbooks for further exploration would have been useful. Here are two: an accessible introductory text on computer vision methods is "Machine Vision" by Jain, Kasturi, and Schunck (McGraw-Hill 1995). A more in-depth treatment of current research problems, especially in 3D reconstruction, can be found in "Computer Vision: A Modern Approach" by Forsyth and Ponce (Prentice Hall 2002). The latter text requires a well-equipped mental math tool box.
CV-based surgery is used as an early motivating example, which was quite scary for me. I'd rather entrust my health to physical manipulation based interfaces such as those developed by Ken Salisbury.
Just an idea: one could use image pyramids to compute multi-resolution classifiers from coarse to fine. Whenever the real-time system requires a response, one can return the last completed resolution calculation as the current "best guess". Has anyone done this yet?
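To make the idea concrete, here is a minimal sketch of my own, not anything taken from the article. It assumes a grayscale image stored as a list of rows, a pyramid built by 2x2 averaging, and a plain intensity threshold standing in for a real classifier. The names build_pyramid, coarse_to_fine and best_guess are hypothetical, and the "budget" counted in pyramid levels is only a stand-in for an actual real-time deadline.

```python
from typing import Iterator, List, Tuple

Image = List[List[float]]
Labels = List[List[int]]


def downsample(img: Image) -> Image:
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def build_pyramid(img: Image, levels: int) -> List[Image]:
    """Return images ordered coarse to fine; the last entry is the input."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid[::-1]


def classify(img: Image, threshold: float = 0.5) -> Labels:
    """Placeholder per-pixel classifier (assumption): intensity threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in img]


def coarse_to_fine(img: Image, levels: int) -> Iterator[Tuple[int, Labels]]:
    """Yield (level, labels) pairs, coarsest resolution first."""
    for level, scaled in enumerate(build_pyramid(img, levels)):
        yield level, classify(scaled)


def best_guess(img: Image, levels: int, budget: int) -> Tuple[int, Labels]:
    """Return the last result completed within `budget` levels (budget >= 1)."""
    result = None
    for steps, result in enumerate(coarse_to_fine(img, levels), start=1):
        if steps >= budget:
            break
    return result
```

Because coarse_to_fine is a generator, a caller with a real clock could stop pulling results when its deadline passes and keep whatever level was finished last. Whether the coarse answers are good enough to be useful is exactly the open question.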
A Design Tool for Camera-based Interaction, Jerry Alan Fails and Dan R. Olsen, CHI 2003: ACM Conference on Human Factors in Computing Systems, pp. 449-56
Crayon is a computer vision tool for building color-based classifiers for object tracking applications. It demonstrates the kind of productivity gains that are possible when HCI principles of iterative design and rapid prototyping are injected into a domain that was previously purely technically oriented.
I do not buy the authors' argument about most ML algorithms being completely impractical for real-time interaction. Their conclusion is based solely on their particular choice of performing per-pixel classifications with a large feature vector. Building more knowledge about potentially useful features into the system a priori -- instead of learning appropriate filters/kernels on the fly for every image -- could lead to dramatic shifts in performance of other methods. Crayon uses R,G,B,H,S,V values per pixel as the fundamental features - note that these are six features for only three independent dimensions. When expanded over the image regions, a LOT of redundant information is stored in each classifier.
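The "six features, three dimensions" point can be checked with a few lines of Python. This is only an illustrative sketch using the standard colorsys module, with channel values assumed to lie in [0, 1]; it is not Crayon's code, and the helper names are made up.

```python
import colorsys
from typing import Tuple

Features = Tuple[float, float, float, float, float, float]


def six_features(r: float, g: float, b: float) -> Features:
    """Per-pixel feature vector (R, G, B, H, S, V), all in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (r, g, b, h, s, v)


def rgb_from_hsv_part(features: Features) -> Tuple[float, float, float]:
    """Recover R, G, B from the H, S, V entries alone."""
    _, _, _, h, s, v = features
    return colorsys.hsv_to_rgb(h, s, v)


features = six_features(0.2, 0.4, 0.6)
recovered = rgb_from_hsv_part(features)
# Up to floating-point rounding, `recovered` equals the original (R, G, B),
# so the last three features carry no information beyond the first three.
```

H, S and V are a deterministic re-encoding of R, G and B, which is the redundancy meant above.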
Decision trees frequently suffer from overfitting - which becomes an issue if training data sets don't accurately reflect testing situations. It seems that the Crayon approach would be most accurate if the end-user builds her own classifier in her actual application setting, instead of the UI designer building a classifier in a potentially very different lab environment. I would have liked to see a running example of how the classification step fits into a complete vision-based application.
The general painting metaphor seems to be quite similar to Adobe Photoshop's "Extract" function. Maybe a rough boundary painting approach followed by region filling would be more successful than only looking at the pixels underneath the user's crayon trace.
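A rough sketch of what I mean by boundary painting followed by region filling is below. It is my own illustration and makes no claim about how Crayon or Photoshop work internally. It assumes the user's stroke is given as a boolean mask and that a single seed pixel inside the object is known; fill_inside is a hypothetical name, and the fill is an ordinary 4-connected flood fill.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def fill_inside(boundary: List[List[bool]], seed: Pixel) -> Set[Pixel]:
    """Collect all pixels 4-connected to `seed` that are not on the stroke."""
    rows, cols = len(boundary), len(boundary[0])
    filled: Set[Pixel] = set()
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < rows and 0 <= c < cols):
            continue  # outside the image
        if boundary[r][c] or (r, c) in filled:
            continue  # painted boundary pixel, or already visited
        filled.add((r, c))
        queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return filled
```

Every pixel in the returned set could then serve as a positive training example, instead of only the pixels directly under the stroke. A gap in the painted boundary would let the fill leak out, so a real tool would need some tolerance for sloppy strokes.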