\documentclass[11pt]{article}
\setlength{\oddsidemargin}{0.0truein}
\setlength{\evensidemargin}{0.0truein}
\setlength{\textwidth}{6.5truein}
\setlength{\topmargin}{0.0truein}
\setlength{\textheight}{9.0truein}
\setlength{\headsep}{0.0truein}
\setlength{\headheight}{0.0truein}
\setlength{\topskip}{10.0pt}
\setlength{\parskip}{5mm}
\usepackage{url}
\usepackage{amsmath}
\usepackage{amssymb}
\pagestyle{empty}

\begin{document}

\begin{center}
\textbf{\Large{\textsc{STANFORD UNIVERSITY}}}\\[5pt]
\textbf{\Large{\textsc{DEPARTMENT OF STATISTICS}}}\\[5pt]
\Large{\textsc{DEPARTMENTAL SEMINAR}}
\end{center}

\begin{center}
4:15 p.m., Tuesday, March 13, 2007\\
Sequoia Hall Room 200\\
(Cookies at 3:45 in 1st Floor Lounge)
\end{center}

\begin{center}
\textsl{David G.~Stork} \\
Chief Scientist, Ricoh Innovations \\
Visiting Lecturer, Department of Statistics, Stanford University
\end{center}

\begin{center}
\subsection*{Toward a statistical theory of data acquisition}
\end{center}

\noindent There are deep theoretical justifications and compelling experimental verification that there is no \lq\lq best\rq\rq\ general method for statistical pattern classification and, further, that classifiers perform better the larger their training sets. Even simple unbiased classifiers, when trained with sufficiently large training sets, can outperform more sophisticated classifiers. These facts imply that the most promising avenues for research in pattern classification are no longer in developing refinements to general classification methods themselves, but rather in developing novel, efficient, and accurate methods for collecting, labelling and \lq\lq truthing\rq\rq\ large data sets for training simple, scalable classifiers.

\noindent This talk will review the foundations of a statistical theory of data acquisition, including problems such as acquiring data under cost constraints, estimating the accuracy and reliability of contributors, organizing self-policing among data contributors, and identifying \lq\lq malicious\rq\rq\ contributors. It will describe these and other challenges and opportunities associated with novel methods of data acquisition, such as the Open Mind Initiative, in which non-experts openly contribute data over the internet. The talk will explore the relationship of this nascent theory of data acquisition to polling theory, experimental design, and interactive learning, and it will conclude by describing a number of open research problems.

\medskip
\centerline{Joint work with Chuck Lam}

\noindent {\bf David G.~Stork} is Chief Scientist of Ricoh Innovations and Visiting Lecturer in Statistics at Stanford University, where he will teach {\bf Stat 328}, \lq\lq Statistical theory of data acquisition,\rq\rq\ this spring quarter. He has held academic posts in eight different disciplines, served on five editorial boards, and holds 35 patents. His roughly 120 scholarly publications are in theoretical mechanics, human visual perception, pattern classification, machine learning, computer lipreading, theory of concurrency, optical design, and image processing. Most recently he has pioneered and lectured widely on the use of computer vision methods for the analysis of Renaissance and Baroque master paintings. His five books include {\bf Pattern Classification} (2nd ed.) with R.~Duda and P.~Hart, and {\bf HAL\rq s Legacy}, the companion to his PBS television documentary about the famous computer in {\em 2001: A Space Odyssey}.

\end{document}