Recording: Videos of most talks are available on YouTube.
The talks by Chris Burges and Richard Socher were not recorded due to a technical problem with Weyond, the recording company.


  • 7.30-7.40 Introduction.
  • 7.40-8.20 Learning Natural Language from its Perceptual Context. Raymond Mooney.
  • 8.20-9.00 Learning Dependency-Based Compositional Semantics. Percy Liang.
  • 9.00-9.10 Coffee break.
  • 9.10-9.50 How to Recognize Everything? Derek Hoiem. Pascal2 invited talk.
  • 9.50-10.10 Learning What Is Where from Unlabeled Images. A. Chandrashekar and L. Torresani. (canceled)
  • 9.50-10.10 Poster spotlights.
  • 10.10-10.30 Posters and group discussions.
  • 10.30-16.00 Break.
  • 16.00-16.40 From Machine Learning to Machine Reasoning. Léon Bottou.
  • 16.40-17.20 Towards More Human-like Machine Learning of Word Meanings. Josh Tenenbaum.
  • 17.20-17.40 Learning Semantics of Movement. Timo Honkela et al.
  • 17.40-17.50 Coffee break.
  • 17.50-18.30 Towards Extracting Meaning from Text, and an Autoencoder for Sentences. Chris Burges.
  • 18.30-19.10 Recursive Deep Learning in Natural Language Processing and Computer Vision. Richard Socher.
  • 19.10-20.00 Posters and group discussions.

Invited talks:

  • From Machine Learning to Machine Reasoning.
    L. Bottou. (Microsoft)
    Abstract: A plausible definition of “reasoning” could be “algebraically manipulating previously acquired knowledge in order to answer a new question”. This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems. For instance, we can build an optical character recognition system by first training a character segmenter, an isolated character recognizer, and a language model, using appropriate labeled training sets. Adequately concatenating these modules and fine-tuning the resulting system can be viewed as an algebraic operation in a space of models. The resulting model answers a new question, that is, converting the image of a text page into computer-readable text. This observation suggests a conceptual continuity between algebraically rich inference systems, such as logical or probabilistic inference, and simple manipulations, such as the mere concatenation of trainable learning systems. Therefore, instead of trying to bridge the gap between machine learning systems and sophisticated “all-purpose” inference mechanisms, we can instead algebraically enrich the set of manipulations applicable to training systems, and build reasoning capabilities from the ground up.
  • How to Recognize Everything?
    D. Hoiem. (UIUC) — Pascal2 Invited Speaker
    Abstract: Our survival depends on recognizing everything around us: how we can act on objects, and how they can act on us. Likewise, intelligent machines must interpret each object within a task context. For example, an automated vehicle needs to correctly respond if suddenly faced with a large boulder, a wandering moose, or a child on a tricycle. Such robust ability requires a broad view of recognition, with many new challenges. Computer vision researchers are accustomed to building algorithms that search through image collections for a target object or category. But how do we make computers that can deal with the world as it comes? How can we build systems that can recognize any animal or vehicle, rather than just a few select basic categories? What can be said about novel objects? How do we approach the problem of learning about many related categories? We have recently begun grappling with these questions, exploring shared representations that facilitate visual learning and prediction for new object categories. In this talk, I will discuss our recent efforts and future challenges to enable broader and more flexible recognition systems.
  • Learning Dependency-Based Compositional Semantics.
    P. Liang. (Stanford)
    Abstract: The semantics of natural language has a highly-structured logical aspect. For example, the meaning of the question “What is the third tallest mountain in a state not bordering California?” involves superlatives, quantification, and negation. In this talk, we develop a new representation of semantics called Dependency-Based Compositional Semantics (DCS) which can represent these complex phenomena in natural language. At the same time, we show that we can treat the DCS structure as a latent variable and learn it automatically from question/answer pairs. This allows us to build a compositional question-answering system that obtains state-of-the-art accuracies despite using less supervision than previous methods. I will conclude the talk with extensions to handle contextual effects in language.
  • Learning Natural Language from its Perceptual Context.
    R. Mooney. (UT at Austin)
    Abstract: Machine learning has become the best approach to building systems that comprehend human language. However, current systems require a great deal of laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we have developed systems that learn to sportscast simulated robot soccer games and to follow navigation instructions in virtual environments by simply observing sample human linguistic behavior. This work builds on our earlier work on supervised learning of semantic parsers that map natural language into a formal meaning representation. In order to apply such methods to learning from observation, we have developed methods that estimate the meaning of sentences from just their ambiguous perceptual context.
  • Recursive Deep Learning in Natural Language Processing and Computer Vision.
    R. Socher. (Stanford)
    Abstract: Hierarchical and recursive structure is commonly found in different modalities, including natural language sentences and scene images. I will present some of our recent work on three recursive neural network architectures that learn meaning representations for such hierarchical structure. These models obtain state-of-the-art performance on several language and vision tasks. The meaning of phrases and sentences is determined by the meanings of their words and the rules of compositionality. We introduce a recursive neural network (RNN) for syntactic parsing which can learn vector representations that capture both syntactic and semantic information of phrases and sentences. Our RNN can also be used to find hierarchical structure in complex scene images. It obtains state-of-the-art performance for semantic scene segmentation on the Stanford Background and the MSRC datasets and outperforms Gist descriptors for scene classification by 4%. The ability to identify sentiments about personal experiences, products, movies, etc., is crucial for understanding user-generated content in social networks, blogs, or product reviews. The second architecture I will talk about is based on recursive autoencoders (RAE). RAEs learn vector representations for phrases well enough to outperform other traditional supervised sentiment classification methods on several standard datasets. We also show that without supervision RAEs can learn features which outperform previous approaches for paraphrase detection on the Microsoft Research Paraphrase corpus. This talk presents joint work with Andrew Ng and Chris Manning.
  • Towards Extracting Meaning from Text, and an Autoencoder for Sentences.
    C. Burges. (Microsoft)
    Abstract: I will begin with a brief overview of some of the projects underway at Microsoft Research Redmond that are aimed at extracting meaning from text. I will then describe a data set that we are making available and which we hope will be useful to researchers who are interested in semantic modeling. The data is composed of sentences, each of which has several variations: in each variation, one of the words has been replaced by one of several alternatives, in such a way that the low-order statistics are preserved, but where a human can determine that the meaning of the new sentence is compromised (the “sentence completion” task). Finally, I will describe an autoencoder for sentence data. The autoencoder learns vector representations of the words in the lexicon and maps sentences to fixed-length vectors. I’ll describe several possible applications of this work, show some early results on learning Wikipedia sentences, and end with some speculative ideas on how such a system might be leveraged in the quest to model meaning.
  • Towards More Human-like Machine Learning of Word Meanings.
    J. Tenenbaum. (MIT)
    Abstract: How can we build machines that learn the meanings of words more like the way that human children do? I will talk about several challenges and how we are beginning to address them using sophisticated probabilistic models. Children can learn words from minimal data, often just one or a few positive examples (one-shot learning). Children learn to learn: they acquire powerful inductive biases for new word meanings in the course of learning their first words. Children can learn words for abstract concepts or types of concepts that have little or no direct perceptual correlate. Children’s language can be highly context-sensitive, with parameters of word meaning that must be computed anew for each context rather than simply stored. Children learn function words: words whose meanings are expressed purely in how they compose with the meanings of other words. Children learn whole systems of words together, in mutually constraining ways, such as color terms, number words, or spatial prepositions. Children learn word meanings that not only describe the world but can be used for reasoning, including causal and counterfactual reasoning. Bayesian learning defined over appropriately structured representations — hierarchical probabilistic models, generative process models, and compositional probabilistic languages — provides a basis for beginning to address these challenges.
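Bottou's view of module concatenation as an algebraic operation in a space of models can be sketched in a few lines. Everything below (the `compose` helper and the stand-in OCR modules) is an invented illustration, not code from the talk: separately "trained" modules become values that a composition operator turns into a new model answering a new question.

```python
# A minimal sketch of concatenating trainable modules (all names invented).
# Each module stands in for a separately trained component of an OCR system.

def compose(*modules):
    """Concatenate modules into a single model: the output of each
    module is fed as input to the next one."""
    def pipeline(x):
        for module in modules:
            x = module(x)
        return x
    return pipeline

# Stand-ins for modules trained on their own labeled data.
segment = lambda page: page.split()                # page "image" -> token images
recognize = lambda tokens: [t.upper() for t in tokens]  # token images -> symbols
render = lambda symbols: " ".join(symbols)         # symbols -> readable text

# The concatenation answers a new question: page image -> computer-readable text.
ocr = compose(segment, recognize, render)
print(ocr("read this page"))  # -> "READ THIS PAGE"
```

In Bottou's framing, fine-tuning the concatenated pipeline end-to-end is a further operation on the same space of models.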
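The core idea in Liang's talk — treating the logical form as a latent variable and learning it from question/answer pairs alone — can be caricatured as filtering candidate forms by whether their execution yields the observed answer. The database, candidate forms, and names below are toy inventions for illustration, far simpler than DCS itself.

```python
# Toy sketch of learning from denotations (all data and names invented):
# candidate latent forms are executed against a database, and only forms
# whose answer matches the observed one survive.

database = {"mountains": {"whitney": 4421, "elbert": 4401, "rainier": 4392}}

candidates = {
    "argmax_height": lambda db: max(db["mountains"], key=db["mountains"].get),
    "argmin_height": lambda db: min(db["mountains"], key=db["mountains"].get),
}

def consistent_forms(question_answer_pairs, candidates, db):
    """Keep the latent forms whose execution yields each observed answer."""
    keep = set(candidates)
    for _question, answer in question_answer_pairs:
        keep = {name for name in keep if candidates[name](db) == answer}
    return keep

pairs = [("What is the tallest mountain?", "whitney")]
print(consistent_forms(pairs, candidates, database))  # -> {'argmax_height'}
```

DCS replaces this brute-force filter with a structured space of dependency-like trees and a statistical model over them, but the supervision signal is the same: answers, not annotated logical forms.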
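The compositional step at the heart of Socher's RNN is easy to sketch: a parent vector is computed from its two children as p = tanh(W[c1; c2] + b), applied bottom-up along a parse tree. The weights, word vectors, and fixed tree below are stand-ins, not trained parameters from the paper.

```python
import math

# Sketch of recursive composition p = tanh(W [c1; c2] + b).
# W, b, and the word vectors are arbitrary stand-ins (not trained values).

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def compose(W, b, c1, c2):
    """Parent vector from two child vectors of dimension d each."""
    concat = c1 + c2                       # [c1; c2], length 2d
    pre = [m + bi for m, bi in zip(matvec(W, concat), b)]
    return [math.tanh(x) for x in pre]

d = 3
W = [[0.1] * (2 * d) for _ in range(d)]    # d x 2d weight matrix
b = [0.0] * d
the, cat, sat = [0.5] * d, [-0.2] * d, [0.3] * d

# Fixed parse tree ((the cat) sat), composed bottom-up:
noun_phrase = compose(W, b, the, cat)
sentence = compose(W, b, noun_phrase, sat)
print(len(sentence))  # the parent has the same dimension d as its children
```

Because the parent lives in the same space as its children, the same W can be applied recursively at every node, which is what lets one network handle trees of arbitrary shape, whether over words or over image regions.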
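One ingredient behind the one-shot word learning Tenenbaum describes is the Bayesian "size principle": among hypotheses consistent with the examples, smaller extensions receive sharply higher likelihood as examples accumulate. The hypothesis space below is an invented toy, not from the talk.

```python
# Sketch of Bayesian word learning with the size principle
# (hypothesis space and words are invented for illustration).

hypotheses = {
    "dalmatians": {"dalmatian"},
    "dogs": {"dalmatian", "terrier", "poodle"},
    "animals": {"dalmatian", "terrier", "poodle", "cat", "moose"},
}

def posterior(examples, hypotheses):
    """P(h | examples) with a uniform prior and likelihood (1/|h|)^n
    for consistent hypotheses, 0 otherwise."""
    scores = {}
    for name, extension in hypotheses.items():
        if all(x in extension for x in examples):
            scores[name] = (1.0 / len(extension)) ** len(examples)
        else:
            scores[name] = 0.0
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

one_shot = posterior(["dalmatian"], hypotheses)
three_shot = posterior(["dalmatian"] * 3, hypotheses)
print(one_shot)    # one example is ambiguous across the nested hypotheses
print(three_shot)  # three examples make the narrowest hypothesis dominate
```

Hierarchical versions of this idea, where the prior over hypotheses is itself learned, are one way to model the "learning to learn" of inductive biases mentioned in the abstract.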

Contributed talks:

  • Learning Semantics of Movement. (pdf)
    T. Honkela, O. Kohonen, J. Laaksonen, K. Lagus, K. Förger, M. Sjöberg, T. Takala, H. Valpola & P. Wagner. (Aalto University)
    Abstract: In this presentation, we consider how to computationally model the interrelated processes of understanding natural language and perceiving and producing movement in multimodal real-world contexts. Movement is the specific focus of this presentation for several reasons. For instance, it is a fundamental part of human activities that ground our understanding of the world. We are developing methods and technologies to automatically associate human movements, detected by motion capture and in video sequences, with their linguistic descriptions. Once the association between human movements and their linguistic descriptions has been learned using pattern recognition and statistical machine learning methods, the system can also be used to produce animations from written instructions and to label motion capture and video sequences. We consider three different aspects: using video and motion tracking data, applying multi-task learning methods, and framing the problem within cognitive linguistics research.
  • Learning What Is Where from Unlabeled Images. (pdf)
    A. Chandrashekar & L. Torresani. (Dartmouth College)
    Abstract: “What does it mean, to see? The plain man’s answer would be, to know what is where by looking.” This famous quote by David Marr sums up the holy grail of vision: discovering what is present in the world, and where it is, from unlabeled images. To tackle this challenging problem we propose a generative model of object formation and present an efficient algorithm to automatically learn the parameters of the model from a collection of unlabeled images. Our algorithm discovers the objects and their spatial extents by clustering together images containing similar foregrounds. Unlike prior work, our approach does not rely on brittle low-level segmentation methods applied as a first step before the clustering. Instead, it simultaneously solves for the image clusters, the foreground appearance models and the spatial subwindows containing the objects by optimizing a single likelihood function defined over the entire image collection.
  • Action Induced Phrase Semantics. (pdf)
    M. Szummer. (Microsoft)
  • Can Semantics Facilitate Language Learning? (pdf)
    D. Angluin & L. Becerra-Bonache. (Yale University & Université Jean Monnet)
  • Exploiting Context-based Information for Scene Categorization. (pdf)
    G. Mesnil, S. Rifai, X. Glorot, P. Vincent, Y. Bengio & A. Bordes. (Université de Montréal & CNRS — UTC)
  • Learning Semantics for Automated Reasoning. (pdf)
    D. Kühlwein, E. Tsivtsivadze, J. Urban & T. Heskes. (Radboud University Nijmegen)
  • Learning to Recover Meaning from Unannotated Conversational Interactions. (pdf)
    Y. Artzi & L. Zettlemoyer. (University of Washington)
  • Learning Taxonomies from Multi-relational Data via Hierarchical Link-Based Clustering. (pdf)
    M. Nickel & V. Tresp. (Ludwig-Maximilians-University Munich & Siemens AG)
  • Meaning Representations in Statistical Word Alignment. (pdf)
    T. Okita. (Dublin City University)
  • Shared Components Topic Models with Application to Selectional Preference. (pdf)
    M. Gormley, M. Dredze, B. Van Durme & J. Eisner. (Johns Hopkins University)
  • Towards Interactive Relational Reinforcement Learning of Concepts. (pdf)
    M. Nickles & A. Rettinger. (Technical University of Munich & Karlsruhe Institute of Technology)
  • Towards Natural Instruction based Learning. (pdf)
    D. Goldwasser & D. Roth. (UIUC)