Tag Archives: Computational/Corpus
7/14 How the Brain Accommodates Variability in Linguistic Representations

July 14, 2013
Aud C, Angell Hall

Organizer Contact: T. Florian Jaeger (fjaeger@bcs.rochester.edu)




Computational Modeling of Sound Change

James Kirby – University of Edinburgh
Morgan Sonderegger – McGill University
Course time: Tuesday/Thursday 3:30-5:20 pm
2347 Mason Hall


Decades of empirical research have led to an increasingly nuanced picture of the nature of phonetic and phonological change, incorporating insights from speech production and perception, cognitive biases, and social factors. However, there remains a significant gap between observed patterns and proposed mechanisms, in part due to the difficulty of conducting the type of controlled studies necessary to test hypotheses about historical change. Computational and mathematical models provide an alternative means by which such hypotheses can be fruitfully explored. With an eye towards Box’s dictum (all models are wrong, but some are useful), this course asks: how can computational models be useful for understanding why phonetic and phonological change occurs? Students will study the growing and varied literature on computational and mathematical modeling of sound change that has emerged over the past decade and a half, including models of phonetic change in individuals over the lifespan, phonological change in speech communities in historical time, and lexical diffusion. Discussion topics will include the strengths and weaknesses of different approaches (e.g. simulation-based vs. mathematical models); identifying which modeling frameworks are best suited for particular types of research questions; and methodological considerations in modeling phonetic and phonological change. For this course, some background in probability theory, single-variable calculus, and/or linear algebra is helpful but not required.
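
As a minimal illustration of the simulation-based approach the course description mentions, the sketch below (all names, parameter values, and the bias term are illustrative assumptions, not drawn from the course materials) models transmission of a single phonetic category across generations: each learner's target is the mean of the previous generation's noisy, slightly biased productions, so the category mean drifts over historical time.

```python
import random

def simulate_chain(generations=50, tokens=100, start=500.0,
                   bias=-2.0, noise=20.0, seed=1):
    """Iterated transmission of one phonetic category (e.g. an F1 target).

    Each generation's target is the mean of the previous generation's
    productions; every production adds a constant articulatory bias plus
    Gaussian noise, so the category drifts steadily over time.
    """
    rng = random.Random(seed)
    target = start
    trajectory = [target]
    for _ in range(generations):
        productions = [target + bias + rng.gauss(0, noise)
                       for _ in range(tokens)]
        target = sum(productions) / len(productions)  # learner's estimate
        trajectory.append(target)
    return trajectory

traj = simulate_chain()
# with a negative bias, the category mean falls across generations
```

Varying the bias, noise, and sample size shows how mechanism-level assumptions translate into different historical trajectories, which is exactly the kind of question such models are used to probe.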



Computational Psycholinguistics

John Hale – Cornell University
Lars Konieczny – University of Freiburg
Course time: Monday/Wednesday 9:00-10:50 am
2330 Mason Hall


This course examines cognitive models of human sentence comprehension. Such models are programs that express psycholinguistic theories of how people unconsciously put together words and phrases in order to make sense of what they hear (or read). They hold out the promise of rigorously connecting behavioral measurements to broader theories, for instance theories of natural language syntax or cognitive architecture. The course brings students up to speed on the role of computer models in cognitive science generally, and situates the topic in relation to neighboring fields such as psychology and generative grammar. Students will master several different viewpoints on what it might mean to “attach” a piece of phrase structure, and will become familiar with notions of experience, probability, and information theory as candidate explanations of human sentence processing difficulty. This course has no prerequisites, although exposure to artificial intelligence, generative grammar, and cognitive psychology will help deepen the experience.
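
One such information-theoretic notion is surprisal: processing difficulty at a word is predicted to track how unexpected the word is in its context. A toy sketch of the idea (the miniature corpus and the add-one smoothing scheme are illustrative assumptions):

```python
import math
from collections import Counter

# A toy "corpus" standing in for real training data.
corpus = "the dog chased the cat . the cat saw the dog .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(set(corpus))

def surprisal(prev, word):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

# The attested continuation "the dog" should be less surprising
# (and so predicted easier to process) than the unattested "the saw".
```

In surprisal-based theories, reading times and other behavioral measures are modeled as increasing with this quantity, computed from a probabilistic grammar rather than a toy bigram table.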



Corpus-based Linguistic Research: From Phonetics to Pragmatics

Mark Liberman – University of Pennsylvania
Course time: Monday/Wednesday 1:30-3:20 pm
Aud C



Course website: http://languagelog.ldc.upenn.edu/myl/lsa2013/

Big, fast, cheap computers; ubiquitous digital networks; huge and growing archives of text and speech; good and improving algorithms for automatic analysis of text and speech: all of this creates a cornucopia of research opportunities at every level of linguistic analysis, from phonetics to pragmatics. This course will survey the history and prospects of corpus-based research on speech, language, and communication, in the context of class participation in a series of representative projects. Programming ability, though helpful, is not required.

This course will cover:

* How to find or create resources for empirical research in linguistics
* How to turn abstract issues in linguistic theory into concrete questions about linguistic data
* Problems of task definition and inter-annotator agreement
* Exploratory data analysis versus hypothesis testing
* Programs and programming: practical methods for searching, classifying, counting, and measuring
* A survey of relevant machine-learning algorithms and applications
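
The inter-annotator agreement problem mentioned above can be made concrete with Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. A sketch with invented part-of-speech labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # chance agreement from each annotator's marginal label distribution
    chance = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / (n * n)
    return (observed - chance) / (1 - chance)

# Two hypothetical annotators labeling the same eight tokens:
ann1 = ["N", "V", "N", "N", "ADJ", "V", "N", "V"]
ann2 = ["N", "V", "N", "V", "ADJ", "V", "N", "N"]
kappa = cohens_kappa(ann1, ann2)  # 0 = chance-level, 1 = perfect agreement
```

Raw agreement here is 6/8, but kappa is noticeably lower because two annotators using similar label distributions would agree fairly often by chance alone.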

We will explore these topics through a series of empirical research exercises, some planned in advance and some developed in response to the interests of participants.

There will be some connections to the ICPSR Summer Program in Quantitative Methods of Social Research: http://www.icpsr.umich.edu/icpsrweb/sumprog/



Introduction to Computational Linguistics

Jason Eisner – Johns Hopkins University
Course time: Tuesday/Thursday 1:30-3:20 pm AND Friday, June 28 1:00-5:00 pm
1401 Mason Hall


This class presents fundamental methods of computational linguistics. We will develop probabilistic models to describe what structures are likely in a language. After estimating the parameters of such models, it is possible to recover underlying structure from surface observations. We will examine algorithms to accomplish these tasks.

Specifically, we will focus on modeling
  • trees (via probabilistic context-free grammars and their relatives)
  • sequences (via n-gram models, hidden Markov models, and other probabilistic finite-state processes)
  • bags of words (via topic models)
  • lexicons (via hierarchical generative models)
We will also survey a range of current tasks in applied natural language processing. Many of these tasks can be addressed with techniques from the class.
Some previous exposure to probability and programming may be helpful. However, probabilistic modeling techniques will be carefully introduced, and programming expertise will not be required. We will use a very high-level language (Dyna) to describe algorithms and visualize their execution.
Useful related courses include Machine Learning, Python 3 for Linguists, Corpus-based Linguistic Research, and Computational Psycholinguistics.
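
As a taste of how estimated parameters let us recover hidden structure from surface observations, here is a sketch of the Viterbi algorithm for a toy hidden Markov model tagger; the states, probability tables, and three-word input are invented for illustration, not taken from the course:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for `obs`."""
    # Map each state to (best probability, best path ending in that state).
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max((best[prev][0] * trans_p[prev][s] * emit_p[s][o],
                        best[prev][1] + [s]) for prev in states)
                for s in states}
    return max(best.values())[1]

# Toy tagger: determiner (D), noun (N), verb (V); all numbers invented.
states = ("D", "N", "V")
start = {"D": 0.8, "N": 0.15, "V": 0.05}
trans = {"D": {"D": 0.05, "N": 0.9, "V": 0.05},
         "N": {"D": 0.1, "N": 0.3, "V": 0.6},
         "V": {"D": 0.4, "N": 0.4, "V": 0.2}}
emit = {"D": {"the": 0.9, "dog": 0.05, "barks": 0.05},
        "N": {"the": 0.05, "dog": 0.8, "barks": 0.15},
        "V": {"the": 0.05, "dog": 0.15, "barks": 0.8}}
tags = viterbi(["the", "dog", "barks"], states, start, trans, emit)
# should recover the sequence D, N, V
```

The same dynamic-programming idea generalizes from these finite-state sequence models to the tree models (probabilistic context-free grammars) listed above.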



Lexicography in Natural Language Processing

Orin Hargraves – Independent Scholar
Course time: Tuesday/Thursday 9:00-10:50 am
2325 Mason Hall


Determining what words mean is the core skill and practice of lexicography. Determining what words mean is also a central challenge in natural language processing (NLP), where it is usually classed under the exercise of word sense disambiguation (WSD). Until the late 20th century, lexicography was dominated by scholars with backgrounds in philosophy, literature, and other humanistic disciplines, and the writing of dictionaries was based strongly on intuition, and only secondarily on induction from the study of examples of usage. Linguistics, in this same period, established itself as a discipline with strong scientific credentials. With the development of corpora and other computational tools for processing text, dictionary makers recognized first the value, and soon the indispensability, of corpus evidence in developing dictionary definitions, and this brought them increasingly into contact with computational linguists. The developers of computational linguistic tools and resources eventually turned their attention back to the dictionary and found that it was a document that could be exploited for use in the newly emerging fields of inquiry that computation made possible: NLP, artificial intelligence, machine learning, and machine translation. This course will explore the computational tools that lexicographers use today to write dictionaries, and the ways in which computational linguists use dictionaries in their pursuits. The aim is to give students an appreciation of the unexploited opportunities that dictionary databases offer to NLP, and of the challenges that stand in the way of their exploitation. Students will have an opportunity to explore the ways in which dictionaries may aid or hinder automatic WSD, and they will be encouraged to develop their own models for the use of dictionary databases in NLP.
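
The kernel of dictionary-based WSD can be sketched with the simplified Lesk algorithm: choose the sense whose gloss shares the most words with the surrounding context. The glosses below are invented toy entries, not from any real dictionary, and real systems filter stopwords and use far richer overlap measures:

```python
def lesk(context_words, senses):
    """Simplified Lesk: return the sense whose gloss has the largest
    word overlap with the context."""
    context = {w.lower() for w in context_words}
    return max(senses,
               key=lambda s: len(set(senses[s].lower().split()) & context))

# Invented glosses for two senses of "bank":
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a body of water such as a river",
}
context = "she sat on the grassy bank of the river".split()
choice = lesk(context, senses)
```

The example also hints at why dictionaries can hinder WSD: overlap is dominated by function words and depends entirely on how the lexicographer happened to phrase each gloss.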

Students must have native-speaker fluency in English. A thorough knowledge of English grammar and morphology is an advantage, as is knowledge of the rudiments of NLP.



Machine Learning

Steve Abney – University of Michigan
Course time: Monday/Wednesday 11:00 am – 12:50 pm
1401 Mason Hall


This course provides a general introduction to machine learning. Unlike results in learnability, which are very abstract and have limited practical consequences, machine learning methods are eminently practical, and provide detailed understanding of the space of possibilities for human language learning.

Machine learning has come to dominate the field of computational linguistics: virtually every problem of language processing is treated as a learning problem.  Machine learning is also making inroads into mainstream linguistics, particularly in the area of phonology. Stochastic Optimality Theory and the use of maximum entropy models for phonotactics may be cited as two examples.

The course will focus on giving a general understanding of how machine learning methods work, in a way that is accessible to linguistics students. There will be some discussion of software, but the focus will be on understanding what the software is doing, not on the details of using a particular package.

The topics to be touched on include classification methods (Naive Bayes, the perceptron, support vector machines, boosting, decision trees, maximum entropy classifiers); clustering (hierarchical clustering, k-means clustering, the EM algorithm, latent semantic indexing); sequential models (hidden Markov models, conditional random fields); grammatical inference (probabilistic context-free grammars, distributional learning); semisupervised learning (self-training, co-training, spectral methods); and reinforcement learning.
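
To give a flavor of the classification methods listed, here is a multinomial Naive Bayes sketch with add-one smoothing; the tiny labeled "documents" are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes with add-one smoothing.
    docs: list of (label, word_list) pairs; returns a classify function."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        def log_score(label):
            total = sum(word_counts[label].values())
            score = math.log(class_counts[label] / len(docs))  # prior
            for w in words:  # smoothed per-word likelihoods
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
            return score
        return max(class_counts, key=log_score)

    return classify

# Invented toy "documents" about phonology vs. syntax:
docs = [("phon", ["vowel", "stress", "syllable"]),
        ("phon", ["vowel", "tone", "syllable"]),
        ("syn", ["clause", "phrase", "verb"]),
        ("syn", ["verb", "phrase", "agreement"])]
classify = train_nb(docs)
```

Despite its unrealistic independence assumption, this model is a standard baseline for the "language processing as learning problem" framing the description mentions.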



Praat Scripting

Kevin McGowan – Rice University
Course time:
Tuesday/Thursday 11:00 am – 12:50 pm, MLB OR
Monday/Wednesday 1:30 pm – 3:20 pm, 2353 Mason Hall


This course introduces basic automation and scripting skills for linguists using Praat. It explores how scripting can help you automate mundane tasks, ensure consistency in your analyses, and provide implicit (and richly detailed) methodological documentation of your research. Our main goals will be:

    1.  To expand upon a basic familiarity with Praat by exploring the software’s capabilities and learning the details of its scripting language.

    2.  To learn a set of best practices that will help you not only write and maintain your own scripts but also evaluate scripts written by others.

The course assumes participants have read and practiced with the Intro from Praat’s help manual. Topics to be covered include:

    o Working with the Objects, Editor, and Picture windows

    o Finding available commands

    o Creating new commands

    o Working with TextGrids

    o Conditionals, flow control, and error handling

    o Using strings, numbers, formulas, arrays, and tables

    o Automating phonetic analysis

    o Testing, adapting, and using scripts from the internet
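
Automation of the kind listed above often amounts to generating repetitive command sequences over many files. As a heavily simplified sketch, the Python function below assembles a batch Praat script (as a string) that measures F1 for a list of sound files; the command names follow Praat's modern colon syntax, but the analysis parameters shown are placeholder defaults, and a real script would add error handling and TextGrid-based time selection:

```python
def formant_script(wav_files, time=0.5):
    """Assemble a Praat script that opens each file, computes a Formant
    object, and prints F1 at a given time. Parameter values (number of
    formants, ceiling, window length) are illustrative defaults only."""
    lines = []
    for wav in wav_files:
        lines += [
            f'Read from file: "{wav}"',
            'To Formant (burg): 0, 5, 5500, 0.025, 50',
            f'f1 = Get value at time: 1, {time}, "hertz", "Linear"',
            f'appendInfoLine: "{wav}", tab$, f1',
        ]
    return "\n".join(lines)

script = formant_script(["vowel1.wav", "vowel2.wav"])
# save the result as e.g. measure_f1.praat and run it from Praat
```

The point is less the specific measurement than the workflow: one generated, inspectable script documents exactly how every token in a study was measured.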



Python 3 for Linguists

Damir Cavar – Eastern Michigan University
Malgorzata E. Cavar – Eastern Michigan University
Course time: Monday/Wednesday 9:00-10:50 am, MLB OR
Tuesday/Thursday 11:00 am – 12:50 pm, 2347 Mason Hall


This course introduces basic programming and scripting skills to linguists using the Python 3 programming language and common development environments. Our main goals are:

- to offer an entry point to programming and computation for humanities students and anyone else who is interested,

- to do so without requiring any previous programming or IT knowledge beyond everyday computer use.

Over eight sessions, the course covers interaction with the Python programming environment, an introduction to programming, and linguistically relevant text- and data-processing algorithms, including quantitative and statistical analyses as well as qualitative and symbolic methods.

Existing Python code libraries and components will be discussed, and practical usage examples given. The emphasis in this course is on being creative with a programming language, and on content geared towards the specific tasks linguists face, where processing large amounts of data or carrying out time-consuming annotation and data-manipulation work is necessary. Among the tasks we consider essential are:

- reading text and language data from, and writing it to, files in various encodings, using different orthographic systems and standards and corpus encoding formats and technologies (e.g. XML),

- generating and processing word lists, linguistic annotation models, n-gram models, and frequency profiles to study quantitative and qualitative aspects of language, for example variation in language, computational dialectology, and similarity or dissimilarity at different linguistic levels,

- symbolic processing with regular grammars, compiled into finite-state automata, for phonotactics and morphology, as well as context-free grammars and parsers for syntactic analysis and higher-level grammar formalisms, together with the use of these grammars in language processing algorithms.
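
A small example of the word-list and n-gram processing just described (the sentence is a stand-in for real corpus data, and the tokenizer is deliberately naive):

```python
import re
from collections import Counter

def frequency_profile(text, n=2):
    """Word and n-gram frequency counts of the kind used to compare
    language varieties quantitatively."""
    words = re.findall(r"[a-z']+", text.lower())
    # consecutive n-word windows, counted as tuples
    ngrams = Counter(zip(*(words[i:] for i in range(n))))
    return Counter(words), ngrams

text = "the north wind and the sun were disputing which was the stronger"
unigrams, bigrams = frequency_profile(text)
```

Profiles like these, computed over large files rather than one sentence, are the raw material for the dialectological and similarity comparisons mentioned above.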



Quantitative and Computational Phonology

Bruce Hayes – University of California, Los Angeles
Course time: Tuesday/Thursday 9:00-10:50 am
2306 Mason Hall


In the grammar architecture of classical Optimality Theory (Prince and Smolensky 1993), constraints are ranked and the grammar generates exactly one winner per input. Phonologists have proposed instead that we should consider models in which the constraints, rather than being ranked, bear weights (real numbers, intuitively related to constraint strength). Weights are employed to calculate probabilities for all members of the candidate set.
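
The weights-to-probabilities calculation can be sketched as in Maximum Entropy grammar, where each candidate's probability is proportional to the exponential of its negative harmony, the weighted sum of its constraint violations. The constraint names, weights, and violation counts below form an invented toy tableau:

```python
import math

def maxent_probs(candidates, weights):
    """P(c) ∝ exp(-H(c)), where harmony H(c) is the weighted sum of
    candidate c's constraint violations (Maximum Entropy grammar)."""
    harmony = {c: sum(weights[k] * v for k, v in viols.items())
               for c, viols in candidates.items()}
    z = sum(math.exp(-h) for h in harmony.values())  # normalizing constant
    return {c: math.exp(-h) / z for c, h in harmony.items()}

# Toy tableau for a final-devoicing input (invented violation counts):
weights = {"*VoicedCoda": 2.0, "Ident(voice)": 1.0}
candidates = {"bed": {"*VoicedCoda": 1, "Ident(voice)": 0},
              "bet": {"*VoicedCoda": 0, "Ident(voice)": 1}}
probs = maxent_probs(candidates, weights)
# the faithful candidate retains some probability rather than losing outright
```

Unlike strict ranking, which would pick a single winner, the weighted grammar assigns the losing candidate a nonzero share, which is what makes free variation and gradient intuitions modelable.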

Such quantitative grammars open up new research possibilities for constraint-based phonology:

(a) Modeling free variation and the multiple factors that shift the statistical distribution of outputs across contexts;

(b) Modeling gradient intuitions (intermediate well-formedness, ambivalence among output choices);

(c) Modeling quantitative lexical patterns and how they are characteristically mimicked in experiments where native speakers are tested on their phonological knowledge;

(d) Modeling phonological learning: even in areas where the ambient language doesn’t vary at all, the child’s conception of what is likely to be the correct grammar will change (approaching certainty) as more data are taken in; modeling can trace this process.

This course will be an introduction to these models and research areas. It will emphasize learning by doing. Participants will use software tools that embody the theories at hand and will examine and model data from a variety of digital corpora. The course will not cover computational phonology per se, but it will cover enough computation to give participants a good understanding of the tools they are using. Prerequisite for this course: a course in phonology.



Social Media as Linguistic Data

John Paolillo – Indiana University
Course time: Tuesday/Thursday 9:00-10:50 am
2347 Mason Hall



The “information age” has brought with it an explosion of new kinds of communication, from electronic mail to discussion forums, chat, weblogs, texting, video sharing and many other hybrid modes. Millions of people participate on a daily basis in these “Social Media”, presenting new opportunities and challenges for linguistic research. Social media often offer readily available data, allowing both the content and context of ordinary communication to be studied as it never has before. At the same time, the scale of the available data, its sometimes uncertain provenance, and the constantly evolving status of the supporting media raise significant challenges for analysis. This course addresses the analysis of language in social media, through systematic exploration of current research literature on social media, focusing especially on the uses of computational techniques for the analysis of both language and context.


Structure and Evolution of the Lexicon

Janet Pierrehumbert – Northwestern University
Course time: Tuesday/Thursday 11:00 am – 12:50 pm
2353 Mason Hall


This class will explore the basic principles that create and sustain the richness of the lexicon in human languages. We will consider how new words are created, how they are learned, and how they are replicated through social interactions in human communities. Empirical data will be drawn from classical sources, from language on the Internet, and from computer-based “games with a purpose”. Using concepts from research on population biology and social dynamics, we will also discuss mathematical approaches to modeling the life and death of words.
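
One classic family of mathematical models of the sort alluded to is a rich-get-richer (Simon-style) process: each new token in a growing corpus is either a freshly coined word or a repeat of a past token sampled in proportion to its past use, which yields the heavy-tailed frequency distributions characteristic of real lexicons. The rates and corpus size below are illustrative:

```python
import random
from collections import Counter

def simon_process(tokens=5000, new_word_rate=0.05, seed=3):
    """Simon's rich-get-richer process: with probability `new_word_rate`
    coin a new word id; otherwise repeat a token drawn uniformly from
    the history, i.e. in proportion to each word's past frequency."""
    rng = random.Random(seed)
    history = [0]          # word ids; word 0 opens the corpus
    next_id = 1
    for _ in range(tokens - 1):
        if rng.random() < new_word_rate:
            history.append(next_id)
            next_id += 1
        else:
            history.append(rng.choice(history))
    return history

counts = Counter(simon_process())
# the resulting frequencies are highly skewed: a few early words dominate,
# while most recently coined words occur only a handful of times
```

Extensions of such birth processes add death (words falling out of use) and social network structure, connecting word survival to the replication dynamics discussed in the course.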



Usage-based Models of First and Second Language Acquisition

Nick Ellis – University of Michigan
Course time: Tuesday/Thursday 1:30-3:20 pm
2336 Mason Hall


This course develops a constructionist approach to First and Second Language Acquisition (L1A, L2A). It presents psycholinguistic and corpus linguistic evidence for L2 constructions and for the inseparability of lexis, grammar, and semantics. It outlines a psycholinguistic theory of language learning following general cognitive principles of category learning, with schematic constructions emerging from usage. It reviews how the following factors jointly determine how a construction is learned: (1) the exemplar frequencies and their Zipfian distribution; (2) the salience of their form; (3) the significance of their functional interpretation; (4) the exemplars’ similarity to the construction prototype; and (5) the reliability of these form-function mappings. It tests these proposals against large corpora of usage and longitudinal corpora of L1 and L2 learner language using statistical and computational modelling. It considers the psychology of transfer and learned attention in L2A in order to understand how L2A differs from L1A in that it involves reconstructing language, with learners’ expectations and attentional biases tuned by experience of their L1. A central theme of the course is that patterns of language usage, structure, acquisition, and change are emergent, and that there is value in viewing Language as a Complex Adaptive System.

Week 1: Constructions, their cognition and acquisition

Week 2: A frequency-informed construction grammar of English usage

Week 3: Construction learning in L1A and L2A longitudinal corpora

Week 4: L2A, learned attention, and transfer and their implications for instruction.

Course Areas: Language Acquisition, Semantics/Pragmatics, Psycholinguistics, Corpus Linguistics, Cognitive Linguistics
