Statistical Natural Language Processing

CIS*6650_01 (Fall 2011)


            Instructor: Fei Song  (Reynolds 215, ext. 58067)


            Office Hours: 3:30 – 4:30 pm on Tuesday and Thursday

            Web Page:




Statistical language modeling is an interdisciplinary area that draws on Probability and Statistics, Information Theory, Linguistics, and Computer Science.  It has been applied successfully to problems such as Speech Recognition, Part-of-Speech Tagging, Named Entity Recognition, and Machine Translation.  Recently, it has also been applied to Biological Data Analysis.  This course provides an introduction to this emerging field, with emphasis on major techniques and their applications.  In addition to attending lectures, students are required to review the current literature and present two papers in class.  They are also required to implement a particular technique and apply it to a real-world problem.  The following is a list of topics we intend to cover in this course:


-          Foundations: basic concepts from Probability and Statistics, Information Theory, Information Retrieval, and Linguistics

-          Basics of Statistical Language Modeling and N-grams

-          Hidden Markov Models

-          Decision Tree Models

-          Maximum Entropy Models

-          Classification and Clustering

-          Sentiment Analysis

-          Topic Modeling

-          Applications in Machine Translation, Information Retrieval, Information Extraction, and Biological Data Analysis




·         Two Assignments: 30% (2 x 15%)

·         Two Presentations: 20% (2 x 10%)

·         Project: 50%


Recommended References


Chris Manning and Hinrich Schütze.  Foundations of Statistical Natural Language Processing.  The MIT Press, 1999.

Daniel Jurafsky and James H. Martin.  Speech and Language Processing.  Second Edition.  Pearson Education, 2008.

Thomas M. Cover and Joy A. Thomas.  Elements of Information Theory.  Wiley, 1991.

Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.  Cambridge University Press, 1998.