Instructor: Yifan Peng (yip4002@med.cornell.edu)
Time: Jan. 21, 2025 - May 6, 2025, 5:00-6:15 pm Eastern Time on Tuesdays and Thursdays
Location: Remote
TA: TBD
Office Hours: TBD
Grading: Letter grade
Course Aims and Outcomes
Natural Language Processing (NLP) is a pivotal technology in artificial intelligence. Its importance in medicine has grown markedly in recent years, as vast amounts of unstructured text await analysis in sources such as Electronic Medical Records, biomedical literature, and clinical trials. By enabling computers to comprehend human-written language, NLP can extract crucial biomedical information from these text resources. Moreover, the advent of technologies like ChatGPT and other Large Language Models (LLMs) holds the promise of transforming research methodologies and clinical practice.
This course aims to provide students with comprehensive knowledge of Natural Language Processing, generative AI, and related health applications. Students will learn about various sources of text data, integral linguistic structures, and a range of processing methods. The course will also offer hands-on programming experience using the Python language and toolkit, equipping students with invaluable skills to handle and manage text data and resolve health-related computational problems.
Format and Procedures
The course progresses through the following topics: regular expressions, text normalization, language models, text classification, sequence labeling, parsing, word vectors, introduction to deep learning, convolutional and recurrent neural networks, and transformer-based methods. Each topic is addressed in a module lasting 1-2 weeks. Alongside these modules, students will complete individual assignments and participate in a team project.
Prerequisites
- Python: Prior exposure to programming and Python is highly recommended. We will provide a tutorial on Python in the first two weeks.
- Basic Probability and Statistics: You should know the basics of probability, mean, standard deviation, etc.
- College Calculus, Linear Algebra: You should understand matrix/vector notation and operations.
Reference Texts
The following texts are useful, but none are required.
- Natural Language Processing in Biomedicine
- Natural Language Processing with Python
- Foundations of Statistical Natural Language Processing
- Speech and Language Processing (3rd ed. draft)
- Natural Language Processing
If you are not very familiar with Python
If you are interested in Deep Learning
Tentative Course Schedule Overview
Date | Topics | Event | Deadline |
---|---|---|---|
1/21 | Course overview | ||
1/23 | Introduction to NLP | ||
1/28 | Regular expression | Assignment 1 | |
1/30 | Lab: Regular expression in Python | ||
2/4 | Text preprocessing | ||
2/6 | Lab: Text preprocessing in Python | ||
2/11 | n-gram | Assignment 2 | Assignment 1 |
2/13 | Text classification | ||
2/18 | No classes | ||
2/20 | Evaluation metrics | Literature review | |
2/25 | Part-of-speech tagging | Assignment 2 | |
2/27 | Parsing | ||
3/4 | Word vector | ||
3/6 | Lab: Word vector | ||
3/11 | Intro to deep learning | Project proposal | |
3/13 | Intro to deep learning - II | Assignment 3 | |
3/18 | CNN | ||
3/20 | RNN | Literature review | |
3/25 | Transformer | ||
3/27 | Fine-tuning techniques | Assignment 3 | |
4/1 | No classes | ||
4/3 | No classes | ||
4/8 | Large Language Model | Assignment 4 | |
4/10 | Lab: BERT | ||
4/15 | Prompt engineering | ||
4/17 | Lab: LLM | ||
4/22 | LLM fine-tuning | Assignment 4 | |
4/24 | LLM evaluation | ||
4/29 | Multimodal large language models | ||
5/1 | AI agent | ||
5/6 | Trustworthy AI | Final project paper | |