Introduction to Natural Language Processing
Lecture 1: Introduction
-
What is meant?
-
Computers process structured data (databases, spreadsheets, etc)
-
Most data is unstructured. Easy for people, hard for machines
-
Studied since the 40s
-
Progress in large part due to large datasets (corpus linguistics) and ML
-
-
Applications
-
Part-of-speech tagging, identifying the grammatical role of each word in a sentence
-
information extraction, identifying named entities (locations, organizations), relations (Lincoln is a city in Nebraska) or times
-
Sentiment analysis, uncovering emotion/feeling in text
-
Question answering, taking natural questions, finding and returning an answer
-
Summarization, retrieving/reporting most important info in a document
-
Machine translation
-
dialogue, engaging in conversations with real users to find preferences/instructions and replying naturally
-
-
Approaches
-
Primarily leveraging ML (uses large training sets to build models)
-
Models used to classify or generate text: discriminative models predict labels, generative models produce output resembling the training data
-
-
Challenges
-
English is pretty tough!
-
Syntactically ambiguous
-
Semantically ambiguous
-
-
Overview
-
Topics introduced in Lecture
-
Homeworks for practice
-
Project is open-ended
-
Topics
-
N Gram models
-
Na\"ive Bayes & Semantic Analysis
-
Logistic Regression
-
Representations/Embeddings
-
RNNs & Attention
-
Information Extraction
-
QA
-
Summarization
-
Machine Translation
-
POS tagging
-
Dialogue Systems
-
-
Lecture 2: $N$-Gram Language Models
-
Many ways to model text
-
Many are probabilistic, modeling probability distributions over sentences
-
Compare probabilities to make decisions
-
-
Models trained from data sets (corpora) by counting occurrences and normalizing
-
Models based on $N$-grams, or word sequences of length $N$
-
Smoothing is used to prevent 0 probabilities
-
-
Overall: models, evaluation, and smoothing
-
Let $W = \langle w_{1}, \ldots, w_{n} \rangle$ be sequence of words
-
Goal is computation of joint probability
-
Related: Compute conditional probability of next word given prior words
-
Probabilistic model allows us to compute one of these
-
Recall the definition of conditional probability (product rule), $P(A,B) = P(A)P(B \mid A)$
-
Chain rule of probability, $P(w_{1}, \ldots, w_{n}) = \prod_{i }P(w_{i} | w_{1}, \ldots w_{i - 1})$
-
Estimation is from counting and dividing (this is the frequentist approach), $C(W)$ counts occurrences in corpus $\mathcal{C}$
-
Problematic for long sequences: a long sequence may never occur in the corpus, giving zero (or undefined) probability estimates
-
Instead, simplify and use bigrams!
-
First order Markov assumption – the probability of the next event depends only on the event that just happened
-
Generalizes to $n$th order assumption for tuples of $n + 1$ words
-
Simplest case is the unigram model, where each word is treated independently; this is a zeroth-order Markov model, and a word's probability is just its count divided by the size of the corpus
-
-
Berkeley Restaurant project has 9332 sentences, 1446 words
-
Many bigrams never occur in the corpus, giving zero counts
-
Probabilities then, can be calculated easily. Note, include start/end tokens
-
Probabilities can eventually cause underflow when multiplying!
-
Instead, add logs of probabilities
-
Since we mostly care about comparing probabilities rather than their exact values, and logs preserve the ordering, this is fine
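-
A minimal sketch of this in Python (toy corpus and helper names are illustrative): count bigrams, form MLE estimates, and score a sentence by summing log probabilities.

import math
from collections import Counter

# Toy corpus; <s> and </s> are the start/end tokens noted above.
corpus = [["<s>", "i", "want", "chinese", "food", "</s>"],
          ["<s>", "i", "want", "english", "food", "</s>"],
          ["<s>", "i", "spent", "money", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    # MLE estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_logprob(sent):
    # Add log probabilities instead of multiplying raw probabilities (avoids underflow)
    return sum(math.log(bigram_prob(sent[i], sent[i + 1]))
               for i in range(len(sent) - 1))

print(sentence_logprob(["<s>", "i", "want", "chinese", "food", "</s>"]))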
-
-
Works as long as $N$ isn't too large
-
But loses long-range dependencies!
-
Mostly an okay thing and long-range dependencies can be captured with RNNs
-
SRILM, KenLM, Google Books $N$-grams
-
-
Evaluation
-
Training set used to compute probabilities
-
Goal is to train such that model generalizes well!
-
Estimate using a separate test set, which is ideally unbiased
-
Performance on this is what we expect to see in practice
-
Don't test on training data!!!!!!!
-
Data is often split around 80/20 for training/test (90/10 also)
-
-
Common way to evaluate on test set
-
Perplexity $PP(W)$ is the inverse probability normalized by the number of words $N$, $PP(W) = \sqrt[N]{1/P(w_{1}, \ldots, w_{N})}$; for bigrams it is similar, with the product of the bigram probabilities in the denominator
-
Minimizing perplexity on test set, maximizes test data's probability
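-
A small sketch computing perplexity from per-word conditional probabilities (the probability list is made up):

import math

def perplexity(word_probs):
    # word_probs: conditional probability of each of the N words in the test text
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)   # equals the N-th root of 1/P(w_1..w_N)

# Example: a 4-word test text whose bigram probabilities are all 0.25
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0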
-
-
Models can also be used to generate text!
-
Generate a random first word $w_{1}$ according to $P(w_{1} \mid \langle s \rangle)$
-
Generate the next word according to the bigram probability given the previous word, and continue until we generate an end token.
-
-
Larger $n$ valued models are more coherent, and it's important to select an appropriate corpus!
-
Valid $n$ grams may not appear in training sets!
-
Smoothing allows prior information to be incorporated into the model before data is used to update it
-
Laplace Smoothing
-
If zero probabilities come from zero counts, then ensure no counts are zero!
-
When counting $n$-grams, initialize each counter to 1 rather than 0 (pseudocounts)
-
Probability estimates are based on modified counts, changing the estimate to the Laplace form $P_{\mathrm{Laplace}}(w_{i}) = \frac{C(w_{i}) + 1}{N + V}$, where $N$ is the size of the corpus and $V$ is the vocabulary size
-
Amount added can be $k \neq 1$ depending on weight of the prior
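-
A sketch of add-$k$ smoothing for unigrams matching the formula above (function name is illustrative):

from collections import Counter

def add_k_unigram_probs(words, k=1.0):
    counts = Counter(words)
    n = len(words)            # corpus size N
    v = len(counts)           # vocabulary size V
    return {w: (c + k) / (n + k * v) for w, c in counts.items()}

probs = add_k_unigram_probs("the cat sat on the mat".split())
print(probs["the"])   # (2 + 1) / (6 + 1 * 5) = 3/11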
-
-
Backoff: use the trigram if its count is non-zero, else fall back to the bigram or lower-order model
-
Interpolation, mixes unigram, bigram and trigram models
-
Fix constant parameters $\lambda_{i}$ to weight each term where $\sum_{i} \lambda_{i} = 1$
-
Or have values depend on context
-
Tune $\lambda$ on independent hold-out set/use cross-validation
-
-
Kneser-Ney Smoothing
-
Most common method
-
Based on absolute discounting
-
Redistributes probability mass from common to less common, simply subtracting $d = 0.75$ works well, and then add interpolation on unigrams.
-
Extends by including a term which models probability that $w$ will be in a novel continuation, how many words will pair with $w$
-
Final formula kinda long (slide 27)
-
-
Stupid Backoff
-
Stupid as it doesn't care about renormalizing to a true probability distribution
-
Back off to lower-order $n$-grams if counts = 0, and don't renormalize since it costs too much time
-
-
Lecture 3: Na\"ive Bayes & Sentiment Analysis
-
Classification is core area of ML & NLP
-
Many tasks, includes spam filtering & sentiment analysis
-
Simple generative model, na\"ive Bayes, used for classification
Discriminative vs generative
-
Discriminative models take an instance as input, and predict a label
-
Real-valued labels are regressions
-
Finite-set labels are classifications
-
Training comes from a training set with true labels (supervised learning)
-
Then searches set of possible models optimizing for some objective (minimizing prediction error, etc.)
-
Many model types
-
Decision trees
-
ANNs (discussed later)
-
SVMs
-
nearest-neighbor
-
-
-
Generative models define probability distributions
-
Generate new instances resembling training data
-
HMMs, Bayesian networks, probabilistic models, na\"ive Bayes
-
Trained via unsupervised learning, that is, without labels; note, if labels are available, a generative model can be built for each class
-
Can be used to discriminate by estimating the probability that a given instance was generated by each class's model
-
Classification
-
discriminative models classify text into finite set of categories
-
Subject classification, spam filtering, author identification
-
Given training set output a classifier $\gamma: \mathcal{D} \rightarrow \mathcal{C}$ predicting labels of new documents
-
Representing each document is important
-
Option: each document is a bag of words with counts associated with them
-
Simple model, loses position information
-
May need to drop some things
-
-
Na\"ive Bayes
-
Recall Bayes Theorem: $P(c|d) = \frac{P(d|c)P(c)}{P(d)}$
-
The probability of the class given the document
-
$P(c)$ is the prior probability of label
-
$P(d)$ is the probability of the document
-
$P(c|d)$ is probability of $c$ given $d$
-
$P(d|c)$ is probability of $d$ given $c$
-
-
Product Rule $P(A \land B) = P(A | B)P(B) = P(B | A)P(A)$
-
Sum Rule $P(A \lor B) = P(A) + P(B) - P(A \land B)$
-
Total probability
-
If events are mutually exclusive, and sum of probabilities = 1,
-
$P(B) = \sum P(B|A_{i})P(A_{i})$
-
-
MAP/Maximum a Posteriori classification
-
Given document, our goal is to find the most likely label
-
Done using argmax
-
$\mathrm{argmax}_{c \in C} P(c|d)$, or, assuming a feature representation, $\mathrm{argmax}_{c \in C} P(x_{1}, \ldots, x_{n} \mid c)P(c)$
-
$P(c)$ is easily estimated, fraction of documents that are class $c$
-
Estimating $P(x_{1}, \ldots, x_{n} \mid c)$ is harder, as the number of feature-value combinations is exponential, and there won't be sufficient data
-
-
Make the Naive Bayes Assumption!
-
$P(x_{1}, x_{2}, \ldots, x_{n} | c) = \prod_{i} P(x_{i}|c)$
-
Assumes, of course, that feature values are conditionally independent given the class
-
Classifier is $c_{NB} = \mathrm{argmax}_{c \in C} P(c)\prod_{i} P(x_{i}\mid c)$
-
The number of values needed is polynomial (rather than exponential)!
-
Standard method, and incredibly practical
-
Works well when you have moderate/large training sets
-
when the attributes that describe instances are conditionally independent given the classes
-
Learn Algorithm (See slide 15)
-
Estimate $P(c)$, call $\hat{P}(c)$
-
For each word, $w_{i}$ in vocabulary, estimate, $\hat{P}(w_{i}|c) = \frac{count(w_{i},c)}{\sum_{w\in V} count(w,c)}$
-
-
Classify:
-
Compute above $c_{NB}$ using $\hat{P}$
-
-
Smoothing done using Laplace smoothing with Pseudocounts!
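-
A minimal sketch of the pipeline above (bag-of-words counts, Laplace-smoothed likelihoods, log-space argmax) on made-up toy documents:

import math
from collections import Counter, defaultdict

# Toy labeled documents (made up for illustration)
docs = [("great fun film", "pos"), ("fun and great", "pos"),
        ("boring plot", "neg"), ("boring boring film", "neg")]

class_docs = defaultdict(list)
for text, label in docs:
    class_docs[label].append(text.split())

vocab = {w for text, _ in docs for w in text.split()}
priors = {c: len(ds) / len(docs) for c, ds in class_docs.items()}
word_counts = {c: Counter(w for d in ds for w in d) for c, ds in class_docs.items()}

def loglikelihood(w, c):
    # Laplace-smoothed estimate of P(w | c)
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][w] + 1) / (total + len(vocab)))

def classify(text):
    words = [w for w in text.split() if w in vocab]
    scores = {c: math.log(priors[c]) + sum(loglikelihood(w, c) for w in words)
              for c in priors}
    return max(scores, key=scores.get)

print(classify("great film"))   # pos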
-
Sentiment Analysis
-
Replace count-based bag-of-words with binary indicators
-
Negations have to be handled
Lecture 4: Na\"ive Bayes & Sentiment Analysis, Continued
Evaluating Classifiers
-
Goal: generalization, and estimating performance
-
But, other goals possible
-
Estimate generalization performance of model/hypothesis
-
Tuning parameters
-
Comparing two algorithms on specific task
-
-
Can't answer with perfect certainty
-
An estimator must be chosen; we then determine its distribution and bound the estimate
-
Must work regardless of distribution of training/test data
-
-
Statistical variation related to specific application is significant
-
Plan experiments well, don't test on training data!!
-
It biases the performance estimator
-
And don't let duplicates into training/testing sets if using data augmentation
-
Holds for validation set used for early stopping/tuning parameters!
-
-
Confidence intervals are used
-
$error_{\mathcal{D}}(h)$ is the expected 0-1 loss of hypothesis $h$ on instances drawn according to $\mathcal{D}$. If the test set has $N$ examples drawn independently (and independently of $h$)
-
0-1 loss is the probability that $h$ makes a mistake
-
and $N \geq 30$
-
Then, with 95% probability, the true error lies in $error_{V}(h) \pm 1.96\sqrt{\frac{error_{V}(h)(1 - error_{V}(h))}{N}}$, where $error_{V}(h)$ is the error measured on the test set
-
So, if the hypothesis misclassifies 12 of the 40 test examples ($error_{V}(h) = 0.30$), then with 95% confidence the true error lies in that range.
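-
A sketch of that interval for the example (12 errors on $N = 40$), using the normal approximation:

import math

def error_confidence_interval(errors, n, z=1.96):
    # 95% interval for the true error under the normal approximation
    e = errors / n
    half_width = z * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

print(error_confidence_interval(12, 40))   # roughly (0.158, 0.442)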
-
Confidence interval can be generalized by changing real-number coefficient
-
Based on the binomial distribution; the measured error is an unbiased estimator of the true mean error
-
Can approximate binomials by normal distributions
-
And so, then, based on standard deviations in normal distribution
-
Can be justified due to the Central Limit Theorem
-
-
Steps
-
Pick Parameter to estimate
-
Choose Estimator
-
Determine Probability Distribution
-
Find appropriate interval
-
-
Comparing Learning Algorithms
-
Can't just compare error rates on single test set
-
Use instead $K$-fold cross validation and paired $t$-test
K-fold cross-validation
-
partition the data set $X$ into $K$ equal-sized subsets $X_{1}, \ldots, X_{K}$, each of size at least 30
-
For i from 1 to k
-
use $X_{i}$ for testing
-
Let the training set be everything except $X_{i}$
-
Train each algorithm on the training set
-
Let $p_{i}^{j}$ be algorithm $j$'s error on the test set
-
$p_{i} = p_{i}^{1} - p_{i}^{2}$
-
-
The error difference is estimated as $p = \frac{1}{K}\sum_{i} p_{i}$
-
Estimate the confidence that the true difference is $< 0$.
-
Confidence that $L_{1}$ is better than $L_{2}$ on the task
-
One-sided test with confidence from $t$ distribution with $K - 1$ DOF
-
Then calculate, true difference of expected error is $p + t_{c,K-1} s_{p}$
-
$s_{p} = \sqrt{\frac{1}{K(K-1)} \sum_{i=1}^{k} (p_{i} - p)^{2}}$
-
Then, check if $p < - t_{c,K-1} s_{p}$
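-
A sketch of the paired test: compute $p$ and $s_{p}$ from per-fold differences (the numbers here are made up) and compare against the one-sided $t$ critical value:

import math

def paired_t_statistics(p_diffs):
    # p_diffs: per-fold error differences p_i = p_i^1 - p_i^2
    k = len(p_diffs)
    p_bar = sum(p_diffs) / k
    s_p = math.sqrt(sum((p - p_bar) ** 2 for p in p_diffs) / (k * (k - 1)))
    return p_bar, s_p

p_diffs = [-0.02, -0.01, -0.03, 0.00, -0.02, -0.01, -0.02, -0.03, -0.01, -0.02]
p_bar, s_p = paired_t_statistics(p_diffs)
t_crit = 1.833   # one-sided 95% critical value of the t distribution, K - 1 = 9 DOF
print(p_bar < -t_crit * s_p)   # True -> L1 better than L2 at 95% confidence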
-
-
Must understand that showing a learning algorithm performs better at 95% confidence still leaves a 5% chance of being wrong
-
With multiple comparisons (e.g., four such tests, each at 95% confidence), the overall chance that at least one conclusion is wrong grows to $1 - 0.95^{4} \approx 18.5\%$
-
Must account or use a more appropriate test!
-
More specific measures
-
Prediction error may not tell the whole story!
-
Bad test data may mean that accuracy isn't helpful
-
Confusion matrices can be used instead
-
True class vs predicted class
-
Cells for true positives, false positives, true negatives, false negatives
-
Generalizes to many classes
-
Normalized to fractions and represented as heat map
-
-
Precision-recall curves
-
Precision is fraction of retrieved that are positive
-
recall is fraction of positives retrieved
-
Can vary threshold to trade precision for recall
-
Compare curves based on containment
-
More suitable than ROC for some large numbers of negatives
-
Uses $F_{\beta}$ to combine at a specific point
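-
A small sketch computing precision, recall, and $F_{\beta}$ from confusion-matrix counts:

def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

print(precision_recall_f(tp=30, fp=10, fn=20))   # (0.75, 0.6, ~0.667)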
-
Lecture 5: Logistic Regression
-
Recall Na\"ive Bayes, which is generative
-
Logistic regression is discriminative
-
A weighted sum of the instance's features is passed through a sigmoid or softmax, giving the probability of $x$ being class $c$
-
Output is probability, performance measured with cross-entropy, a loss function
-
Optimize using stochastic gradient descent
-
Logistic models built into neural nets for deep learning
Sigmoid
-
Each instance is a vector $x = [x_{1}, x_{2}, \ldots, x_{n}]$ of features
-
Supervised training, each instance has label, $y \in \{0, 1\}$
-
Output of model on input is $P(y = 1 | x)$
-
Weighted sum $z = wx+b$, where $w_{i}$ is feature weight, $b$ is the bias term
-
Final output is $P(y = 1 | x) = \sigma(z) = 1/(1 + e^{{-z}})$
-
Discrete valued prediction thresholded at 1/2
-
Called logistic/sigmoid function
-
Squashes $z$ into $[0, 1]$! (sometimes in older stuff called a "squashing function")
-
Assume classifier is trained with given values and biases. (5.1.1 from textbook)
-
Compute weighted sum, run through logistic function, and compare to threshold
-
Training
-
$w$ and $b$ must be discovered
-
Given labeled training corpus, map documents to feature vectors, and attach label
-
Initialize $w$ and $b$ randomly!
-
Evaluate each training vector predicting $P(1|x)$ and use loss function $L$ to compare prediction to true label
-
stochastic gradient descent used to update $w$ and $b$, reducing $L$ on $x$
-
Repeat
Loss Functions
-
Many available
-
Cross-entropy used for classification/prediction of probabilities
-
$\hat{y} = \sigma(wx + b)$ is the prediction, $y \in \{0, 1\}$ is true label
-
$L_{{CE}}(\hat{y}, y) = -(y \log \hat{y} + (1 - y)\log(1 - \hat{y}))$
-
Since $y$ is binary, one term is $0$ and $L_{CE}$ is negative of log of probability of correct class. Larger predicted probability, the smaller the loss!
Stochastic Gradient Descent
-
Goal is to find parameters $(w, b)$, written $\theta$ collectively, that minimize the average loss on the entire training set!
-
$\theta^{*} = \mathrm{argmin}_{\theta}\frac{1}{m} \sum_{i = 1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}; \theta)$
-
Continuously differentiable, as $L_{{CE}}$ is by def'n, allows for gradient descent
-
Consider a linear unit: output is weighted sum of inputs with no $\sigma$ applied (weighted sum, bias, no sigmoid!)
-
If there's only one feature, this is regression, that is, find a straight line that best fits training data
-
Let pass through origin, $b = 0$
-
Slope is just a single parameter, $w_{1}$
-
$J(w_{1}) = \sum_{i}L(\hat{y}^{{(i)}}, y^{{(i)}}; w_{1})$
-
Goal is to minimize $J(w_{1})$
-
Writing generally, can be plotted and calculated.
-
Take derivative & solve for 0
-
-
Calculated using $w^{{\prime}} = w - \eta \nabla J(w)$
-
$\eta$ is the learning rate
-
Gradient, $\nabla J(w)$, is partial derivatives $\left[ \frac{\delta J}{\delta w_{i}}\right]_{i=1}^{n}$
-
$\frac{\delta J}{\delta w_{i}}$ is a partial derivative telling how much a change in $w_i$ changes $J$
-
-
$J$ is error, compute gradient of $J$ with respect to parameters, apply gradient descent
-
Cross-entropy loss function with respect to each weight (see textbook for the function)
-
$x_{j}(\sigma(wx+b) - y)$ is the partial derivative for single instance
-
mean of partials for multiple instances
-
Plug into formula to get weight update
-
Update is smaller if prediction is closer
-
The larger the magnitude of feature $x_{j}$, the larger the update to $w_{j}$, since it is responsible for more of the error
-
If $\hat{y} > y$, $wx + b$ needs to decrease
-
-
-
Standard GD, mean of vectors across entire set
-
Esitmate of expected gradient on random instance
-
Can require huge memory
-
-
Stochastic gradient descent randomly partitions the training set into minibatches of size $m$
-
Compute average per minibatch
-
Update per minibatch
-
Each pass through all is an epoch
-
Saves memory
-
Gradient estimates have high variance, which helps avoid getting stuck in local minima in non-convex situations
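-
A numpy sketch of binary logistic regression trained with minibatch SGD, using the per-instance gradient $x_j(\sigma(wx+b) - y)$ from above; the toy data is made up:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy separable data: label 1 when the feature sum is positive
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(50):
    order = rng.permutation(len(X))
    for batch in np.array_split(order, 10):          # minibatches
        y_hat = sigmoid(X[batch] @ w + b)
        err = y_hat - y[batch]                        # sigma(wx + b) - y
        w -= eta * X[batch].T @ err / len(batch)      # mean of per-instance gradients
        b -= eta * err.mean()

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(acc)   # close to 1.0 on this toy data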
-
Regularization
-
Is perfect fit the best model?
-
Recall, the goal is to generalize well; there may be noise in the data after all
-
Perfect fits tend not to generalize well
-
Training accuracy vs simplicity
-
Want both accuracy and simplicity
-
$h$ is a model, training error on $\mathcal{X}$, $error_{{\mathcal{X}}}(h)$ is sum of loss function over size of dataset
-
Generalization error, $error_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[L(h(x), y)]$
-
Model overfits if there's an alternative model such that $error_{\mathcal{X}}(h) < error_{\mathcal{X}}(h^{\prime})$ and $error_{\mathcal{D}}(h) > error_{\mathcal{D}}(h^{\prime})$
Regularizers
-
When many params, more likely to overfit
-
Training adds regularizer, $R(\theta)$ to argmin, weighted with $\alpha \geq 0$
-
L2, the squared L2 norm of the parameters
-
L1, the L1 norm (sum of absolute values) of the parameters
Multiclass
-
Done with "Softmax": when there are $> 2$ classes, one set of weights and one bias for each class
-
Then, $z_{c} = w_{c} x + b_{c}$
-
Probability distribution calculated using softmax, not $\sigma$
-
$\mathrm{softmax}(z_{c}) = \frac{exp(z_{c})}{\sum_{{c^{{\prime}} \in C}} exp(z_{c^{\prime}})}$
-
Generalize cross-entropy to handle multiclass, y is generalized
-
Labeled with one-hot vector
-
When training, $\delta(-\log \hat{y}_{c^{*}})/\delta w_{i}$ for each weight, update with SGD
-
Lecture 6: Vector Semantics and Embeddings
-
Vector representations of words
-
Success of tasks depends on document/word representations
-
Bag-of-words representation from ch4, engineered features from ch5
-
BOW, each word corresponds to a dimension in a $|V|$-dimensional vector
-
-
Dense vector representation
-
embedded representation, represents words in numeric vectors of smaller (10, 50, maybe more) dimensions
-
Want similar words to cluster
-
Similar contexts imply similar meanings
-
Learn embedding by grouping words based on context
-
-
Consider: lexical, vector semantics; words and vectors; cosine similarity; TF-IDF; Word2Vec
Lexical Semantics
-
How do words relate?
-
Synonymy – similar meanings
-
Antonymy – opposing meanings
-
Similarity – based on what is similar, "car" and "bicycle"
-
Relatedness – not similar, relating to each other, "coffee", "cup"
-
Semantic field – covering some sort of semantic domain
Vector Semantics
-
Want to numerically represent words such that those that share some sort of lexical semantics cluster
-
Because regression models (and NNs) require numeric inputs!
-
Trained model that predicts positive when vanish appears should respond similarly to disappear
-
Representations grouping words by context
-
Two-dimensional projection of a 60-dim embedding (t-SNE)
Words & Vectors
-
Term-Document matrix, $|V| \times |D|$ matrix. Column $d$ is $d$'s BOW representation
-
Even bag of words representation clusters!
-
Rows of TD matrices can represent words
-
Context then is the set of documents that it appears in and how often
-
Similar words will have similar representation
-
-
Word-Word or Term-Context matrix can be used ($|V| \times |V|$)
-
$(i,j)$ hold the number of times $i$ and $j$ appear in the same context
-
Context can be document, fixed window, etc.
-
-
Cosine Similarity
-
Compute similarity of words/documents
-
Often, the cosine of the angle \[\frac{v \cdot w}{\Vert v\Vert \Vert w \Vert} = \frac{\sum_{i = 1}^{N} v_{i} w_{i}}{\sqrt{\sum v_{i}^{2}}\sqrt{\sum w_{i}^{2}}}\]
-
Normalize vectors by dividing by length and take dot product of unit vectors
-
1 if identical, 0 if orthogonal
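-
A numpy sketch of the cosine similarity formula above:

import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 0.0])
w = np.array([2.0, 4.0, 0.0])
print(cosine_similarity(v, w))                            # 1.0 (same direction)
print(cosine_similarity(v, np.array([0.0, 0.0, 3.0])))    # 0.0 (orthogonal)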
-
TF-IDF: Term Frequency-Inverse Document Frequency
-
Must be careful about context definition
-
Some words appear frequently in any corpus
-
Can inflate similarity measures!
-
Weights terms by frequency, reducing weight if appearing in many documents
-
Term frequency is the log of the count of $t$ in $d$, plus 1 to avoid a log of zero \[\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t, d) + 1)\]
-
Log used to differentiate between orders of magnitude \[\mathrm{idf}_{t} = \log_{10}\left( \frac{N}{\mathrm{df}_{t}}\right)\]
-
$N$ is the number of documents
-
$\mathrm{df}_{t}$ is the number of documents containing $t$
-
-
Weight of $t, d$ is $w_{{t,d}} = (\mathrm{tf}_{{t,d}})(\mathrm{idf}_{t})$
-
Word-word matrix then weight by tf-idf
-
Vector for term computed as normal
-
normalize/compute cosine similarity
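-
A sketch of tf-idf weighting over a tiny, made-up document collection:

import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran home"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))   # document frequency

def tf_idf(term, doc):
    tf = math.log10(doc.count(term) + 1)
    idf = math.log10(n_docs / df[term])
    return tf * idf

print(tf_idf("cat", tokenized[0]))   # > 0: "cat" appears in 2 of 3 docs
print(tf_idf("the", tokenized[0]))   # 0: "the" appears in every doc, so idf = 0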
Word2Vec
-
From Mikolov et al.
-
vectors from tf-idf long, sparse
-
Dense, short vectors preferred
-
Fewer params to learn!
-
Captures synonymy better!
-
Turns out, it works better in practice!
-
Intuition
-
Given $w$, rather than count how often it appears near a word
-
Predict how likely to appear near it
-
Trainable using logistic regression
-
Supervised, since any word appearing near another is a positive example
-
Negative sampling also needed
-
Weights can be used as embeddings
Training
-
Given $t$
-
and context words $c$ as positive example
-
Sample negative examples according to weighted unigram frequency, $P_{\alpha}(w) \propto \mathrm{count}(w)^{\alpha}$, for some parameter $\alpha < 1$ (commonly 0.75)
-
Train with logistic regression, random weights
-
Final weights are given embeddings
-
Then, given $t$ and $c$, predict the probability that they appear near each other; $t$ and $c$ are represented as embeddings, similarity is the dot product, and the probability comes from the sigmoid (logistic) function
-
The probability of a negative pair is 1 minus the above
-
-
Given positive pairs and negative pairs, maximize the sum of the log probabilities of the positive pairs plus the sum of the log probabilities that the negative pairs are labeled negative
-
Optimization finds embeddings maximizing $c \cdot t$ for observed context words $c$ and minimizing $n \cdot t$ for negative samples $n$ (sketched below)
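-
A numpy sketch of the skip-gram negative-sampling objective for one (target, context) pair with $k$ negative samples; the embedding vectors here are random stand-ins:

import numpy as np

rng = np.random.default_rng(1)
d = 50
t = rng.normal(size=d)            # target embedding (stand-in)
c_pos = rng.normal(size=d)        # observed context embedding
c_negs = rng.normal(size=(5, d))  # k = 5 sampled negative context embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Loss for one training pair: -( log sigma(c.t) + sum_k log sigma(-n_k.t) )
loss = -(np.log(sigmoid(c_pos @ t)) + np.log(sigmoid(-c_negs @ t)).sum())
print(loss)   # training adjusts the embeddings to reduce this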
-
-
$w$ can be both target and context. Training gives two embedding matrices, a target matrix $T$ and a context matrix $C$
-
These are the regression parameters, row $w$ from $T$ is $d$-dim representation of word, column $w$ of $C$ is $d$-dim representation
-
$T$ is often final embedding
-
concatenation could be option
-
Visualization
-
High dimension representations are hard to visualize
-
Hierarchically clustered dendrograms are an option
-
t-SNE projects to 2 dimensional space
Semantics
-
Embeddings can allow semantic arithmetic
-
$(king - man) + woman \approx queen$
-
-
Adjective to comparative to superlative
-
Country-capital common distance
-
Can reveal biases!
Miscellaneous
-
word2vec applied to Materials Science paper abstracts
-
Predicted clustering groupings in table
-
Predicted properties of materials before scientists discovered them
-
-
.*2vec exists and extends beyond text
-
node2vec embeds representation of graphs
-
Context must be defined for embedding
-
Presentations: Project Ideas
Deleting
-
Movie Critic Analysis
-
A form of sentiment analysis, but perhaps on a continuum
-
Training a model to discriminate between professional and amateur critics
-
IMDb and Stanford Sentiment Treebank datasets
-
-
Music Analysis
-
Why are some songs more popular than others
-
Why things related to ranking happen
-
Sentiment analysis on lyrics
-
predictive from sentiment, etc.
-
-
Datasets from Billboard charts
-
-
Professor Ratings
-
Trends, correlation with language
-
RateMyProfessor datasets, but most focus on correlation between numerical scores
-
Scooby Data Doo
-
Intent Classification
-
Assigns document intent
-
Useful for businesses to provide service, chatbots, automated assistants
-
-
Toxic Comment Classification
-
More interactions online, but people are kinda toxic
-
Understanding interaction is beneficial in making the internet a safer place
-
Filtering social networking sites
-
Understand how hate speech etc., spread through network sites
-
Google Translate 2.0
-
Sentiment Analysis
-
Marketing use
-
Employee satisfaction
-
real-time information
-
public opinion
-
spam detection
-
-
Dialog Generation
-
Goal-oriented
-
Conversational systems
-
Often a mix of the two
-
Customer Service
-
Chatbots
-
Blended approach
-
Pattern recognition, state machines, Markov processes, vector space, neural networks
-
-
Grammar Correction
-
Spelling, grammar, punctuation errors
-
Sequence tagging
-
CNNs
-
transformer-models
-
Proof By Intimidation
-
Taxonomy Translation
-
Experts use special, domain ontologies for efficient communication, jargon
-
Developing models that can be used to translate from experts to non-experts
-
For a domain, learn an ontology for experts; then a general, for laymen; find connections between the two
-
-
Common Sense
-
Reasoning and Validation in language
-
Uses understanding and generation
-
High accuracy, only text-to-text
-
Applicable in many areas: taxonomic reasoning, recommendation systems, chatbots, MT
-
Latural Nanguage
-
Speech processing
-
Recognizing spoken language
-
-
Semantic Analysis
-
Understanding meaning
-
-
Lemmatization
-
De-inflecting words, returning to original form
-
-
Text Classification
-
Categorization
-
-
Summarization
-
Simplify document while communicating core message
-
Not always generative
-
Removing redundant information
-
Abstracting/extracting
-
-
Sentiment Analysis
-
Exactly what's been previously covered
-
Aspect-based analysis, looks for multiple indicators/sentiments
-
Homo Bayesians
-
Data to text generation
-
auto-correct/text completion
-
Various sorts of neural models, BOW, RNNs w/ Attention, etc
-
Text to SQL, etc
-
-
Dialogue State Tracking
-
Component of dialogue system
-
Various ways of handling state
-
Belief Tracking
-
Bricc Squad
-
POS tagging
-
Solves other NLP problems
-
Helps with language learning
-
Meaning between homophones
-
Lexical
-
Rule-based
-
Probabilistic
-
Deep Learning
-
-
Missing Elements
-
Filling things in
-
Missing elements in foreign languages
-
Fused head ident (find phrase missing noun; rule based or ML)
-
Fused head resolution (finding in context)
-
Bridging anaphora
-
Bridging anaphora resolution
-
Lecture 7: Neural Networks and Neural Language Models
-
Consider human neural nets
-
$10^{10}$ neurons
-
$10^{-3}$ s neuron switching time vs $10^{-10}$ s switching time in silicon
-
Connections per neuron, $10^{4}$ to $10^{5}$
-
Scene recognition time is incredibly short (about 0.1 s)
-
100 inference steps, which doesn't seem like enough
-
But computation is massively parallel
-
-
Goal is to simulate this
-
using neuron-like switching
-
weighted interconnections among units
-
highly parallel, distributed
-
Want to tune weights automatically
-
Many layers learn to extract features, rather than manually creating these features
-
Note: major differences between ANNs for ML and ANNs for modeling biology
-
-
Used
-
high dimensional input, discrete or real-valued
-
output can be discrete or real-valued
-
Output is vector
-
Data may be noisy
-
Target function form is unknown
-
Readability is unimportant, and long training is acceptable
-
-
History
-
Linear units, perceptron algorithms (1940)
-
couldn't handle data not linearly separable
-
Couldn't train multi-layer networks, but aware of usefulness
-
-
Multilayer networks trained with Backpropagation (1980)
-
Replaced with SVMs, Boosting in 1990s
-
-
Deep Architectures (2000s)
-
Better hardware, software support for deep networks
-
Uses backprop with larger datasets, algorithmic improvements
-
improve performance considerably
-
-
Impressive applications
-
-
The Inevitable — Skynet, TBD
-
Basic Units
-
Takes weighted sum of input, $z$
-
Passes through non-linear activation function $f$
-
If $f$ is logistic, it's the same as logistic regression function
-
Single nodes are limited in what they represent
-
Linear Threshold Unit, the perceptron
-
Activation thresholds weighted sum at 0
-
Represents things like $\land$ well
-
Other functions aren't representable!
-
Because they're not linearly separable, XOR
-
Would be a network of units
-
The XOR Problem
-
XOR, intersection of two separators
-
$g_{1}(x) = 1x_{1} + 1x_{2} - 1/2$
-
$g_{2}(x) = 1x_{1} + 1x_{2} - 3/2$
-
So, $z_{i} = 0$ if $g_{i}(x) < 0$, and $1$ otherwise
-
Now, $g(z) = 1z_{1} - 2z_{2} - 1/2$
-
-
Remaps vectors $x$ to $z$ so that they are separable in the new space (a quick numeric check is sketched at the end of this block)
-
Input layer, output layer
-
hidden layer is $g_{1}$, $g_{2}$
-
Two-layer perceptron/feed-forward neural network
-
Non-linear activation in hidden output
-
Many dozens of hidden layers, several output nodes
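-
A quick numeric check of the hand-set XOR weights above, as a small sketch with a step activation:

def step(v):
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: the two separators g1, g2 from above
    z1 = step(1 * x1 + 1 * x2 - 0.5)
    z2 = step(1 * x1 + 1 * x2 - 1.5)
    # Output unit on the remapped (z1, z2) space
    return step(1 * z1 - 2 * z2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # matches XOR: 0, 1, 1, 0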
-
-
Activation functions: $z$ is weighted sum
-
Linear unit
-
Sigmoid/Logistic
-
Softmax (generalizes sigmoid)
-
Tanh
-
-
Hidden nodes
-
Rectified Linear Unit, $\max(0, z)$
-
Good default
-
Variants to handle $<0$ differently
-
-
Backprop
-
Training multiple layers means tuning parameters in all
-
Uses gradient descent minimizing a loss function
-
Cross-entropy is a common loss function
-
$L_{{CE}}(\hat{y},y) = -\sum_{{c \in C}}y_{{c}}\log\hat{y}_{{c}} = -\log\hat{y}_{{c^{*}}}$
-
-
Must compute gradient of loss function for each param, even several layers from output
-
Backprop feeds forward inputs to outputs, propagates back error via chain rule for derivatives
-
Simply, modularly decomposed
-
Computation Graph
-
Given complicated function, want to know partial derivative with respect to parameters
-
$f$ is represented via modular fashion
-
Decompose gradient into basic op
-
$\frac{\delta f}{\delta f} = 1$
-
Then partial with respect to edges leading, $\delta f/\delta a$ = 1, because partial with respect to sum is 1
-
Then, with respect to $x_0$ is $1w_0 = 3$ (via chain rule)
-
For $x = [1.0, 4.0]$, $\nabla_{w} f = [1.0, 4.0]$
-
-
Then, for $x = [1.0, 4.0]$ for $f(x) = \sigma(x)$, $\nabla_{w} f = [0.1966, 0.7866]$ (once it's done all the way through)
-
Modular, once we have a formula for gradient, we can apply anywhere in larger graph
-
Generalize to $J(w) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_{i}^{t} - y_{i}^{t})^{2}$
-
$w_{{5*}}$ and $w_{{6*}}$ tie to output units
-
Gradient/weight updates as before, then
-
$w^{{\prime}}_{{53}} = w_{{53}} - \eta\frac{\delta J}{\delta w_{{53}}}$
-
-
-
Multivariate chain rule says to sum over the paths from $J$ to $w_{42}$ – looks kinda ugly
-
Computationally simple! Compute for specific input
-
Modular form means once things are computed, plug values in and compute for earlier layers
-
Backprop algorithm
-
Submit input $x$, fed forward to outputs
-
Compute network loss
-
Prop error to compute loss gradient with respect to each weight, update weights
-
Automatic in TensorFlow, automatic differentiation based on computation graph
Putting Things Together
Hidden Layers
-
How many?
-
Deep networks build potentially useful representation via composition
-
Layer is vector-valued, non-linear
-
-
Performance improvement isn't only from more complex networks
-
Increasing layers increases chance of overfitting, must have lots of data with deep networks, needs longer training
-
Boolean functions representable with two layers
-
Bounded, continuous function represented with arbitrarily small error with two layers
-
Function with arbitrarily small error in three layers
-
But these are only existence proofs
-
Exponentially many nodes in layer
-
Might not be able to find correct weights
-
Overfitting and need for regularization noted
-
Regularization
-
Prevents overfitting
-
Add penalty term $R(\Theta)$ that is, the sum of the sizes of parameters
-
Other methods:
-
early stopping – when performance worsens on a validation set
-
Dropout – randomly ignore a subset of units (equivalently, their edges) during training
-
Data augmentation – copies of random instances with noise
-
Inject noise into hidden layers
-
Batch normalization of output before input to next layer
-
-
Be careful not to underfit also (overfit, then regularize)
Initialization!
-
Early on, random numbers near 0, $\mathcal{N}(0, 1)$
-
Nearly linear for logistic function, so GD works better
-
In deep networks this increases variance across layers, giving poor optimization due to vanishing gradients
-
-
Glorot initialization per layer: if a layer has $n_{in}$ inputs and $n_{out}$ outputs, initialize uniformly over $[-r, r]$ or from $\mathcal{N}(0, \sigma)$
-
$r = a \sqrt{\frac{6}{n_{in} + n_{out}}}$
-
$\sigma = a \sqrt{\frac{2}{n_{in} + n_{out}}}$
-
$a = 1$ for logistic, $a = 4$ for tanh, $a = \sqrt{2}$ for ReLU
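-
A numpy sketch of Glorot uniform initialization using the $r$ and gain $a$ above:

import numpy as np

def glorot_uniform(n_in, n_out, a=1.0, seed=0):
    r = a * np.sqrt(6.0 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-r, r, size=(n_in, n_out))

W = glorot_uniform(300, 100)                      # logistic activation: a = 1
W_relu = glorot_uniform(300, 100, a=np.sqrt(2))   # ReLU: a = sqrt(2)
print(W.shape, abs(W).max() <= np.sqrt(6.0 / 400))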
-
Optimizers
-
Variants on gradient descent
-
Momentum optimization
-
Momentum term $\beta$ keeps updates moving in same direction
-
Update looks like $m \leftarrow \beta m + \eta\nabla J(w)$, then $w^{\prime} = w - m$ (the momentum term accumulates past gradients)
-
Moves through small local minima to better ones and along a flat
-
-
AdaGrad
-
Standard descends too quickly, then crawls too slowly
-
$w^{{\prime}} = w - \eta\nabla J(w) \oslash \sqrt{s + \epsilon}$
-
$s = s + \nabla J(w) \otimes \nabla J(w)$
-
circled ops are elementwise
-
$s$ accumulates squared gradients, and the learning rate for each dimension is scaled accordingly
-
-
RMSProp
-
Exponentially decays old gradients to address this, adds $\beta$ to do so.
-
-
Adam
-
Combines Momentum and RMSProp
-
Uses iteration counter to prevent vanishing, etc.
-
Language Models
-
Recall language model can take input sequence, and predict probability over next words
-
Feed-forward does the same, takes input sequence, outputs probabilities
-
word2vec embeddings used
-
Pre-trained (faster) or trained with network (may be better)
-
Can fine-tune a pre-trained embedding to save time
-
Fixed word sequence input
-
Softmax output compared to target word
-
Trained with cross entropy loss
-
Fixed word sequence input
-
Embedding learned with model
-
Softmax compared to output
-
trained with cross-entropy loss
-
Summary
-
Neural units take a weighted sum and apply a non-linear activation
-
Multilayer networks used because single unit is limited
-
hidden layers allow learning of own representations – nonlinear feature extraction
-
Stochastic GD used for training
-
Calculated using back prop
-
-
Neural models predict word after sequence of words
-
Can use pre-trained embeddings or learn their own.
-
Lecture 8: Sequence Processing with Recurrent Networks
-
Earlier required fixed-size input
-
RNNs work on sequences, like text, biological sequences, video, audio, etc
-
Can create novel output, text in a specific style, music, etc
Basics
-
Recurrent cells have connections going backward as well as forward
-
At $t$, neuron receives input $x_{t}$ and its output $y_{{t - 1}}$
-
-
Recurrent layer can be built, where each node gets the vector $x_{t}$ and the vector $y_{{t - 1}}$ (that is, the whole layer's output)
-
Output depends on past, that is, memory/state
-
State at $t$ is $h_{t} = g(h_{{t - 1}}, x_{{t}})$, and output is $y_{t} = f(h_{{t - 1}}, x_{t})$
-
$h_{t}$ then stores important information about input
-
Forward pass – evaluate forward by unrolling in time
-
Training
-
uses backprop through time
-
Unroll through time, use backprop on expanded graph
-
Forward pass batch of sequences through network yields output
-
Output sequence evaluated using cost
-
Gradients propagated backward through unrolled network and params
-
Sequence loss, comparing two sequences
-
weighted average of cross entropy across sequence, which can target certain parts of sequence
-
-
BPTT means gradient flows farther, can "vanish" or "explode"
-
Can happen with any network, RNNs very susceptible
-
Clipping to the range $[-1, +1]$ mitigates explosion
-
Can also do batch normalization
-
Applications
Recurrent NL Model
-
Can use recurrent architecture to model language
-
Can predict the next word in a sequence given the current word & previous hidden state
-
Train by feeding in a sequence of words, output a softmax over the next word, train with cross-entropy
Generating Sentences
-
Feed in a start, generate output, output fed into network as next input, continue until end or length limit
-
Hidden state fed in with random/all zeros
-
Labeling
-
Given sequence of words, categorize each
-
Softmax/argmax over classes, POS tagging, named entity recognition
-
Can also classify an entire sequence, called a sequence to vector model
-
ignore all output except last
-
Compute loss on final output and apply BPTT
-
Stacked & Bidirectional RNNs
-
Can stack RNN layers on each other
-
bidirectional – process separately from end towards start
-
run in opposite directions, outputs combine in some way (concat, add, etc)
-
Can be used for sequence classification
-
LSTM & Gated Units
-
Long-term memory – vanishing & exploding gradients can be problematic
-
Can suffer from long time lags with long sequences
-
First inputs have diminishing impact
-
Add "long-term" memory
-
Accumulates information, and can decide to forget
LSTM
-
Vectors, $h_{t}$ is short-term, $c_{t}$ is long-term state
-
At $t$, some memories from $c_{{t - 1}}$ can be forgotten in forget gate, and new ones from input gate are added
-
Results are sent as $c_{t}$, $h_{t}$ and $y_{t}$ are from processing long-term in output gate
-
$g$ combines input with previous state
-
$f, i, o$ are gate controllers, vectors
-
$f$ is size $n$, and 0 or 1
-
$i$ controls remembering $g$
-
$o$ controls what long term goes out and to hidden
-
The network learns what to remember based on the input and the last hidden state
-
A peephole can be added, letting the last long-term state affect the gate controllers
GRU
-
Simplified version of LSTM
-
Merges long-term and short-term states
-
Merges forget and input controls
Words, Subwords, Characters
-
Character level modeling sometimes used
-
Can be helpful to have more detailed information
-
Can use other RNNs to process at finer level and combine with word-level embeddings
-
-
Can be done using a bi-RNN and concatenating its outputs, giving a character-level word embedding
Lecture 9: Encoder-Decoders, Attention and Contextual Embeddings
-
Text-generation models generalized and applied (e.g., machine translation)
-
Encoder-decoder architecture
-
Encoder takes input, produces embedded representation, context
-
Decoder generates output from context
-
-
Generalized further by going from context vector to sequence contexts and attention mechanism
Text Generation
-
NLM can generate text sequence
-
Must feed start symbol
-
$y_{{t}} = f(h_{t})$, $h_{t} = g(h_{t - 1}, x_{t})$, $x_{t} = y_{t - 1}$ (this is an autoregressive model)
-
-
Can generalize from a prefix (NLM completes existing sequence)
-
Prefix input triggers hidden state vectors, outputs from prefix ignored
-
hidden state captures context
-
-
Can be used for MT
-
NLM trained on bitexts – sentence concatenated with translation
-
Feed sentence 1 then generate translation
-
Encoder-Decoder Network
-
Previous model generalizes to encoder-decoder
-
Encoder input generates states
-
Context vector is a function of hidden states, the embedded representations
-
Decoder takes $c$ and creates new states and outputs
-
Encoder can be any RNN, stacked biLSTM, etc (same with decoder)
-
c is final step, top layer encoder concatenated
-
Initialise decoder's state = c
-
Decoder state is function depending on output, previous state, context
-
Attention
-
Context vectors passed from encoder to decoder are embedded representation of input sequence
-
$c$ represents the entire input
-
Can't ensure context vector is sufficiently large to represent context!
-
Can use attention mechanism
-
context becomes sequence of vectors
-
Decoder focuses on relevant subset
-
-
-
Static context vector becomes a sequence of context vectors
-
Uses a score function, $score(h_{i - 1}^{d}, h_{j}^{e})$, giving the relevance of an encoder state to the current decoder state; often a bilinear form with a weight matrix that can be learned
-
Concatenate scores over j into relevance vector
-
Normalize with softmax, and sum over j by encoder states
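-
A numpy sketch of one decoder step of attention using a simple dot-product score: score the encoder states, softmax over $j$, and take the weighted sum as the context vector (the vectors here are random stand-ins):

import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))   # h^e_j for a 6-token input, hidden size 8
dec_state = rng.normal(size=8)         # h^d_{i-1}

scores = enc_states @ dec_state                    # score(h^d_{i-1}, h^e_j), dot product
alpha = np.exp(scores) / np.exp(scores).sum()      # softmax over j
context = alpha @ enc_states                       # c_i = sum_j alpha_ij * h^e_j
print(alpha.round(2), context.shape)               # attention weights, (8,)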
-
-
Architecture
-
Compute context vector at each step
-
-
Attention $\alpha_{{ij}}$ shows probability that output $y_{i}$ is aligned to a given input $x_{j}$
Lecture 9: Information Extraction
-
nat lang is unstructured, difficult to analyze automatically
-
IE identifies important information and puts it into structured form
-
named entity recognition (identify type of nouns)
-
relation extraction (ident. semantic relations among entities)
-
event extraction (ident events and organize temporally)
-
template filling (analyze docs, fill out standard forms)
-
Named Entity Recognition
-
Named entities are anything referred to by a proper name
-
Requires identifying what text is the entity (using a span of text), and then classifying it
-
Can classify as people, organizations, locations, geo-political entities, facilities, etc.
-
-
IO/IOB tagging
-
Inside, outside, inside/outside/beginning tags
-
Neural Method
-
Character level embeddings, GloVe word embeddings
-
BIDI LSTM outputs concatenated and fed into softmax
-
Output is distribution over IOB labels
-
Ignores dependencies
-
Conditional random field used to constrain outputs
-
Probabilistic graphical model, edges denote influence
-
-
$x$ is input, $y$ is possible predictions,
-
$s(x, y) = \sum_{i} A_{y_{i},y_{i + 1}} + \sum_{i} P_{i,y_{i}}$
-
$P$ is the softmax probability, $A$ is the transition score from tag $i$ to tag $j$
-
Then, calculate softmax over scores
-
Evaluation
-
Recall, Precision, $F$ Measure used
-
Be careful with result interp, what may initially appear as one error can be multiple
Relation Extraction
-
Comes after NER
-
Relations depend on domain
-
Classes exist for each entity
-
Then must learn to classify and connect relations
-
Ordered tuples, many different forms of notation
-
Relations depend on domain, and many have specific models
Pattern-Based
-
Uses static patterns to recognize relations
-
uses templates, and can be very precise, because it can be easily customized for a domain, but may lack in recall
Supervised
-
Given a trained classifier for each
-
Enumerate all entity pairs and find those that are related
-
Classify relation of related pairs
-
Classifier can be whatever works for the domain
-
What features?
-
Headwords/concatenations
-
Bags-of-words, bigrams
-
NER types
-
Syntactic models
-
Semi-supervised
-
Labeled data is scarce, use seed patterns/seed tuples
-
Build a bootstrap relation set
-
Then use similarity to find new, high-confidence tuples, for a given relation
-
Extract patterns using these tuples
-
Weight patterns with confidence based on known tuples
Distant Supervision
-
Use a database like DBPedia, acquires large number of seeds
-
Identify matching sentences, extract rich feature set
Extracting Times
-
Similar to NER, can differentiate between absolute, relative and durations
-
Rule-based, sequence labeling
-
Normalized with respect to specific date for calculation/comparison
-
ISO standard provides XML tags
Event Extraction
-
Ident events in text, assignable to points in time
-
IOB tagging via supervised learning
-
Tends to be built on verbs
-
Features depend on many things
-
Timing information can provide temporal ordering, using normal relation extraction
-
Time-bank corpus can be used
Template Filling
-
Some situations/events have standard information
-
Templates represent these, with slots to fill
-
Recognize templates, then extract and fill slots.
Lecture 10: Question Answering
-
Factoid questions – answers are simple facts in short text expressions
-
IR-Based – search for span of text to return answer
-
Knowledge-Based build semantic rep, and uses deduction to answer from KB
-
Hybrid uses a combination of the two
Information-Retrieval Based
-
Given question, find short snippet from a collection
-
Process question for IR to analyze answers
-
Identify relevant documents
-
Extract passages
-
Filter to identify relevant passages
-
-
Must also do answer-type detection, decides what the answer will look like in the end
-
Question Processing
-
Identifies a query, and other information
-
Identifies focus
-
Identifies question type
-
Extracts tokens to submit
-
query is tf-idf and cosine similarity, can be enhanced with bigrams
-
If documents are limited, query expansion can be used (adding terms with a thesaurus, varied suffixes)
-
-
Answer types look at Named Entity types, filter IR on that
-
Can use taxonomy of types (WordNet)
-
Classifier trained, uses embeddings of words, etc
-
-
-
Passages come from ranked set
-
identify relevant paragraphs/sentences
-
partition documents into passages
-
NER used for filtering
-
Rank with learned model
-
-
Answers extracted with Span labeling
-
Return matching named entity if answer type is specified
-
Problem when the answer type is unavailable/non-existent (consider definition questions)
-
Use question keywords matching
-
words not in query
-
Length of sequence of keywords in candidate
-
hand-crafted patterns matching candidate answers
-
-
Neural Methods
-
DNN, uses SQuAD; computes embeddings for questions and embeddings for tokens
-
Compare embeddings; uses reading comprehension tasks
-
SQuAD dataset
-
Question $q$, passage $p$; $p_{s}(i)$ is the probability that $p_{i}$ starts the answer span, $p_{e}(i)$ is the probability that $p_{i}$ ends the answer span
-
Similarity scores used to predict start/end of answer span
-
Bi-LSTMs with attention, for question and passage; similarity scores used
-
$E$ is glove embedding
-
Input sequence of embeddings to get states
-
Combine into a query vector, being a weighted softmax sum, with $b$ and $w$ being learned weights
-
Passage embedded to vector sequence compared to $Q$
-
Computed similarly, using GloVe and PoS/NER tokens
-
-
Align attention between question and passage
-
-
Knowledge-Based
-
Answers in structured DB, no need for IR
-
Must be able to create a query
-
Requires Semantic Parsing to form SQL or predicate Calculus
-
Can be done using hand-written features or learned
-
Supervised
-
Dependency parsing, has things like tmod, nsubj
-
Uses labeled examples (question-query pairs)
-
learn general rules from inductive logic programming
-
-
Semi-supervised
-
distant supervision/open information extraction
-
REVERB is an open information extractor
-
Extracts triples of strings and aligns to things like FreeBase (or other structured relations)
-
Match subject and argument to relations
-
Then learn mapping between natural language and FreeBase
-
DeepQA
-
IBM Watson!
-
Clinical Decision Support
-
Uses both methods
-
Many candidates proposed and scored by evidence
-
Stages
-
Question Processing
-
Answer generation
-
Answer Scoring
-
Confidence merging/ranking
-
-
Parsing, NER, relation extraction
-
Focus, answer type, question classification
-
Generate answers from structure/text sources
-
Queries using extracted relations
-
Eliminates stopwords, upweights terms from focus
-
Generates score vector from type, time/space relations, supporting evidence
-
Merge based on equivalence/combine scoring features
Evaluation
-
Mean Reciprocal Rank
-
Assume questions answers provided
-
If system under test returns list of answers, score is reciprocal of rank of first correct
-
Rank is rank of highest correct answer
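-
A small sketch of mean reciprocal rank, given the rank of the first correct answer per question (None if no correct answer was returned):

def mean_reciprocal_rank(first_correct_ranks):
    # Ranks are 1-based; None contributes 0 (no correct answer returned)
    scores = [0.0 if r is None else 1.0 / r for r in first_correct_ranks]
    return sum(scores) / len(scores)

print(mean_reciprocal_rank([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ~= 0.458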
-
-
Exact Match – fraction of predictions that exactly match a correct answer
-
$F_{1}$ score – harmonic mean of precision and recall
-
Datasets like SQuAD, TREC QA, TriviaQA, WikiQA
Lecture 11: Summarization
-
Similar to QA, but distills important information
-
Outlines
-
Extracts
-
Abstracts
-
Summary of threads
-
-
Single/Multiple documents
-
Generically, or focused on query
-
Steps
-
Content Selection – Chose text to use
-
Information Ordering – Decide how to order
-
Sentence Realization – turn into readable sentences
-
Single-Document
-
Single Document
-
Headline/outline brief extract
Content Selection
-
Binary classifier (important vs unimportant)
-
Unsupervised using salience
-
TF-IDF weight of the term in the document (tf: log of count + 1; idf: log of number of docs over document frequency)
-
Log likelihood ratio, $\lambda(i)$, difference in how likely to see in documents vs in corpus (explicitly probabilistic); if $-2\log(\lambda(i)) > 10$ (or another empirically chosen threshold), the term is significant
-
-
Supervised
-
Training set with human-created summary labels
-
Extracts: label sentence with 1 if in summary
-
Extract features and train classifier
-
But requires summaries are extracts, or align source and summaries to find best match
-
Longest common subsequence, edit distance, hidden Markov models can be used for this
-
-
-
Features can be labeled with things like
-
position where sentence appears
-
Cue phrases
-
Informativeness, topic signature
-
Sentence length
-
Cohesion – relatedness
-
Information Ordering & Sentence Realization
-
Single document preserves order of sentences in order of the document
-
Simplify/compress extracted sentences removing less relevant to summary
-
Rule-based or Machine-Learned
Multi-Document
-
A group of related documents
-
News stories on an event, etc.
Content Selection
-
Unsupervised approaches best (little data compared to single-document)
-
Similar methods
-
But, must be aware of redundancy across multiple documents
-
Can add a redundancy penalty term for each sentence, based on cosine similarity or alignment with already-selected sentences
-
Clustering related sentences and choosing center is also possible
Information ordering
-
Can't sequence in order of appearance in a document
-
News articles or documents with dates can sort chronologically
-
Must maintain coherence
-
Adjacent sentences: $s_{1}$ causes $s_{2}$
-
Lexical chains: synonyms/hypernyms
-
Consistency: pronoun reference to same noun consistent and unambiguous
-
-
With pairwise compatibility between sentences, ordering becomes a graph problem
-
Sentences are nodes, edge weights are compatibilities
-
Permutation maximizing sum of compatibilities
-
Form of TSP – intractable, heuristics exist
-
Sentence Realization
-
Still might need to improve coherence
-
Coreference resolution, apply some rewrite rules
Focused
-
Summary in response to particular queries
-
Like QA, but for non-factoid questions
-
Many documents potentially
-
Overlap with query, cosine similarity
-
Specialized systems for query type
-
Information Extraction to identify important information
Evaluation
-
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) N-gram overlap between generated summaries and ground-truths
-
Gross equation (slide 20)
-
For each bigram, how many times did generated summary use it?
-
Normalized with total bigrams in reference
-
A recall measure
-
Rouge-L measures longest-common subsequence
-
Rouge-SU measures skip bigrams (additional words between pairs)
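-
A sketch of ROUGE-2 as described above: clipped bigram overlap divided by the total bigrams in the reference:

from collections import Counter

def rouge_2(candidate, reference):
    # Bigram recall: overlapping bigrams (clipped) / total bigrams in the reference
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand, ref = bigrams(candidate.split()), bigrams(reference.split())
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / sum(ref.values())

print(rouge_2("the cat sat on the mat", "the cat was on the mat"))   # 3/5 = 0.6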
-
Pyramid method/summary content units focus on units of meaning more relevant
Lecture 12: Machine Translation
-
challenging, complex
-
Must be quick, compact, and often deals with low data transmission
-
Many variations despite fundamental similarities
-
Can't translate word-for-word, often need to add, align, delete
-
Different kinds
-
Rough, basic ideas, web search, tourism
-
Human-post-edited for high-volume CAHT
-
Fully automatic high-quality
-
Classical Machine Translation
-
Direct – word-by-word through input text
-
Transfer – parse, translate structure, generate output
-
Interlingua – parse input, translate to intermediate representation, generate output
-
Vauquois Triangle captures approaches
Direct
-
Morphological analysis, getting word forms
-
Dictionary translates individually
-
Reorder by rules
-
Morphological generation, ensures correct form
-
Ignores structure, idiom, etc
Transfer
-
Alter structure to conform
-
Leverages contrastive knowledge, differences between two specific languages
-
Steps
-
Analysis
-
Transfer
-
Syntactic, fixes syntax
-
Lexical, meaning from input to output
-
-
Generation
-
Interlingua
-
Language-independent form
-
Meaning is in a standard form
-
Semantic analyzer maps inputs to this
-
Representation could be predicate calculus, semantic decomp, event-based representations, etc
Statistical Machine Translation
-
A "noise channel model"
-
Probabilistic, outputs most probable translation
-
Given sentence $F$, $P(E|F)$ is probability output sentence is correct translation
-
Search for argmax $E$, by Bayes's rule
-
Probability of sentence $P(E)$ is from language model
-
$P(F|E)$ is from translation model
-
A decoder is used to output $E$
-
$F$ is a noisy translated $E$, and the goal is to recover $E$
-
Models
-
Phrase Based
-
Give $E$ and $F$, could go word by word, instead, use phrases
-
-
Group words into phrases; $I$ is the number of phrases in English, $J$ is the number of output phrases
-
Reorder
-
$\phi(f|e)$ is translation probability
-
$d(a_{i} - b_{i-1}) = \alpha^{|a_{i} - b_{i-1} - 1|}$, where $\alpha$ between 0 and 1 is the distortion parameter, based on movement of phrases
-
Probability is the product of $\phi$ and $d$ over the phrases
-
Must train $\phi$!
-
Can be done with count/normalize method
-
Can count if training corpus includes translations and phrase alignments
-
Word alignments between inputs and outputs could also be used as phrase mappings
-
HMMs are generative models for sequences
-
Can use the Viterbi algorithm to find the most likely state sequence, the Forward algorithm to compute a sequence's probability
-
Alignment Problem with $I$ states for each word, emission distribution is over output words
-
Most likely path generating sequence yields alignment
-
Symmetrization used for multi-word phrase alignment
-
All states are bidirectionally connected, including self-loops, each with a transition probability
-
Transition probabilities are based on a non-negative function $c$, on a jump width $i - i^{{\prime}}$
-
Neighboring words in input tend to neighbor in output
-
State-sequences not known
-
The Baum-Welch algorithm is used to train HMMs when the state sequences are unknown
-
Starts with arbitrary values
-
Estimate expected counts
-
Use expected to re-estimate
-
Repeat until converge
-
-
-
Decoder
-
Takes the input and outputs the translation maximizing the probability
-
Searches through candidate space
-
Can't use gradient descent as not continuous
-
Exponential number, beam search/stack decoding is used for efficiency
-
Initialize with null
-
Pop best
-
If complete, return, else
-
for each expansion, score and push on to stack
-
-
Cost is combination of probability, and approx cost of completion
Evaluation
-
Human-based: people evaluate based on fluency, fidelity, how much is necessary to "fix" output
-
BLEU
-
Given output and reference, measure closeness
-
Weighted average of $N$-gram overlaps with reference translations
-
Modified precision: each $n$-gram's count is clipped by its maximum count in any reference
-
$N$ grams for $N$ in 1–4
-
Brevity penalty also used, based on length, penalizing outputs shorter than the reference
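-
A sketch of sentence-level BLEU with clipped $n$-gram precision for $n = 1$–$4$ and the brevity penalty; real implementations add smoothing and corpus-level statistics:

import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Modified precision: clip each candidate n-gram by its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0                       # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on the mat", "the cat is on the mat"))   # 1.0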
-
Easy to calculate, widely used
-
Doesn't consider meaning
-
Doesn't handle syntax
-
Not as good as human judgement
-
Lecture 13: Coreference Resolution
-
Identify noun phrases referring to same entity, that is, those that corefer to the same referent
-
Can be nested, important for MT for correct pronouns
-
Based on discourse model, incremental from entities in text and their relationships, first mention evokes, later access
-
Can also map to the real-world, entity linking to ontology
Background
-
Can refer
-
indefinite noun phrases
-
definite noun phrases
-
pronouns
-
demonstratives
-
non-stated pronouns
-
names
-
-
Information status
-
new/old
-
New could be new or old to hearer (if part of common knowledge)
-
Old were introduced previously
-
Some are inferred, based on known properties
-
-
Non referring expressions exist
-
Appositives
-
Predicatives
-
Expletives
-
Generics
-
Tasks and datasets
-
Detect and cluster mentions
-
OntoNotes and ARRAU are common datasets
Mention Detection
-
Identify spans of text for each mention: retrieve candidate spans (e.g., all $n$-grams of length $\leq 10$) and filter
-
Filter extracted candidates based on noun phrases, possessives, named entities, etc
Architectures/Classifiers
-
Entity-based represents each entity in a discourse model
-
Mention-based consider mentions independently
-
Models can also use ranking to compare antecedents
-
Mention-pair takes pair of mentions as input, outputs binary classification of coreference
-
hand-coded, or neural network
-
Each pair used for training
-
Filter negatives; create a positive example for each pair of coreferents, and some $k$ negative training examples for each
-
Cluster mentions by pairing each with its top-scoring antecedent, and take the transitive closure
-
-
Mention ranking integrates resolution with anaphoricity detection by adding a null antecedent
-
Estimate probabilities
-
At test time, pair each mention with its top-scoring antecedent and take the transitive closure
-
-
Entity-based
-
Links mentions to discourse entities/mention clusters
-
Can use mention ranking approach, adding features
-
Or trains RNN to process clusters and combine hidden states
-
-
Handcoded features
-
Words/embeddings
-
Role
-
Length
-
Matching strings, etc
-
Neural Models
-
Span is contiguous sequence of $\leq 10$ words
-
For a span, estimate the probability that each previous span is its antecedent
-
$P(Y_{i}) = \frac{exp(s(i,y_{i}))}{\sum_{{y' \in Y(i)}}exp(s(i,y^{{\prime}}))}$
-
$s(i,j) = m(i) + m(j) + c(i,j)$
-
$m$ scores whether a span is a mention, $c$ scores whether span $j$ is an antecedent of span $i$
-
$m$ is a feed-forward neural network
-
similar for $c$
-
-
-
Feed-forward NNs take input of vector representation (bi-LSTM w/ attention output)
-
Train by maximizing the sum of probabilities of pairs in the ground truth
-
Minimizing cross-entropy
-
Evaluation
-
Precision and recall and combination, $F$-measure
-
$H$ is hypotheses, $R$ is references
-
Precision: $|H \cap R|/|H|$
-
Recall: $|H \cap R|/|R|$
-
$F$-measure: $2(\mathrm{precision})(\mathrm{recall})/(\mathrm{precision}+\mathrm{recall})$
-
MUC $F$-measure, H, R are links, pairs of mentions
-
$B^{3}$ computes precision/recall over individual mentions
-
Entity Linking
-
Connected to coreference resolution, connects real-world to ontology
-
Detect mentions, disambiguate with supervised learning
-
They can help each other
Winograd Schema Problems
-
Kind of a sentence completion
-
Want to connect ambiguous references
-
Integrating world knowledge can assist in these issues
-
The Winograd Schema Challenge includes coreference problems solvable with world knowledge but not (at the time) by automated methods
-
Somewhat solved with huge models
Gender Bias
-
Biases exist
-
Which pronoun connects more frequently with certain professions
-
Biased datasets
-
Gender-swapped dataset combined with original can help reduce
-
WinoBias, GAP datasets available
Lecture 14: Word Senses and WordNet
-
Dealing with ambiguity is hard
-
Words can have multiple senses (polysemous words), depending on context
-
WordNet is database of senses and relations
-
Word Sense Disambiguation determines the sense of a polysemous word based on context
Word Senses
-
Discrete representation of aspect of meaning
-
Discrete valued senses distinct from continuous embedding
-
Relationships between senses exist and can enhance semantics
-
Criteria for distinct senses: independent truth conditions, different syntactic behavior, independent sense relations; do they exhibit antagonistic meanings?
-
Zeugma is used for testing: "serve"
Relations Between Senses
-
Synonyms
-
Antonyms
-
Taxonomic, IS-A
-
Meronymy, PART-OF
-
Structured relationships, author/works-of-author
WordNet
-
150k+ words, nouns, verbs, adjectives, adverbs
-
Each has definition, synonyms, usage examples, etc
-
Synonym sets (synsets), of senses, based on senses, not words
-
Supersenses, categories of senses
-
Sense relations
-
Hypernyms, hyponyms
-
Instance hyper-/hyponyms
-
Meronyms/part meronyms
-
-
Can form a graph structure with labels on edges
Word Sense Disambiguation
-
Chooses correct sense based on context
-
WordNet, application-specific dictionary, translations, etc.
-
Semantic Concordance – corpus with words labeled with sense
-
SemCor labels Brown corpus with WordNet
-
Can map from wikipedia/wikidata/wiktionary
-
Evaluate with $F_{1} =2PR/(P+R)$ precision/recall based
-
Baseline is most-frequent sense
Embeddings: Contextual
-
ELMo or BERT often used for contextual embeddings
-
Embeddings depend on word's context
-
Different from context free embeddings like word2vec or GloVe
-
BERT trained by masking random words and predicting
-
-
For each token labeled with a specific sense, compute its BERT embedding; the sense embedding is the mean of these contextual embeddings
-
To predict a sense, compute the contextual embedding and take the nearest neighbor among the sense embeddings
-
Sense can be absent from training data, for each sense, look up others in synset, hypernyms, etc, and average embeddings!
Feature-Based
-
Supervised training of classifier
-
Features are words, embeddings, PoS around, etc.
Lesk algorithm
-
No model
-
Knowledge based
-
For $w$, choose the sense whose dictionary gloss overlaps most with the context.
Induction
-
Unsupervised
-
Induce set rather than using pre-defined
-
For each word in vocab
-
For each instance in corpus
-
Compute context vector
-
Cluster these vectors (each cluster is a sense of $w$)
-
Centroid/median vector is a sense vector for each sense $j$ of $w$
-
When processing, use nearest sense vector
-