Notes

Notes, etc. of Samuel Flint

Introduction to Natural Language Processing

Lecture 1: Introduction

  • What is meant?

    • Computers process structured data (databases, spreadsheets, etc)

    • Most data is unstructured. Easy for people, hard for machines

    • Studied since the 40s

    • Progress in large part due to large datasets (corpus linguistics) and ML

  • Applications

    • Part of speech tagging, identifying what's what in a sentence

    • information extraction, identifying named entities (locations, organizations), relations (Lincoln is a city in Nebraska) or times

    • Sentiment analysis, uncovering emotion/feeling in text

    • Question answering, taking natural questions, finding and returning an answer

    • Summarization, retrieving/reporting most important info in a document

    • Machine translation

    • dialogue, engaging in conversations with real users to find preferences/instructions and replying naturally

  • Approaches

    • Primarily leveraging ML (uses large training sets to build models)

    • Models used to classify/generate text, discriminating and predicting labels or output resembling training

  • Challenges

    • English is pretty tough!

    • Syntactically ambiguous

    • Semantically ambiguous

  • Overview

    • Topics introduced in Lecture

    • Homeworks for practice

    • Project is open-ended

    • Topics

      • $N$-gram models

      • Na\"ive Bayes & Semantic Analysis

      • Logistic Regression

      • Representations/Embeddings

      • RNNs & Attention

      • Information Extraction

      • QA

      • Summarization

      • Machine Translation

      • POS tagging

      • Dialogue Systems

Lecture 2: $N$-Gram Language Models

  • Many ways to model text

  • Many are probabilistic, modeling probability distributions over sentences

    • Compare probabilities to make decisions

  • Models trained from data sets (corpora) by counting occurrences and normalizing

    • Models based on $N$-grams, or word sequences of length $N$

    • Smoothing is used to prevent 0 probabilities

  • Overall: models, evaluation, and smoothing

  • Let $W = \langle w_{1}, \ldots, w_{n} \rangle$ be sequence of words

  • Goal is computation of joint probability

  • Related: Compute conditional probability of next word given prior words

  • Probabilistic model allows us to compute one of these

  • Recall the definition of conditional probability (the product rule), $P(A,B) = P(A)P(B | A)$

  • Chain rule of probability, $P(w_{1}, \ldots, w_{n}) = \prod_{i }P(w_{i} | w_{1}, \ldots w_{i - 1})$

  • Estimation is from counting and dividing (this is the frequentist approach), $C(W)$ counts occurrences in corpus $\mathcal{C}$

  • Problematic for long sequences: a long sequence may never occur in the corpus, giving probabilities that are zero or undefined

  • Instead, simplify and use bigrams!

    • First order Markov assumption – the probability of the next event depends only on the event that just happened

    • Generalizes to $n$th order assumption for tuples of $n + 1$ words

    • Simplest case is the unigram model, where words are treated individually; this is a 0th-order Markov model, where a word's probability is just its count over the size of the corpus

  • Berkeley Restaurant Project corpus has 9332 sentences and a vocabulary of 1446 words

  • Sometimes things have non-existent bigrams

  • Probabilities then, can be calculated easily. Note, include start/end tokens

  • Probabilities can eventually cause underflow when multiplying!

    • Instead, add logs of probabilities

    • Since we don't generally care about the probability values themselves, only their comparison, and comparison is preserved under logs

  • Works as long as $N$ isn't too large

    • But loses long-range dependencies!

    • Mostly an okay thing and long-range dependencies can be captured with RNNs

    • Toolkits and datasets: SRILM, KenLM, Google Books $N$-grams

  • Evaluation

    • Training set used to compute probabilities

    • Goal is to train such that model generalizes well!

    • Estimate using a separate test set, which is ideally unbiased

    • Performance on this is what we expect to see in practice

    • Don't test on training data!!!!!!!

    • Data is often split around 80/20 for training/test (90/10 also)

  • Common way to evaluate on test set

    • Perplexity $PP(W)$ is the inverse probability normalized by the number of words $N$, $PP(W) = \sqrt[N]{1/P(w_{1}, \ldots, w_{N})}$; for bigrams, similarly, $PP(W) = \sqrt[N]{\prod_{i} 1/P(w_{i} | w_{i-1})}$

    • Minimizing perplexity on test set, maximizes test data's probability

  • Models can also be used to generate text!

    • Generate a random first word $w_{1}$ according to $P(w_{1} | \langle s \rangle)$

    • Generate bigram according to probability, and continue until we generate an end token.

  • Larger $n$ valued models are more coherent, and it's important to select an appropriate corpus!

  • Valid $n$ grams may not appear in training sets!

    • Smoothing allows prior information to be incorporated into the model before data is used to update it

    • Laplace Smoothing

      • If zero probabilities come from zero counts, then ensure no counts are zero!

      • When counting $n$-grams, initialize each counter to 1 rather than 0 (pseudocounts)

      • Probability estimates are based on the modified counts, giving the Laplace form $P_{Laplace}(w_{i}) = \frac{C(w_{i}) + 1}{N + V}$, where $N$ is the number of tokens in the corpus and $V$ is the vocabulary size

      • Amount added can be $k \neq 1$ depending on weight of the prior

    • Backoff: use the trigram if its count is non-zero, else back off to the bigram or lower-order model

    • Interpolation, mixes unigram, bigram and trigram models

      • Fix constant parameters $\lambda_{i}$ to weight each term where $\sum_{i} \lambda_{i} = 1$

      • Or have values depend on context

      • Tune $\lambda$ on independent hold-out set/use cross-validation

    • Kneser-Ney Smoothing

      • Most common method

      • Based on absolute discounting

      • Redistributes probability mass from common to less common, simply subtracting $d = 0.75$ works well, and then add interpolation on unigrams.

      • Extends by including a term which models probability that $w$ will be in a novel continuation, how many words will pair with $w$

      • Final formula kinda long (slide 27)

    • Stupid Backoff

      • Stupid as it doesn't care about renormalizing to a true probability distribution

      • Back off to shorter $n$-grams if counts = 0, and don't renormalize since it takes too much time
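
As a concrete illustration of the counting, Laplace smoothing, log-probability, and perplexity ideas above, here is a minimal bigram language model sketch in Python; the toy corpus and all names are my own, not from the lecture.

  from collections import Counter
  from math import log, exp

  # toy corpus with start/end tokens (assumption, for illustration)
  corpus = [["<s>", "i", "want", "chinese", "food", "</s>"],
            ["<s>", "i", "want", "english", "food", "</s>"]]

  unigrams = Counter(w for sent in corpus for w in sent)
  bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
  V = len(unigrams)  # vocabulary size

  def p_laplace(w_prev, w):
      """Add-one smoothed bigram probability P(w | w_prev)."""
      return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

  def log_prob(sent):
      """Sum of log probabilities (adding logs avoids underflow from multiplying)."""
      return sum(log(p_laplace(sent[i], sent[i + 1])) for i in range(len(sent) - 1))

  def perplexity(sent):
      n = len(sent) - 1  # number of predicted tokens
      return exp(-log_prob(sent) / n)

  print(perplexity(["<s>", "i", "want", "chinese", "food", "</s>"]))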

Lecture 3: Na\"ive Bayes & Sentiment Analysis

  • Classification is core area of ML & NLP

  • Many tasks, includes spam filtering & sentiment analysis

  • Simple generative model, naïve Bayes, used for classification

Discriminative vs Generative

  • Discriminative models take an instance as input, and predict a label

    • Real-valued labels are regressions

    • Finite-set labels are classifications

    • Training comes from a training set with true labels (supervised learning)

    • Then searches set of possible models optimizing for some objective (minimizing prediction error, etc.)

    • Many model types

      • Decision trees

      • ANNs (discussed later)

      • SVMs

      • nearest-neighbor

  • Generative models define probability distributions

    • Generate new instances resembling training data

    • HMMs, Bayesian networks, probabilistic models, naïve Bayes

    • Trained via unsupervised learning, that is, without labels; note that if labels are available, a generative model can be trained for each class

    • Can be used to discriminate by estimating the probability that a given instance was generated by each class

Classification

  • discriminative models classify text into finite set of categories

  • Subject classification, spam filtering, author identification

  • Given training set output a classifier $\gamma: \mathcal{D} \rightarrow \mathcal{C}$ predicting labels of new documents

  • Representing each document is important

    • Option: each document is a bag of words with counts associated with them

      • Simple model, loses position information

      • May need to drop some things

Na\"ive Bayes

  • Recall Bayes' Theorem: $P(c|d) = \frac{P(d|c)P(c)}{P(d)}$

    • The probability of the class given the document

    • $P(c)$ is the prior probability of label

    • $P(d)$ is the probability of the document

    • $P(c|d)$ is probability of $c$ given $d$

    • $P(d|c)$ is probability of $d$ given $c$

  • Product Rule $P(A \land B) = P(A | B)P(B) = P(B | A)P(A)$

  • Sum Rule $P(A \lor B) = P(A) + P(B) - P(A \land B)$

  • Total probability

    • If events are mutually exclusive, and sum of probabilities = 1,

    • $P(B) = \sum P(B|A_{i})P(A_{i})$

  • MAP/Maximum a Posteriori classification

    • Given document, our goal is to find the most likely label

    • Done using argmax

    • $\mathrm{argmax}_{c \in C} P(c|d)$, or, assuming a feature representation, $\mathrm{argmax}_{c \in C} P(x_{1}, \ldots, x_{n} | c)P(c)$

    • $P(c)$ is easily estimated, fraction of documents that are class $c$

    • Estimating $P(x_{1}, \ldots, x_{n} | c)$ is harder, as the number of combinations is exponential, and there won't be sufficient data

  • Make the Naive Bayes Assumption!

    • $P(x_{1}, x_{2}, \ldots, x_{n} | c) = \prod_{i} P(x_{i}|c)$

    • Assumes, of course, that feature values are conditionally independent given the class

    • Classifier is $c_{NB} = \mathrm{argmax}_{c \in C} P(c)\prod_{i} P(x_{i}|c)$

    • Values needed are polynomial!

    • Standard method, and incredibly practical

    • Works well when you have moderate/large training sets

    • and when the attributes that describe instances are conditionally independent given the class

    • Learn Algorithm (See slide 15)

      • Estimate $P(c)$, call $\hat{P}(c)$

      • For each word, $w_{i}$ in vocabulary, estimate, $\hat{P}(w_{i}|c) = \frac{count(w_{i},c)}{\sum_{w\in V} count(w,c)}$

    • Classify:

      • Compute above $c_{NB}$ using $\hat{P}$

    • Smoothing done using Laplace smoothing with Pseudocounts!
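
A minimal sketch of the bag-of-words naïve Bayes learn/classify procedure above, with add-one pseudocounts; the tiny labeled training set is an illustrative assumption.

  from collections import Counter, defaultdict
  from math import log

  # toy labeled documents (assumption, for illustration)
  train = [("pos", "great fun great plot"),
           ("pos", "fun film"),
           ("neg", "boring plot no fun")]

  class_docs = Counter(c for c, _ in train)
  word_counts = defaultdict(Counter)
  for c, doc in train:
      word_counts[c].update(doc.split())
  vocab = {w for c in word_counts for w in word_counts[c]}

  def nb_classify(doc):
      """argmax_c log P(c) + sum_i log P(w_i | c), with add-one smoothing."""
      scores = {}
      for c in class_docs:
          log_prior = log(class_docs[c] / len(train))
          denom = sum(word_counts[c].values()) + len(vocab)
          log_like = sum(log((word_counts[c][w] + 1) / denom)
                         for w in doc.split() if w in vocab)  # skip unknown words
          scores[c] = log_prior + log_like
      return max(scores, key=scores.get)

  print(nb_classify("fun plot"))  # -> 'pos'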

Sentiment Analysis

  • Replace count-based bag-of-words with binary indicators

  • Negations have to be handled

Lecture 4: Na\"ive Bayes & Sentiment Analysis, Continued

Evaluating Classifiers

  • Goal: generalization, and estimating performance

  • But, other goals possible

    • Estimate generalization performance of model/hypothesis

    • Tuning parameters

    • Comparing two algorithms on specific task

  • Can't answer with perfect certainty

    • An estimator must be chosen, its distribution determined, and a bound (interval) placed on it

    • Must work regardless of distribution of training/test data

  • Statistical variation related to specific application is significant

  • Plan experiments well, don't test on training data!!

    • It biases the performance estimator

    • And don't let duplicates into training/testing sets if using data augmentation

    • Holds for validation set used for early stopping/tuning parameters!

  • Confidence intervals are used

    • $error_{\mathcal{D}}(h)$ is the 0-1 loss of hypothesis $h$ on instances drawn according to $\mathcal{D}$. If the test set has $N$ examples drawn independently (of $h$)

    • 0-1 loss is the probability that $h$ makes mistake

    • and $N \geq 30$

    • Then, with 95% probability, the true error lies in $error_{\mathcal{V}}(h) \pm 1.96\sqrt{\frac{error_{\mathcal{V}}(h)(1 - error_{\mathcal{V}}(h))}{N}}$

    • So, if the hypothesis misclassifies 12 of the 40 examples, with 95% confidence, the error is in that range.

    • Confidence interval can be generalized by changing real-number coefficient

      • Based on binomial distributions, mean error of an unbiased estimator

      • Can approximate binomials by normal distributions

      • And so, then, based on standard deviations in normal distribution

      • Can be justified due to the Central Limit Theorem

    • Steps

      1. Pick Parameter to estimate

      2. Choose Estimator

      3. Determine Probability Distribution

      4. Find appropriate interval
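
A small sketch of the interval computation above, reproducing the 12-of-40 example with the 1.96 coefficient (my own arithmetic helper, not code from the course).

  from math import sqrt

  def error_ci(mistakes, n, z=1.96):
      """Approximate 95% CI for the true error from test-set error (binomial -> normal)."""
      err = mistakes / n
      half = z * sqrt(err * (1 - err) / n)
      return err, (err - half, err + half)

  print(error_ci(12, 40))  # error 0.30, interval roughly (0.158, 0.442)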

Comparing Learning Algorithms

  • Can't just compare error rates on single test set

  • Use instead $K$-fold cross validation and paired $t$-test

K-fold cross-validation

  • Partition the data set into $K$ equal-sized subsets $X_{1}, \ldots, X_{K}$, each of size at least 30

  • For $i$ from 1 to $K$

    • use $X_{i}$ for testing

    • Let the training set be everything except $X_{i}$

    • Train each algorithm

    • Let $p_{i}^{j}$ be algorithm $j$'s error on the test set

    • $p_{i} = p_{i}^{1} - p_{i}^{2}$

  • Error difference is estimated as $p = \frac{1}{K}\sum_{i} p_{i}$

  • Estimate confidence that true error is < 0.

  • Confidence that $L_{1}$ is better than $L_{2}$ on the task

  • One-sided test with confidence from $t$ distribution with $K - 1$ DOF

  • Then calculate, true difference of expected error is $p + t_{c,K-1} s_{p}$

    • $s_{p} = \sqrt{\frac{1}{K(K-1)} \sum_{i=1}^{K} (p_{i} - p)^{2}}$

    • Then, check if $p < - t_{c,K-1} s_{p}$

  • Must understand that showing one learning algorithm performs better at 95% confidence still leaves a 5% chance of being wrong

    • Over multiple such comparisons this compounds: e.g., with four tests the chance that at least one conclusion is wrong is roughly 18.5% ($1 - 0.95^{4}$)

    • Must account for this, or use a more appropriate test!
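
A sketch of the $K$-fold paired difference test described above; the per-fold error lists are placeholders, and the one-sided 95% critical value for $K - 1 = 9$ degrees of freedom is hard-coded rather than looked up.

  from math import sqrt

  # placeholder per-fold error rates for two learners (assumption, for illustration)
  errA = [0.12, 0.15, 0.11, 0.13, 0.14, 0.12, 0.16, 0.13, 0.12, 0.14]
  errB = [0.14, 0.16, 0.13, 0.15, 0.15, 0.14, 0.17, 0.15, 0.13, 0.16]

  K = len(errA)
  p = [a - b for a, b in zip(errA, errB)]          # p_i = p_i^1 - p_i^2
  p_bar = sum(p) / K
  s_p = sqrt(sum((pi - p_bar) ** 2 for pi in p) / (K * (K - 1)))

  t_crit = 1.833   # one-sided 95% critical value for K - 1 = 9 degrees of freedom
  if p_bar < -t_crit * s_p:
      print("A has significantly lower error than B at 95% confidence")
  else:
      print("difference not significant")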

More specific measures

  • Prediction error may not tell the whole story!

  • Imbalanced test data (e.g., mostly negatives) may mean that accuracy isn't informative

  • Confusion matrices can be used instead

    • True class vs predicted class

    • Cells for true positives, false positives, true negatives, false negatives

    • Generalizes to many classes

    • Normalized to fractions and represented as heat map

  • Precision-recall curves

    • Precision is fraction of retrieved that are positive

    • recall is fraction of positives retrieved

    • Can vary threshold to trade precision for recall

    • Compare curves based on containment

    • More suitable than ROC curves when there are large numbers of negatives

    • Uses $F_{\beta}$ to combine at a specific point

Lecture 5: Logistic Regression

  • Recall Na\"ive Bayes, which is generative

  • Logistic regression is discriminative

  • Computes a weighted sum of an instance's features and passes it through a sigmoid or softmax, giving the probability that $x$ belongs to class $c$

  • Output is probability, performance measured with cross-entropy, a loss function

  • Optimize using stochastic gradient descent

  • Logistic models built into neural nets for deep learning

Sigmoid

  • Each instance is a vector $x = [x_{1}, x_{2}, \ldots, x_{n}]$ of features

  • Supervised training, each instance has label, $y \in \{0, 1\}$

  • Output of model on input is $P(y = 1 | x)$

  • Weighted sum $z = wx+b$, where $w_{i}$ is feature weight, $b$ is the bias term

  • Final output is $P(y = 1 | x) = \sigma(z) = 1/(1 + e^{{-z}})$

  • Discrete valued prediction thresholded at 1/2

  • Called logistic/sigmoid function

  • Squashes $z$ into $[0, 1]$! (sometimes in older stuff called a "squashing function")

  • Assume classifier is trained with given values and biases. (5.1.1 from textbook)

    • Compute weighted sum, run through logistic function, and compare to threshold

Training

  • $w$ and $b$ must be discovered

  • Given labeled training corpus, map documents to feature vectors, and attach label

  • Initialize $w$ and $b$ randomly!

  • Evaluate each training vector predicting $P(1|x)$ and use loss function $L$ to compare prediction to true label

  • stochastic gradient descent used to update $w$ and $b$, reducing $L$ on $x$

  • Repeat

Loss Functions

  • Many available

  • Cross-entropy used for classification/prediction of probabilities

  • $\hat{y} = \sigma(wx + b)$ is the prediction, $y \in \{0, 1\}$ is true label

  • $L_{{CE}}(\hat{y}, y) = -(y \log \hat{y} + (1 - y)\log(1 - \hat{y}))$

  • Since $y$ is binary, one term is $0$ and $L_{CE}$ is negative of log of probability of correct class. Larger predicted probability, the smaller the loss!

Stochastic Gradient Descent

  • Goal is to find params, $(w, b)$ or, $\theta$ overall and minimize average loss on entire set!

  • $\theta^{*} = \mathrm{argmin}_{\theta}\frac{1}{m} \sum_{i = 1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}; \theta)$

  • Since $L_{CE}$ is continuously differentiable by construction, gradient descent can be applied

  • Consider a linear unit: output is weighted sum of inputs with no $\sigma$ applied (weighted sum, bias, no sigmoid!)

  • If there's only one feature, this is linear regression, that is, finding the straight line that best fits the training data

    • Let pass through origin, $b = 0$

    • Slope is just a single parameter, $w_{1}$

    • $J(w_{1}) = \sum_{i}L(\hat{y}^{{(i)}}, y^{{(i)}}; w_{1})$

    • Goal is to minimize $J(w_{1})$

    • Writing generally, can be plotted and calculated.

    • Take derivative & solve for 0

  • Calculated using $w^{{\prime}} = w - \eta \nabla J(w)$

    • $\eta$ is the learning rate

    • Gradient, $\nabla J(w)$, is partial derivatives $\left[ \frac{\delta J}{\delta w_{i}}\right]_{i=1}^{n}$

      • $\frac{\delta J}{\delta w_{i}}$ is a partial derivative is how much a change in $w_i$ changes $J$

    • $J$ is error, compute gradient of $J$ with respect to parameters, apply gradient descent

      • Cross-entropy loss function with respect to each weight (see textbook for the derivation)

      • $x_{j}(\sigma(wx+b) - y)$ is the partial derivative for single instance

      • mean of partials for multiple instances

      • Plug into formula to get weight update

      • Update is smaller if prediction is closer

      • The larger the magnitude of feature $x_{j}$, the more responsibility $w_{j}$ takes for the error

      • If $\hat{y} > y$, $wx + b$ needs to decrease

  • Standard GD, mean of vectors across entire set

    • Estimate of the expected gradient on a random instance

    • Can require huge memory

  • Stochastic gradient descent randomly partitions the training set into minibatches of size $m$

    • Compute average per minibatch

    • Update per minibatch

    • Each pass through all is an epoch

    • Saves memory

    • Gradient estimates have high variance, avoid getting stuck in local minima in non-convex situations
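
A compact numpy sketch of binary logistic regression trained with minibatch SGD using the gradient $x_{j}(\sigma(wx+b) - y)$ given above; the data, learning rate, and batch size are arbitrary illustrative choices.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 5))                       # toy features (assumption)
  y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # toy labels

  w, b = np.zeros(5), 0.0
  eta, m, epochs = 0.1, 32, 20                        # learning rate, minibatch size

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  for _ in range(epochs):
      order = rng.permutation(len(X))                 # shuffle, then form minibatches
      for start in range(0, len(X), m):
          idx = order[start:start + m]
          xb, yb = X[idx], y[idx]
          err = sigmoid(xb @ w + b) - yb              # (sigma(wx + b) - y) per instance
          w -= eta * (xb.T @ err) / len(idx)          # mean of x_j * err over the minibatch
          b -= eta * err.mean()

  pred = sigmoid(X @ w + b) > 0.5                     # threshold at 1/2
  print("training accuracy:", (pred == y).mean())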

Regularization

  • Is perfect fit the best model?

  • Recall, the goal is to generalize well, there may be noise in data afterall

  • Perfect fits tend not to generalize well

  • Training accuracy vs simplicity

  • Want both accuracy and simplicity

  • $h$ is a model; training error on $\mathcal{X}$, $error_{\mathcal{X}}(h)$, is the sum of the loss function over the dataset divided by its size

  • Generalization error, $error_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[L(h(x), y)]$

  • Model $h$ overfits if there's an alternative model $h^{\prime}$ such that $error_{\mathcal{X}}(h) < error_{\mathcal{X}}(h^{\prime})$ and $error_{\mathcal{D}}(h) > error_{\mathcal{D}}(h^{\prime})$

Regularizers

  • When many params, more likely to overfit

  • Training adds regularizer, $R(\theta)$ to argmin, weighted with $\alpha \geq 0$

  • L2: the sum of squared weights (squared L2 norm)

  • L1: the sum of absolute weight values (L1 norm)

Multiclass

  • Done with "Softmax", when $> 2$, one set of weights and one set bias for each class

  • Then, $z_{c} = w_{c} x + b_{c}$

  • Probability distribution calculated using softmax, not $\sigma$

    • $\mathrm{softmax}(z_{c}) = \frac{\exp(z_{c})}{\sum_{c^{\prime} \in C} \exp(z_{c^{\prime}})}$

    • Cross-entropy generalizes to the multiclass case, with $y$ a distribution over classes

    • Labeled with one-hot vector

    • When training, $\delta(-\log \hat{y}_{c^{*}})/\delta w_{i}$ for each weight, update with SGD
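
A small sketch of the softmax step for the multiclass case above, with one weight vector and one bias per class; the weights and input are placeholder values.

  import numpy as np

  def softmax(z):
      z = z - z.max()                  # subtract max for numerical stability
      e = np.exp(z)
      return e / e.sum()

  # one weight vector w_c and bias b_c per class (toy values, assumption)
  W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])   # 3 classes x 2 features
  b = np.array([0.1, 0.0, -0.1])
  x = np.array([1.0, 2.0])

  z = W @ x + b                        # z_c = w_c . x + b_c
  print(softmax(z))                    # probability distribution over the 3 classes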

Lecture 6: Vector Semantics and Embeddings

  • Vector representations of words

  • Success of tasks depends on document/word representations

  • Bag-of-words representation from ch4, engineered features from ch5

    • BOW, word corresponds to a dimension in a $|V|$-dimensional vector

  • Dense vector representation

    • embedded representation, represents words in numeric vectors of smaller (10, 50, maybe more) dimensions

    • Want similar words to cluster

    • Similar contexts imply similar meanings

    • Learn embedding by grouping words based on context

  • Consider: lexical, vector semantics; words and vectors; cosine similarity; TF-IDF; Word2Vec

Lexical Semantics

  • How do words relate?

  • Synonymy – similar meanings

  • Antonymy – opposing meanings

  • Similarity – based on what is similar, "car" and "bicycle"

  • Relatedness – not similar, relating to each other, "coffee", "cup"

  • Semantic field – covering some sort of semantic domain

Vector Semantics

  • Want to numerically represent words such that those that share some sort of lexical semantics cluster

  • Because regression models (and NNs) require numeric inputs!

  • Trained model that predicts positive when vanish appears should respond similarly to disappear

  • Representations grouping words by context

  • Two-dimensional projection of a 60-dim embedding (t-SNE)

Words & Vectors

  • Term-Document matrix, $|V| \times |D|$ matrix. Column $d$ is $d$'s BOW representation

  • Even bag of words representation clusters!

  • Rows of TD matrices can represent words

    • Context then is the set of documents that it appears in and how often

    • Similar words will have similar representation

  • Word-Word or Term-Context matrix can be used ($|V| \times |V|$)

    • $(i,j)$ hold the number of times $i$ and $j$ appear in the same context

      • Context can be document, fixed window, etc.

Cosine Similarity

  • Compute similarity of words/documents

  • Often, the cosine of the angle \[\frac{v \cdot w}{\Vert v\Vert \Vert w \Vert} = \frac{\sum_{i = 1}^{N} v_{i} w_{i}}{\sqrt{\sum v_{i}^{2}}\sqrt{\sum w_{i}^{2}}}\]

  • Normalize vectors by dividing by length and take dot product of unit vectors

    • 1 if identical, 0 if orthogonal
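
The cosine similarity formula above, directly in numpy; the vectors are placeholders.

  import numpy as np

  def cosine(v, w):
      """Cosine of the angle between v and w: 1 if same direction, 0 if orthogonal."""
      return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

  v = np.array([1.0, 2.0, 0.0])
  w = np.array([2.0, 4.0, 0.0])
  print(cosine(v, w))   # 1.0 (same direction)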

TF-IDF: Term Frequency-Inverse Document Frequency

  • Must be careful about context definition

  • Some words appear frequently in any corpus

  • Can inflate similarity measures!

  • Weights terms by frequency, reducing weight if appearing in many documents

  • Term frequency is the log of the count of $t$ in $d$, plus 1 to avoid taking the log of zero \[\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t, d) + 1)\]

  • Log used to differentiate between orders of magnitude \[\mathrm{idf}_{t} = \log_{10}\left( \frac{N}{\mathrm{df}_{t}}\right)\]

    • $N$ is the number of documents

    • $\mathrm{df}_{t}$ is the number containing $t$

  • Weight of $t, d$ is $w_{{t,d}} = (\mathrm{tf}_{{t,d}})(\mathrm{idf}_{t})$

  • Word-word matrix then weight by tf-idf

  • Vector for term computed as normal

  • normalize/compute cosine similarity
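
A sketch of tf-idf weighting as described above; the toy documents are my own, and note how a term appearing in every document gets weight zero.

  import math
  from collections import Counter

  docs = ["sweet sweet nurse love", "sweet sorrow", "how sweet is love"]  # toy docs (assumption)
  tokenized = [d.split() for d in docs]
  N = len(docs)
  vocab = sorted({w for d in tokenized for w in d})
  df = {t: sum(1 for d in tokenized if t in d) for t in vocab}

  def tfidf(t, doc_tokens):
      tf = math.log10(doc_tokens.count(t) + 1)       # tf = log10(count + 1)
      idf = math.log10(N / df[t])                    # idf = log10(N / df_t)
      return tf * idf

  for d in tokenized:
      print({t: round(tfidf(t, d), 3) for t in vocab if t in d})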

Word2Vec

  • From Mikolov et al.

  • vectors from tf-idf long, sparse

  • Dense, short vectors preferred

    • Fewer params to learn!

    • Captures synonymy better!

    • Turns out, it works better in practice!

Intuition

  • Given $w$, rather than count how often it appears near a word

  • Predict how likely to appear near it

  • Trainable using logistic regression

  • Supervised, since any word appearing near another is a positive example

  • Negative sampling also needed

  • Weights can be used as embeddings

Training

  • Given $t$

  • and context words $c$ as positive example

  • Sample negative examples from the unigram distribution, with counts raised to a power $\alpha < 1$ (e.g., 0.75) and renormalized, so rarer words are sampled somewhat more often

  • Train with logistic regression, random weights

  • Final weights are given embeddings

  • Then, given $t$ and $c$, predict the probability they appear near each other. $t$ and $c$ are represented as embeddings; similarity is the dot product, and the probability comes from the logistic (sigmoid) function

    • Probability of a negative pair is 1 minus the above

  • Given positive and negative pairs, maximize the sum of the log probabilities of the positive pairs plus the sum of the log probabilities that the negative pairs do not co-occur

    • Optimizing finding an embedding maximizing $c \cdot t$, and minimizing $n \cdot t$ for all $t$

  • $w$ can be both target and context. Training gives two embedding matrices, a target matrix and a context matrix

    • These are the regression parameters, row $w$ from $T$ is $d$-dim representation of word, column $w$ of $C$ is $d$-dim representation

    • $T$ is often final embedding

    • concatenation could be option
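
A sketch of a single skip-gram-with-negative-sampling update consistent with the description above (dot-product similarity, sigmoid probability, positive pairs pushed together and sampled negatives pushed apart); vocabulary size, dimension, and the sampled indices are placeholders.

  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 1000, 50                       # vocabulary size, embedding dimension (assumptions)
  T = 0.1 * rng.normal(size=(V, d))     # target embeddings
  C = 0.1 * rng.normal(size=(V, d))     # context embeddings
  eta = 0.05

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def sgns_step(t_idx, c_idx, neg_idx):
      """One SGD step on a (target, context) pair with k sampled negative contexts."""
      t = T[t_idx]
      # positive pair: push sigma(c . t) toward 1
      g = sigmoid(C[c_idx] @ t) - 1.0
      grad_t = g * C[c_idx]
      C[c_idx] -= eta * g * t
      # negative pairs: push sigma(n . t) toward 0
      for n in neg_idx:
          g = sigmoid(C[n] @ t)
          grad_t += g * C[n]
          C[n] -= eta * g * t
      T[t_idx] -= eta * grad_t

  sgns_step(t_idx=3, c_idx=17, neg_idx=rng.integers(0, V, size=5))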

Visualization

  • High dimension representations are hard to visualize

  • Hierarchically clustered dendrograms are an option

  • t-SNE projects to 2 dimensional space

Semantics

  • Embeddings can allow semantic arithmetic

    • $(king - man) + woman \approx queen$

  • Adjective to comparative to superlative

  • Country-capital common distance

  • Can reveal biases!

Miscellaneous

  • word2vec applied to Materials Science paper abstracts

    • Embeddings clustered in ways matching groupings in the periodic table

    • Predicted properties of materials before scientists discovered them

  • .*2vec exists and extends beyond text

    • node2vec embeds representation of graphs

    • Context must be defined for embedding

Presentations: Project Ideas

Deleting

  • Movie Critic Analysis

    • A form of sentiment analysis, but perhaps on a continuum

    • Training a model to discriminate between professional and amateur critics

    • IMDb and Stanford Sentiment Treebank datasets

  • Music Analysis

    • Why are some songs more popular than others

    • Why things related to ranking happen

    • Sentiment analysis on lyrics

      • predictive from sentiment, etc.

    • Datasets from Billboard charts

  • Professor Ratings

    • Trends, correlation with language

    • RateMyProfessor datasets, but most focus on correlation between numerical scores

Scooby Data Doo

  • Intent Classification

    • Assigns document intent

    • Useful for businesses to provide service, chatbots, automated assistants

  • Toxic Comment Classification

    • More interactions online, but people are kinda toxic

    • Understanding interaction is beneficial in making the internet a safer place

    • Filtering social networking sites

    • Understand how hate speech etc., spread through network sites

Google Translate 2.0

  • Sentiment Analysis

    • Marketing use

    • Employee satisfaction

    • real-time information

    • public opinion

    • spam detection

  • Dialog Generation

    • Goal-oriented

    • Conversational systems

    • Often a mix of the two

    • Customer Service

    • Chatbots

    • Blended approach

    • Pattern recognition, state machines, Markov processes, vector space, neural networks

  • Grammar Correction

    • Spelling, grammar, punctuation errors

    • Sequence tagging

    • CNNs

    • transformer-models

Proof By Intimidation

  • Taxonomy Translation

    • Experts use special, domain ontologies for efficient communication, jargon

    • Developing models that can be used to translate from experts to non-expert

    • For a domain, learn an ontology for experts; then a general, for laymen; find connections between the two

  • Common Sense

    • Reasoning and Validation in language

    • Uses understanding and generation

    • High accuracy, only text-to-text

    • Applicable in many areas: taxonomic reasoning, recommendation systems, chatbots, MT

Latural Nanguage

  • Speech processing

    • Recognizing spoken language

  • Semantic Analysis

    • Understanding meaning

  • Lemmatization

    • De-inflecting words, returning to original form

  • Text Classification

    • Categorization

  • Summarization

    • Simplify document while communicating core message

    • Not always generative

    • Removing redundant information

    • Abstracting/extracting

  • Sentiment Analysis

    • Exactly what's been previously covered

    • Aspect-based analysis, looks for multiple indicators/sentiments

Homo Bayesians

  • Data to text generation

    • auto-correct/text completion

    • Various sorts of neural models, BOW, RNNs w/ Attention, etc

    • Text to SQL, etc

  • Dialogue State Tracking

    • Component of dialogue system

    • Various ways of handling state

    • Belief Tracking

Bricc Squad

  • POS tagging

    • Solves other NLP problems

    • Helps with language learning

    • Meaning between homophones

    • Lexical

    • Rule-based

    • Probabilistic

    • Deep Learning

  • Missing Elements

    • Filling things in

    • Missing elements in foreign languages

    • Fused head identification (find a phrase missing its noun; rule-based or ML)

    • Fused head resolution (finding in context)

    • Bridging anaphora

    • Bridging anaphora resolution

Lecture 7: Neural Networks and Neural Language Models

  • Consider human neural nets

    • $10^{10}$ neurons

    • $10^{-3}$ s switching time for neurons vs. $10^{-10}$ s in silicon

    • Connections per neuron, $10^{4}$ to $10^{5}$

    • Scene recognition is incredibly short

    • 100 inference steps, which doesn't seem like enough

    • But computation is massively parallel

  • Goal is to simulate this

    • using neuron-like switching

    • weighted interconnections among units

    • highly parallel, distributed

    • Want to tune weights automatically

    • Many layers learn to extract features, rather than manually creating these features

    • Note: major differences between ANNs for ML and ANNs for modeling biology

  • Used

    • high dimensional input, discrete or real-valued

    • output can be discrete or real-valued

    • Output is vector

    • Data may be noisy

    • Target function form is unknown

    • Readability is unimportant, and long training is acceptable

  • History

    • Linear units, perceptron algorithms (1940)

      • couldn't handle data not linearly separable

      • Couldn't train multi-layer networks, but aware of usefulness

    • Multilayer networks trained with Backpropagation (1980)

      • Replaced with SVMs, Boosting in 1990s

    • Deep Architectures (2000s)

      • Better hardware, software support for deep networks

      • Uses backprop with larger datasets, algorithm improvements

        • improve performance considerably

      • Impressive applications

    • The Inevitable — Skynet, TBD

Basic Units

  • Takes weighted sum of input, $z$

  • Passes through non-linear activation function $f$

  • If $f$ is logistic, it's the same as logistic regression function

  • Single nodes are limited in what they represent

  • Linear Threshold Unit, the perceptron

    • Activation thresholds weighted sum at 0

    • Represents things like $\land$ well

    • Other functions aren't representable!

    • Because they're not linearly separable, XOR

    • Would be a network of units

The XOR Problem

  • XOR, intersection of two separators

    • $g_{1}(x) = 1x_{1} + 1x_{2} - 1/2$

    • $g_{2}(x) = 1x_{1} + 1x_{2} - 3/2$

    • So, $z_{i} = 0$ if $g_{i}(x) < 0$, and $z_{i} = 1$ otherwise

    • Now, $g(z) = 1z_{1} - 2z_{2} - 1/2$

  • Remaps vectors $x$ to $z$ so that they are separable in the new space (worked through in a sketch at the end of this section)

    • Input layer, output layer

    • hidden layer is $g_{1}$, $g_{2}$

    • Two-layer perceptron/feed-forward neural network

    • Non-linear activation in hidden output

    • Many dozens of hidden layers, several output nodes

  • Activation functions: $z$ is weighted sum

    • Linear unit

    • Sigmoid/Logistic

    • Softmax (generalizes sigmoid)

    • Tanh

  • Hidden nodes

    • Rectified Linear Unit, $\max(0, z)$

      • Good default

      • Variants to handle $<0$ differently
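
A direct check of the two-layer XOR construction earlier in this section, using the given weights for $g_{1}$, $g_{2}$, and $g$ with a hard threshold as the non-linearity; purely illustrative.

  def step(v):
      return 1 if v >= 0 else 0

  def xor_net(x1, x2):
      # hidden layer: z_i = step(g_i(x))
      z1 = step(1 * x1 + 1 * x2 - 0.5)     # OR-like separator g_1
      z2 = step(1 * x1 + 1 * x2 - 1.5)     # AND-like separator g_2
      # output layer: g(z) = 1*z1 - 2*z2 - 0.5
      return step(1 * z1 - 2 * z2 - 0.5)

  for a in (0, 1):
      for b in (0, 1):
          print(a, b, "->", xor_net(a, b))   # reproduces the XOR truth table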

Backprop

  • Training multiple layers means tuning parameters in all

  • Uses gradient descent minimizing a loss function

  • Cross-entropy is a common loss function

    • $L_{{CE}}(\hat{y},y) = -\sum_{{c \in C}}y_{{c}}\log\hat{y}_{{c}} = -\log\hat{y}_{{c^{*}}}$

  • Must compute gradient of loss function for each param, even several layers from output

  • Backprop feeds forward inputs to outputs, propagates back error via chain rule for derivatives

    • Simply, modularly decomposed

Computation Graph

  • Given complicated function, want to know partial derivative with respect to parameters

  • $f$ is represented via modular fashion

  • Decompose gradient into basic op

    • $\frac{\delta f}{\delta f} = 1$

    • Then take partials with respect to the edges leading in: $\delta f/\delta a = 1$, because the partial derivative of a sum with respect to each term is 1

    • Then, with respect to $x_0$ is $1w_0 = 3$ (via chain rule)

    • For $x = [1.0, 4.0]$, $\nabla_{w} f = [1.0, 4.0]$

  • Then, for $x = [1.0, 4.0]$ for $f(x) = \sigma(x)$, $\nabla_{w} f = [0.1966, 0.7866]$ (once it's done all the way through)

  • Modular, once we have a formula for gradient, we can apply anywhere in larger graph

  • Generalize to $J(w) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_{i}^{t} - y_{i}^{t})^{2}$

    • $w_{{5*}}$ and $w_{{6*}}$ tie to output units

    • Gradient/weight updates as before, then

      • $w^{{\prime}}_{{53}} = w_{{53}} - \eta\frac{\delta J}{\delta w_{{53}}}$

  • Multivariate chain rule says sum path from $J$ to $w_{{42}}$ – looks kinda ugly

    • Computationally simple! Compute for specific input

    • Modular form means once things are computed, plug values in and compute for earlier layers
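
A tiny hand-rolled computation-graph example in the spirit of this section: a forward pass through $f(x) = \sigma(w \cdot x + b)$ followed by a backward pass applying the chain rule node by node. The values are arbitrary and this is not the exact slide example.

  import math

  # forward pass: a = w1*x1, c = w2*x2, z = a + c + b, f = sigmoid(z)
  w1, w2, b = 3.0, 1.0, -1.0
  x1, x2 = 1.0, 4.0
  a = w1 * x1
  c = w2 * x2
  z = a + c + b
  f = 1.0 / (1.0 + math.exp(-z))

  # backward pass: start from df/df = 1 and apply the chain rule edge by edge
  df_dz = f * (1 - f)           # derivative of the sigmoid node
  df_da = df_dz * 1             # z = a + c + b, so partials w.r.t. a, c, b are 1
  df_dc = df_dz * 1
  df_db = df_dz * 1
  df_dw1 = df_da * x1           # a = w1*x1
  df_dw2 = df_dc * x2           # c = w2*x2
  df_dx1 = df_da * w1
  df_dx2 = df_dc * w2

  print("gradient w.r.t. (w1, w2, b):", (df_dw1, df_dw2, df_db))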

Backprop algorithm

  • Submit input $x$, fed forward to outputs

  • Compute network loss

  • Prop error to compute loss gradient with respect to each weight, update weights

  • Automatic in TensorFlow, automatic differentiation based on computation graph

Putting Things Together

Hidden Layers

  • How many?

  • Deep networks build potentially useful representation via composition

    • Layer is vector-valued, non-linear

  • Performance improvement isn't only from more complex networks

  • Increasing layers increases chance of overfitting, must have lots of data with deep networks, needs longer training

  • Boolean functions representable with two layers

    • Bounded, continuous function represented with arbitrarily small error with two layers

    • Function with arbitrarily small error in three layers

    • Only an existence proof (no construction given)

    • May require exponentially many nodes in a layer

    • Might not be able to find correct weights

    • Overfitting and need for regularization noted

Regularization

  • Prevents overfitting

  • Add penalty term $R(\Theta)$, e.g., the sum of the magnitudes of the parameters

  • Other methods:

    • early stopping – when performance worsens on a validation set

    • Dropout – ignore random subset of edges

    • Data augmentation – copies of random instances with noise

    • Inject noise into hidden layers

    • Batch normalization of output before input to next layer

  • Be careful not to underfit either (first fit or overfit, then regularize)

Initialization!

  • Early on, random numbers near 0, $\mathcal{N}(0, 1)$

    • Nearly linear for logistic function, so GD works better

    • In deep networks the variance grows layer by layer, leading to poor optimization due to vanishing gradients

  • Glorot initialization per layer: if a layer has $n_{in}$ inputs and $n_{out}$ outputs, initialize uniformly over $[-r, r]$ or from $\mathcal{N}(0, \sigma)$

    • $r = a \sqrt{\frac{6}{n_{in} + n_{out}}}$

    • $\sigma = a \sqrt{\frac{2}{n_{in} + n_{out}}}$

    • $a = 1$ for logistic, $a = 4$ for tanh, $a = \sqrt{2}$ for ReLu

Optimizers

  • Variants on gradient descent

  • Momentum optimization

    • Momentum term $\beta$ keeps updates moving in same direction

    • Update looks like $m \leftarrow \beta m + \eta\nabla J(w)$, then $w^{\prime} = w - m$

    • Moves through small local minima to better ones and along a flat

  • AdaGrad

    • Standard descends too quickly, then crawls too slowly

    • $w^{{\prime}} = w - \eta\nabla J(w) \oslash \sqrt{s + \epsilon}$

    • $s = s + \nabla J(w) \otimes \nabla J(w)$

    • circled ops are elementwise

    • $s$ accumulates the squares of the gradient, and the learning rate is scaled separately for each dimension

  • RMSProp

    • Exponentially decays old gradients to address this, adds $\beta$ to do so.

  • Adam

    • Combines Momentum and RMSProp

    • Uses an iteration counter to bias-correct the moment estimates, which start near zero
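
A sketch of the momentum and AdaGrad updates as written above, with elementwise operations; the gradient function and hyperparameters are placeholder assumptions.

  import numpy as np

  def grad_J(w):
      """Placeholder gradient of a simple quadratic loss J(w) = 0.5 * ||w||^2 (assumption)."""
      return w

  w_mom = np.array([1.0, -2.0])
  m = np.zeros(2)
  w_ada = np.array([1.0, -2.0])
  s = np.zeros(2)
  eta, beta, eps = 0.1, 0.9, 1e-8

  for _ in range(100):
      # momentum: accumulate a velocity term and step along it
      m = beta * m + eta * grad_J(w_mom)
      w_mom = w_mom - m
      # AdaGrad: accumulate squared gradients, scale the learning rate per dimension
      g = grad_J(w_ada)
      s = s + g * g
      w_ada = w_ada - eta * g / np.sqrt(s + eps)

  print(w_mom, w_ada)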

Language Models

  • Recall language model can take input sequence, and predict probability over next words

  • Feed-forward does the same, takes input sequence, outputs probabilities

  • word2vec embeddings used

  • Pre-trained (faster) or trained with network (may be better)

  • Can fine-tune a pre-trained embedding to save time

  • Fixed word sequence input

    • Embedding learned with the model (or pre-trained, as above)

    • Softmax output compared to the target word

    • Trained with cross-entropy loss

Summary

  • Neural units take weighted sum, and applies non-linear activation

  • Multilayer networks used because single unit is limited

  • hidden layers allow learning of own representations – nonlinear feature extraction

  • Stochastic GD used for training

    • Calculated using back prop

  • Neural models predict word after sequence of words

    • Can use pre-trained embeddings or learn their own.

Lecture 8: Sequence Processing with Recurrent Networks

  • Earlier required fixed-size input

  • RNNs work on sequences, like text, biological sequences, video, audio, etc

  • Can create novel output, text in a specific style, music, etc

Basics

  • Recurrent cells have connections going backward as well as forward

    • At $t$, neuron receives input $x_{t}$ and its output $y_{{t - 1}}$

  • Recurrent layer can be built, where each node gets the vector $x_{t}$ and the vector $y_{{t - 1}}$ (that is, the whole layer's output)

  • Output depends on past, that is, memory/state

    • State at $t$ is $h_{t} = g(h_{t - 1}, x_{t})$, and output is $y_{t} = f(h_{t})$

    • $h_{t}$ then stores important information about input

    • Forward pass – evaluate forward by unrolling in time
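
A minimal numpy sketch of unrolling a simple recurrent layer forward in time, matching $h_{t} = g(h_{t-1}, x_{t})$ and $y_{t} = f(h_{t})$ above; weights and inputs are random placeholders.

  import numpy as np

  rng = np.random.default_rng(0)
  d_in, d_h, T_steps = 4, 3, 5
  Wx = rng.normal(size=(d_h, d_in)) * 0.5   # input-to-hidden weights
  Wh = rng.normal(size=(d_h, d_h)) * 0.5    # hidden-to-hidden (recurrent) weights
  Wy = rng.normal(size=(2, d_h)) * 0.5      # hidden-to-output weights
  b, by = np.zeros(d_h), np.zeros(2)

  xs = rng.normal(size=(T_steps, d_in))     # an input sequence
  h = np.zeros(d_h)                         # initial state
  for x_t in xs:
      h = np.tanh(Wx @ x_t + Wh @ h + b)    # h_t = g(h_{t-1}, x_t)
      y_t = Wy @ h + by                     # y_t = f(h_t)
      print(y_t)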

Training

  • uses backprop through time

  • Unroll through time, use backprop on expanded graph

  • Forward pass batch of sequences through network yields output

  • Output sequence evaluated using cost

  • Gradients propagated backward through unrolled network and params

  • Sequence loss, comparing two sequences

    • weighted average of cross entropy across sequence, which can target certain parts of sequence

  • BPTT means gradient flows farther, can "vanish" or "explode"

    • Can happen with any network, RNNs very susceptible

    • Clipping gradients to the range $[-1, +1]$ mitigates explosion

    • Can also do batch normalization

Applications

Recurrent NL Model

  • Can use recurrent architecture to model language

  • Can predict the next word in a sequence given the current word & previous hidden state

  • Train by feeding in a sequence of words, output a softmax over the next word, train with cross-entropy

Generating Sentences

  • Feed in a start, generate output, output fed into network as next input, continue until end or length limit

    • Hidden state fed in with random/all zeros

Labeling

  • Given sequence of words, categorize each

  • Softmax/argmax over classes, POS tagging, named entity recognition

  • Can also classify an entire sequence, called a sequence to vector model

    • ignore all output except last

    • Compute loss on final output and apply BPTT

Stacked & Bidirectional RNNs

  • Can stack RNN layers on each other

  • bidirectional – process separately from end towards start

    • run in opposite directions, outputs combine in some way (concat, add, etc)

    • Can be used for sequence classification

LSTM & Gated Units

  • Long-term memory – vanishing & exploding gradients can be problematic

  • Can suffer from long training times with long sequences

  • First inputs have diminishing impact

  • Add "long-term" memory

  • Accumulates information, and can decide to forget

LSTM

  • Vectors, $h_{t}$ is short-term, $c_{t}$ is long-term state

  • At $t$, some memories from $c_{{t - 1}}$ can be forgotten in forget gate, and new ones from input gate are added

  • The result is sent on as $c_{t}$; $h_{t}$ and $y_{t}$ come from processing the long-term state through the output gate

  • $g$ combines input with previous state

  • $f, i, o$ are gate controllers, vectors

  • $f$ has size $n$, with elements close to 0 or 1

  • $i$ controls remembering $g$

  • $o$ controls what long term goes out and to hidden

  • The network learns what to remember based on the input and the last hidden state

  • A peephole connection can be added, letting the previous long-term state affect the gate controllers
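
A numpy sketch of one LSTM step following the gate description above ($f$, $i$, $o$ controllers and candidate $g$ producing $c_{t}$ and $h_{t}$); all weights are random placeholders and no peephole connections are included.

  import numpy as np

  rng = np.random.default_rng(1)
  d_in, d_h = 4, 3
  # one weight matrix and bias per gate/candidate, acting on [x_t, h_{t-1}]
  Wf, Wi, Wo, Wg = (rng.normal(size=(d_h, d_in + d_h)) * 0.5 for _ in range(4))
  bf = bi = bo = bg = np.zeros(d_h)

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev):
      v = np.concatenate([x_t, h_prev])
      f = sigmoid(Wf @ v + bf)          # forget gate: what to drop from c_{t-1}
      i = sigmoid(Wi @ v + bi)          # input gate: how much of g to remember
      o = sigmoid(Wo @ v + bo)          # output gate: what long-term state to expose
      g = np.tanh(Wg @ v + bg)          # candidate combining input and previous state
      c_t = f * c_prev + i * g          # new long-term state
      h_t = o * np.tanh(c_t)            # new short-term state (also the output y_t)
      return h_t, c_t

  h, c = np.zeros(d_h), np.zeros(d_h)
  h, c = lstm_step(rng.normal(size=d_in), h, c)
  print(h, c)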

GRU

  • Simplified version of LSTM

  • Merges long-term and short-term states

  • Merges forget and input controls

Words, Subwords, Characters

  • Character level modeling sometimes used

    • Can be helpful to have more detailed information

    • Can use other RNNs to process at finer level and combine with word-level embeddings

  • Can be done using a bi-RNN over characters, concatenating its outputs, giving a character-level word embedding

Lecture 9: Encoder-Decoders, Attention and Contextual Embeddings

  • Text generation of model generalized/applied (Machine Translation)

  • Encoder-decoder architecture

    • Encoder takes input, produces embedded representation, context

    • Decoder generates output from context

  • Generalized further by going from context vector to sequence contexts and attention mechanism

Text Generation

  • NLM can generate text sequence

  • Must feed start symbol

    • $y_{{t}} = f(h_{t})$, $h_{t} = g(h_{t - 1}, x_{t})$, $x_{t} = y_{t - 1}$ (this is an autoregressive model)

  • Can generalize from a prefix (NLM completes existing sequence)

    • Prefix input triggers hidden state vectors, outputs from prefix ignored

    • hidden state captures context

  • Can be used for MT

    • NLM trained on bitexts – sentence concatenated with translation

    • Feed sentence 1 then generate translation

Encoder-Decoder Network

  • Previous model generalizes to encoder-decoder

  • Encoder input generates states

  • Context vector is a function of hidden states, the embedded representations

  • Decoder takes $c$ and creates new states and outputs

  • Encoder can be any RNN, stacked biLSTM, etc (same with decoder)

    • $c$ is, e.g., the final-step, top-layer encoder hidden states concatenated

    • Initialise decoder's state = c

    • Decoder state is function depending on output, previous state, context

Attention

  • Context vectors passed from encoder to decoder are embedded representation of input sequence

    • $c$ represents the entire input

    • Can't ensure context vector is sufficiently large to represent context!

    • Can use attention mechanism

      • context becomes sequence of vectors

      • Decoder focuses on relevant subset

  • Static context vector becomes sequence of context vectors

    • Uses a score function, $score(h_{i - 1}^{d}, h_{j}^{e})$, giving the relevance of encoder state $j$ to the current decoder state; often a dot product or a bilinear form with weights that can be learned

    • Concatenate scores over j into relevance vector

    • Normalize the scores with a softmax, then take the weighted sum over $j$ of the encoder states

  • Architecture

    • Compute context vector at each step

  • Attention $\alpha_{{ij}}$ shows probability that output $y_{i}$ is aligned to a given input $x_{j}$
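
A sketch of attention for one decoder step, as described above: score each encoder state against the previous decoder state, softmax the scores, and take the weighted sum as the context vector. Shapes are placeholders, and a plain dot product stands in for a learned bilinear score.

  import numpy as np

  rng = np.random.default_rng(2)
  enc_states = rng.normal(size=(6, 8))      # h^e_j for j = 1..6 (placeholder encoder states)
  dec_state = rng.normal(size=8)            # h^d_{i-1} (previous decoder state)

  scores = enc_states @ dec_state           # score(h^d_{i-1}, h^e_j) as dot products
  alpha = np.exp(scores - scores.max())
  alpha = alpha / alpha.sum()               # softmax over encoder positions
  context = alpha @ enc_states              # c_i = sum_j alpha_ij * h^e_j

  print(alpha)      # attention weights: where the decoder "looks"
  print(context)    # context vector fed into decoder step i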

Lecture 9: Information Extraction

  • nat lang is unstructured, difficult to analyze automatically

  • IE identifies important information and puts it into structured form

    • named entity recognition (identify type of nouns)

    • relation extraction (ident. semantic relations among entities)

    • event extraction (ident events and organize temporally)

    • template filling (analyze docs, fill out standard forms)

Named Entity Recognition

  • Named entities are anything referred to by a proper name

  • Requires identifying what text is the entity (using a span of text), and then classifying it

    • Can classify as people, organizations, locations, geo-political entities, facilities, etc.

  • IO/IOB tagging

    • IO uses inside/outside tags; IOB adds a beginning tag

Neural Method

  • Character level embeddings, GloVe word embeddings

  • BIDI LSTM outputs concatenated and fed into softmax

  • Output is distribution over IOB labels

  • Ignores dependencies

    • Conditional random field used to constrain outputs

    • Probabilistic graphical model, edges denote influence

  • $x$ is input, $y$ is possible predictions,

    • $s(x, y) = \sum_{i} A_{y_{i},y_{i + 1}} + \sum_{i} P_{i,y_{i}}$

    • $P$ is the softmax probability, $A$ holds transition scores from tag $i$ to tag $j$

    • Then, calculate softmax over scores

Evaluation

  • Recall, Precision, $F$ Measure used

  • Be careful with interpreting results: what may initially appear as one error can be multiple

Relation Extraction

  • Comes after NER

  • Relations depend on domain

  • Classes exist for each entity

  • Then must learn to classify and connect relations

  • Ordered tuples, many different forms of notation

  • Relations depend on domain, and many have specific models

Pattern-Based

  • Uses static patterns to recognize relations

  • uses templates, and can be very precise, because it can be easily customized for a domain, but may lack in recall

Supervised

  • Given a trained classifier for each

  • Enumerate all entity pairs and find those that are related

  • Classify relation of related pairs

  • Classifier can be whatever works for the domain

  • What features?

    • Headwords/concatenations

    • Bags-of-words, bigrams

    • NER types

    • Syntactic models

Semi-supervised

  • Labeled data is scarce, use seed patterns/seed tuples

  • Build a bootstrap relation set

  • Then use similarity to find new, high-confidence tuples, for a given relation

  • Extract patterns using these tuples

  • Weight patterns with confidence based on known tuples

Distant Supervision

  • Use a database like DBPedia, acquires large number of seeds

  • Identify matching sentences, extract rich feature set

Extracting Times

  • Similar to NER, can differentiate between absolute, relative and durations

  • Rule-based, sequence labeling

  • Normalized with respect to specific date for calculation/comparison

  • ISO standard provides XML tags

Event Extraction

  • Ident events in text, assignable to points in time

  • IOB tagging via supervised learning

  • Tends to be built on verbs

  • Features depend on many things

  • Timing information can provide temporal ordering, using normal relation extraction

  • Time-bank corpus can be used

Template Filling

  • Some situations/events have standard information

  • Templates represent these, with slots to fill

  • Recognize templates, then extract and fill slots.

Lecture 10: Question Answering

  • Factoid questions – answers are simple facts in short text expressions

  • IR-Based – search for span of text to return answer

  • Knowledge-based builds a semantic representation and uses deduction to answer from a KB

  • Hybrid uses a combination of the two

Information-Retrieval Based

  • Given question, find short snippet from a collection

  • Process question for IR to analyze answers

    • Identify relevant documents

    • Extract passages

    • Filter to identify relevant passages

  • Must also do answer-type detection, decides what the answer will look like in the end

  • Question Processing

    • Identifies a query, and other information

    • Identifies focus

    • Identifies question type

    • Extracts tokens to submit

      • query is tf-idf and cosine similarity, can be enhanced with bigrams

      • If documents are limited, query expansion can be used (adding terms with a thesaurus, varied suffixes)

    • Answer types look at Named Entity types, filter IR on that

      • Can use taxonomy of types (WordNet)

      • Classifier trained, uses embeddings of words, etc

  • Passages come from ranked set

    • identify relevant paragraphs/sentences

    • partition documents into passages

    • NER used for filtering

    • Rank with learned model

  • Answers extracted with Span labeling

    • Return matching named entity if answer type is specified

    • Problem when answer type is unavailable/non-existent (consider definition questions)

      • Use question keywords matching

      • words not in query

      • Length of sequence of keywords in candidate

      • hand-crafted patterns matching candidate answers

    • Neural Methods

      • DNN, uses SQuAD; computes embeddings for questions and embeddings for tokens

      • Compare embeddings, uses reading comprehension tasks

      • SQuAD dataset

      • Question $q$, passage $p$; $p_{s}(i)$ is the probability that $p_{i}$ starts the answer span, $p_{e}(i)$ the probability that $p_{i}$ ends it

      • Similarity scores used to predict start/end of answer span

      • Bi-LSTMs with attention, for question and passage; similarity scores used

      • $E$ is glove embedding

      • Input sequence of embeddings to get states

      • Combine into a query vector, being a weighted softmax sum, with $b$ and $w$ being learned weights

      • Passage embedded to vector sequence compared to $Q$

        • Computed similarly, using GloVe embeddings and POS/NER features

      • Align attention between question and passage

Knowledge-Based

  • Answers in structured DB, no need for IR

  • Must be able to create a query

  • Requires Semantic Parsing to form SQL or predicate Calculus

  • Can be done using hand-written features or learned

  • Supervised

    • Dependency parsing, has things like tmod, nsubj

    • Uses labeled examples (question-query pairs)

    • learn general rules from inductive logic programming

  • Semi-supervised

    • distant supervision/open information extraction

    • REVERB is an Open extractor

    • Extracts triples of strings and aligns to things like FreeBase (or other structured relations)

    • Match subject and argument to relations

    • Then learn mapping between natural language and FreeBase

DeepQA

  • IBM Watson!

  • Clinical Decision Support

  • Uses both methods

  • Many candidates proposed and scored by evidence

  • Stages

    • Question Processing

    • Answer generation

    • Answer Scoring

    • Confidence merging/ranking

  • Parsing, NER, relation extraction

  • Focus, answer type, question classification

  • Generate answers from structure/text sources

  • Queries using extracted relations

  • Eliminates stopwords, upweights terms from focus

  • Generates score vector from type, time/space relations, supporting evidence

  • Merge based on equivalence/combine scoring features

Evaluation

  • Mean Reciprocal Rank

    • Assume questions answers provided

    • If system under test returns list of answers, score is reciprocal of rank of first correct

    • Rank is rank of highest correct answer

  • Exact Match – fraction of predictions that exactly match a gold answer

  • $F_{1}$ score – harmonic mean of precision and recall

  • Datasets like SQuAD, TREC QA, TriviaQA, WikiQA
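
Small sketches of mean reciprocal rank and a token-overlap $F_{1}$, in the spirit of the measures above; the example answers are placeholders and real evaluations (e.g., SQuAD) normalize text more carefully.

  def mrr(ranked_answers_per_q, gold_per_q):
      """Mean reciprocal rank: 1/rank of the first correct answer per question (0 if none)."""
      total = 0.0
      for ranked, gold in zip(ranked_answers_per_q, gold_per_q):
          rr = 0.0
          for rank, ans in enumerate(ranked, start=1):
              if ans in gold:
                  rr = 1.0 / rank
                  break
          total += rr
      return total / len(gold_per_q)

  def span_f1(pred_tokens, gold_tokens):
      """Token-overlap F1 (unique tokens) between a predicted and a gold answer span."""
      common = len(set(pred_tokens) & set(gold_tokens))
      if common == 0:
          return 0.0
      p = common / len(pred_tokens)
      r = common / len(gold_tokens)
      return 2 * p * r / (p + r)

  print(mrr([["a", "b", "c"], ["x", "y"]], [{"b"}, {"z"}]))   # (1/2 + 0) / 2 = 0.25
  print(span_f1("the eiffel tower".split(), "eiffel tower".split()))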

Lecture 11: Summarization

  • Similar to QA, but distills important information

    • Outlines

    • Extracts

    • Abstracts

    • Summary of threads

  • Single/Multiple documents

  • Generically, or focused on query

  • Steps

    1. Content Selection – Chose text to use

    2. Information Ordering – Decide how to order

    3. Sentence Realization – turn into readable sentences

Single-Document

  • Single Document

  • Headline/outline brief extract

Content Selection

  • Binary classifier (important vs unimportant)

  • Unsupervised using salience

    • TF-IDF weight of term in document (tf: log of count + 1; idf: log of number of docs over document frequency)

    • Log likelihood ratio, $\lambda(i)$: the difference in how likely the term is in the input documents vs. in a background corpus (explicitly probabilistic); if $-2\log(\lambda(i)) > 10$ (or another empirically chosen threshold), the term is significant

  • Supervised

    • Training set with human-created summary labels

    • Extracts: label sentence with 1 if in summary

    • Extract features and train classifier

    • But this requires that the summaries be extracts, or that source and summaries be aligned to find the best matches

      • Longest common subsequence, edit distance, hidden Markov models can be used for this

  • Features can be labeled with things like

    • position where sentence appears

    • Cue phrases

    • Informativeness, topic signature

    • Sentence length

    • Cohesion – relatedness

Information Ordering & Sentence Realization

  • Single document preserves order of sentences in order of the document

  • Simplify/compress extracted sentences removing less relevant to summary

  • Rule-based or Machine-Learned

Multi-Document

  • A group of related documents

  • News stories on an event, etc.

Content Selection

  • Unsupervised approaches best (little data compared to single-document)

  • Similar methods

  • But, must be aware of redundancy across multiple documents

  • Can add a redundancy penalty term to each sentence, based on, e.g., cosine similarity or alignment with already-selected content

  • Clustering related sentences and choosing center is also possible

Information ordering

  • Can't sequence in order of appearance in a document

  • News articles or documents with dates can sort chronologically

  • Must maintain coherence

    • Adjacent sentences: $s_{1}$ causes $s_{2}$

    • Lexical chains: synonyms/hypernyms

    • Consistency: pronoun reference to same noun consistent and unambiguous

  • For each sentence, pairwise compatibility is a graph problem

    • Sentences are nodes, edge weights are compatibilities

    • Permutation maximizing sum of compatibilities

    • Form of TSP – intractable, heuristics exist

Sentence Realization

  • Still might need to improve coherence

  • Coreference resolution, apply some rewrite rules

Focused

  • Summary in response to particular queries

  • QA but for non-factoid QA

  • Many documents potentially

  • Overlap with query, cosine similarity

  • Specialized systems for query type

  • Information Extraction to identify important information

Evaluation

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) N-gram overlap between generated summaries and ground-truths

  • Gross equation (slide 20)

  • For each bigram, how many times did generated summary use it?

  • Normalized with total bigrams in reference

  • A recall measure

  • Rouge-L measures longest-common subsequence

  • Rouge-SU measures skip bigrams (additional words between pairs)

  • The pyramid method scores summary content units (units of meaning), weighting the more relevant ones higher
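
A sketch of ROUGE-2 as a recall measure, in line with the description above: count the reference bigrams the generated summary reproduces (clipped), normalized by the total reference bigrams; the exact slide formula may differ.

  from collections import Counter

  def bigrams(tokens):
      return Counter(zip(tokens, tokens[1:]))

  def rouge2_recall(candidate, reference):
      cand, ref = bigrams(candidate.split()), bigrams(reference.split())
      overlap = sum(min(cand[bg], ref[bg]) for bg in ref)   # clipped bigram matches
      return overlap / sum(ref.values())                    # normalized by reference bigrams

  print(rouge2_recall("the cat sat on the mat", "the cat lay on the mat"))   # 0.6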

Lecture 12: Machine Translation

  • challenging, complex

  • Must be quick, compact, and often deals with low data transmission

  • Many variations despite fundamental similarities

  • Can't translate word-for-word, often need to add, align, delete

  • Different kinds

    • Rough, basic ideas, web search, tourism

    • Human post-editing for high-volume computer-aided human translation (CAHT)

    • Fully automatic high-quality

Classical Machine Translation

  • Direct – word-by-word through input text

  • Transfer – parse, translate structure, generate output

  • Interlingua – parse input, translate to intermediate representation, generate output

  • Vauquois Triangle captures approaches

Direct

  • Morphological analysis, getting word forms

  • Dictionary translates individually

  • Reorder by rules

  • Morphological generation, ensures correct form

  • Ignores structure, idiom, etc

Transfer

  • Alter structure to conform

  • Leverages contrastive knowledge, differences between two specific languages

  • Steps

    1. Analysis

    2. Transfer

      • Syntactic, fixes syntax

      • Lexical, meaning from input to output

    3. Generation

Interlingua

  • Language-independent form

  • Meaning is in a standard form

  • Semantic analyzer maps inputs to this

  • Representation could be predicate calculus, semantic decomp, event-based representations, etc

Statistical Machine Translation

  • A "noise channel model"

    • Probabilistic, outputs most probable translation

    • Given sentence $F$, $P(E|F)$ is probability output sentence is correct translation

    • Search for argmax $E$, by Bayes's rule

    • Probability of sentence $P(E)$ is from language model

    • $P(F|E)$ is from translation model

    • A decoder is used to output $E$

    • $F$ is a noisy translated $E$, and the goal is to recover $E$

Models

  • Phrase Based

    • Give $E$ and $F$, could go word by word, instead, use phrases

  • Group words into phrases; $I$ is the number of phrases in the English sentence, $J$ the number of output phrases

  • Reorder

  • $\phi(f|e)$ is translation probability

  • $d(a_{i} - b_{i-1}) = \alpha^{|a_{i} - b_{i-1} - 1|}$, where $\alpha$ between 0 and 1 is the distortion parameter, penalizing how far phrases move

  • Translation probability is the product over phrases of $\phi(f_{i}|e_{i})\, d(a_{i} - b_{i-1})$

    • Must train $\phi$!

    • Can be done with count/normalize method

    • Can count if training corpus includes translations and phrase alignments

    • Word alignments between inputs and outputs could also be used as phrase mappings

    • HMMs are generative models for sequences

    • Can use the Viterbi algorithm to find the most likely state sequence, and the Forward algorithm to compute a sequence's probability

    • Alignment Problem with $I$ states for each word, emission distribution is over output words

    • Most likely path generating sequence yields alignment

    • Symmetrization used for multi-word phrase alignment

    • All states are bidirectionally connected, including self-loops, each with a transition probability

    • Transition probabilities are based on a non-negative function $c$, on a jump width $i - i^{{\prime}}$

      • Neighboring words in input tend to neighbor in output

      • State-sequences not known

      • The Baum-Welch (EM) algorithm is used to train HMMs when state sequences are unknown

        • Starts with arbitrary values

        • Estimate expected counts

        • Use expected to re-estimate

        • Repeat until convergence
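
A minimal sketch of scoring one phrase segmentation with $\phi$ and the distortion $d$ in log space, as in the product above; the phrase table and alignment positions are toy values of my own:

```python
import math

ALPHA = 0.5  # distortion parameter, 0 < alpha < 1

def distortion(a_i, b_prev):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|."""
    return ALPHA ** abs(a_i - b_prev - 1)

def phrase_score(aligned_phrases, phi):
    """Sum of log phi(f|e) + log d(...) over aligned phrases.

    aligned_phrases: list of (f_phrase, e_phrase, start, end) in output order,
    where start/end are the source-side positions covered by the phrase."""
    log_p, b_prev = 0.0, 0  # b_0 = 0 by convention
    for f, e, start, end in aligned_phrases:
        log_p += math.log(phi[(f, e)]) + math.log(distortion(start, b_prev))
        b_prev = end
    return log_p

# Toy phrase table with hypothetical probabilities
phi = {("maria", "mary"): 0.9, ("no dio una bofetada", "did not slap"): 0.4}
print(phrase_score([("maria", "mary", 1, 1),
                    ("no dio una bofetada", "did not slap", 2, 5)], phi))
```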

Decoder

  • Takes input $F$ and outputs the translation $E$ that maximizes the score

  • Searches through candidate space

  • Can't use gradient descent, as the search space is not continuous

  • Exponential number of candidates, so beam search/stack decoding is used for efficiency (see the sketch after this list)

  • Initialize the stack with the null (empty) hypothesis

    • Pop the best hypothesis

    • If complete, return it; else

    • For each expansion, score and push it onto the stack

  • Cost is a combination of the probability so far and an approximate cost of completion
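
A minimal sketch of stack decoding on a toy, monotone word-by-word problem: pop the best partial hypothesis, return it if complete, otherwise expand, score, and push. The translation options and the optimistic completion-cost estimate are toy assumptions of my own:

```python
import heapq, math

# Toy translation options per source word: (translation, log prob). Hypothetical values.
OPTIONS = {
    "la":    [("the", math.log(0.7)), ("it", math.log(0.1))],
    "casa":  [("house", math.log(0.8)), ("home", math.log(0.2))],
    "verde": [("green", math.log(0.9))],
}

def future_cost(source, pos):
    """Optimistic estimate of completion: best log prob for each uncovered word."""
    return sum(max(lp for _, lp in OPTIONS[w]) for w in source[pos:])

def stack_decode(source):
    # Each entry: (-(score so far + future cost), next source position, partial output)
    heap = [(-future_cost(source, 0), 0, [])]      # start from the null hypothesis
    while heap:
        neg_total, pos, output = heapq.heappop(heap)   # pop the best hypothesis
        if pos == len(source):                         # complete: return it
            return " ".join(output)
        score = -neg_total - future_cost(source, pos)  # recover the score so far
        for word, lp in OPTIONS[source[pos]]:          # expand, score, push
            new_score = score + lp
            heapq.heappush(heap, (-(new_score + future_cost(source, pos + 1)),
                                  pos + 1, output + [word]))

print(stack_decode(["la", "casa", "verde"]))  # "the house green"
```

A real decoder also handles phrase segmentation and reordering; this sketch only shows the search loop.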

Evaluation

  • Human-based: people evaluate based on fluency, fidelity, how much is necessary to "fix" output

  • BLEU

    • Given output and reference, measure closeness

    • Weighted average of $N$-gram overlaps with reference translations

    • Modified precision: each $N$-gram's count is clipped by its maximum count in the reference

    • $N$-grams for $N$ from 1 to 4

    • A brevity penalty is also applied, penalizing outputs shorter than the reference (see the sketch after this list)

    • Easy to calculate, widely used

    • Doesn't consider meaning

    • Doesn't handle syntax

    • Not as good as human judgement
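
A minimal single-reference sketch of BLEU's pieces as described above (clipped modified precision for $N = 1..4$, geometric mean, brevity penalty); real implementations handle multiple references and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(zip(*[tokens[i:] for i in range(n)]))

def modified_precision(candidate, reference, n):
    """Candidate n-gram counts clipped by their counts in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:            # avoid log(0) in this toy version
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

print(bleu("the cat sat on the mat today".split(),
           "the cat sat on the mat".split()))
```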

Lecture 13: Coreference Resolution

  • Identify noun phrases referring to same entity, that is, those that corefer to the same referent

  • Can be nested, important for MT for correct pronouns

  • Based on a discourse model built incrementally from entities in the text and their relationships; a first mention evokes an entity, later mentions access it

  • Can also map to the real-world, entity linking to ontology

Background

  • Can refer

    • indefinite noun phrases

    • definite noun phrases

    • pronouns

    • demonstratives

    • non-stated pronouns

    • names

  • Information status

    • new/old

    • New could be new or old to hearer (if part of common knowledge)

    • Old were introduced previously

    • Some are inferred, based on known properties

  • Non referring expressions exist

    • Appositives

    • Predicatives

    • Expletives

    • Generics

Tasks and datasets

  • Detect and cluster mentions

  • OntoNotes and ARRAU are common datasets

Mention Detection

  • Identify spans of text for each mention: retrieve candidates and filter, considering all $N$-grams of length $\leq 10$ (see the sketch below)

  • Filter extracted candidates based on noun phrases, possessives, named entities, etc.
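
A minimal sketch of candidate-mention enumeration as above (all spans of at most 10 tokens); the filter is only stubbed here, where a real system would keep noun phrases, named entities, possessives, etc.:

```python
def candidate_spans(tokens, max_len=10):
    """All contiguous spans of length <= max_len, as (start, end) index pairs."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans

def keep_span(tokens, start, end):
    """Stub filter: in practice, keep NPs, named entities, possessives, etc."""
    return True

tokens = "Victoria Chen , CFO of Megabucks Banking , saw her pay jump".split()
mentions = [(s, e) for s, e in candidate_spans(tokens) if keep_span(tokens, s, e)]
print(len(mentions), mentions[:3])
```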

Architectures/Classifiers

  • Entity-based represents each entity in a discourse model

  • Mention-based consider mentions independently

  • Models can also use ranking to compare antecedents

  • Mention-pair takes pair of mentions as input, outputs binary classification of coreference

    • hand-coded, or neural network

    • Each pair used for training

    • For training, pair each mention with its antecedent to create positive examples, and with some $k$ intervening non-coreferent mentions to create negative examples

    • Cluster mentions by linking each mention to its best-scoring pair and taking the transitive closure (see the sketch at the end of this section)

  • Mention ranking requires integrating resolution with anaphoricity detection by adding a null antecedent

    • Estimate probabilities over candidate antecedents

    • At test time, link each mention to its top-ranked antecedent and take the transitive closure

  • Entity-based

    • Links mentions to discourse entities/mention clusters

    • Can use mention ranking approach, adding features

    • Or trains RNN to process clusters and combine hidden states

  • Handcoded features

    • Words/embeddings

    • Role

    • Length

    • Matching strings, etc
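
A minimal sketch of building entity clusters from pairwise decisions by taking the transitive closure (via union-find); `coreferent` is a hypothetical stand-in for the trained mention-pair classifier:

```python
def cluster_mentions(mentions, coreferent):
    """Union-find over mentions; link the pairs the classifier accepts, then
    the connected components are the transitive closure (entity clusters)."""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, m1 in enumerate(mentions):
        for m2 in mentions[i + 1:]:
            if coreferent(m1, m2):
                parent[find(m2)] = find(m1)   # union the two clusters

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# Toy pairwise decisions standing in for a trained classifier
links = {("Victoria Chen", "she"), ("she", "her"), ("Megabucks", "the company")}
coref = lambda a, b: (a, b) in links or (b, a) in links
print(cluster_mentions(["Victoria Chen", "she", "her", "Megabucks", "the company"], coref))
```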

Neural Models

  • Span is contiguous sequence of $\leq 10$ words

  • For a span, estimate the probability that each previous span is its antecedent

    • $P(y_{i}) = \frac{\exp(s(i,y_{i}))}{\sum_{y^{\prime} \in Y(i)}\exp(s(i,y^{\prime}))}$

    • $s(i,j) = m(i) + m(j) + c(i,j)$

    • $m$ scores whether a span is a mention, $c$ scores whether span $j$ is an antecedent of span $i$

      • $m$ is a feed-forward neural network

      • similar for $c$

  • Feed-forward NNs take input of vector representation (bi-LSTM w/ attention output)

  • Train by maximizing the sum of probabilities of antecedent pairs in the ground truth (see the sketch after this list)

    • Minimizing cross-entropy
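
A minimal numpy sketch of the scoring above, $s(i,j) = m(i) + m(j) + c(i,j)$, with a softmax over each span's candidate antecedents plus a dummy antecedent scored 0; the linear scorers here are stand-ins for the real feed-forward networks and bi-LSTM span representations:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                          # dimensionality of the span representations
W_m = rng.normal(size=D)       # stand-in for the mention-scoring FFNN m(.)
W_c = rng.normal(size=(D, D))  # stand-in for the pairwise FFNN c(., .)

def m(g):            # mention score of a span representation g
    return float(W_m @ g)

def c(g_i, g_j):     # compatibility score of span i with candidate antecedent j
    return float(g_i @ W_c @ g_j)

def antecedent_distribution(spans, i):
    """P(y_i = j) over candidate antecedents j < i, plus a dummy antecedent
    (score fixed at 0) meaning 'no antecedent'."""
    scores = [0.0]  # dummy antecedent
    for j in range(i):
        scores.append(m(spans[i]) + m(spans[j]) + c(spans[i], spans[j]))
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

spans = rng.normal(size=(4, D))           # pretend bi-LSTM span representations
print(antecedent_distribution(spans, 3))  # distribution over [dummy, span0, span1, span2]
```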

Evaluation

  • Precision and recall and combination, $F$-measure

  • $H$ is hypotheses, $R$ is references

    • Precision: $|H \cap R|/|H|$

    • Recall: $|H \cap R|/|R|$

    • $F$-measure: $2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$ (see the sketch after this list)

    • MUC $F$-measure: $H$ and $R$ are links (pairs of mentions)

    • $B^{3}$: $H$ and $R$ are individual mentions
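
A minimal sketch of the link-based precision/recall/$F$ formulas above, treating $H$ and $R$ as sets of mention pairs (the official MUC scorer counts links over partitions, so this is only the textbook formula):

```python
def prf(hypothesis_links, reference_links):
    """Precision, recall, F over sets of coreference links (mention pairs)."""
    h, r = set(hypothesis_links), set(reference_links)
    overlap = len(h & r)
    precision = overlap / len(h) if h else 0.0
    recall = overlap / len(r) if r else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

H = {("Victoria Chen", "she"), ("she", "her"), ("Megabucks", "it")}
R = {("Victoria Chen", "she"), ("she", "her"), ("her pay", "it")}
print(prf(H, R))   # (2/3, 2/3, 2/3)
```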

Entity Linking

  • Connected to coreference resolution; links mentions to real-world entities in an ontology

  • Detect mentions, disambiguate with supervised learning

  • They can help each other

Winograd Schema Problems

  • Kind of a sentence completion

  • Want to connect ambiguous references

  • Integrating world knowledge can assist in these issues

  • The Winograd Schema Challenge includes coreference problems that people solve easily with world knowledge but that resisted automated methods

  • Somewhat solved with huge models

Gender Bias

  • Biases exist

  • Which pronoun connects more frequently with certain professions

  • Biased datasets

  • Combining a gender-swapped dataset with the original can help reduce bias

  • WinoBias, GAP datasets available

Lecture 14: Word Senses and WordNet

  • Dealing with ambiguity is hard

  • Words can have multiple senses (polysemous words), depending on context

  • WordNet is database of senses and relations

  • Word Sense Disambiguation determines the sense of a polysemous word based on context

Word Senses

  • Discrete representation of aspect of meaning

  • Discrete valued senses distinct from continuous embedding

  • Relationships between senses exist and can enhance semantics

  • Criteria for distinct senses: independent truth conditions, different syntactic behavior, independent sense relations, or antagonistic meanings

  • Zeugma (conjoining two uses of a word in one construction) is used as a test, e.g., with "serve"

Relations Between Senses

  • Synonyms

  • Antonyms

  • Taxonomic, IS-A

  • Meronymy, PART-OF

  • Structured relationships, author/works-of-author

WordNet

  • 150k+ words, nouns, verbs, adjectives, adverbs

  • Each has definition, synonyms, usage examples, etc

  • Synonym sets (synsets) are sets of senses; WordNet is organized around senses, not words

  • Supersenses, categories of senses

  • Sense relations

    • Hypernyms, hyponyms

    • Instance hyper-/hyponyms

    • Meronyms/part meronyms

  • Can form a graph structure with labels on edges (see the sketch below)
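
A minimal sketch of browsing synsets and sense relations with NLTK's WordNet interface (assumes `nltk` is installed and the WordNet data has been fetched with `nltk.download('wordnet')`):

```python
from nltk.corpus import wordnet as wn

# All senses (synsets) of the word "bass"
for syn in wn.synsets("bass"):
    print(syn.name(), "-", syn.definition())

dog = wn.synset("dog.n.01")
print("Synonyms:", [lemma.name() for lemma in dog.lemmas()])
print("Hypernyms:", dog.hypernyms())          # IS-A parents
print("Hyponyms:", dog.hyponyms()[:3])        # more specific kinds of dog
print("Part meronyms:", dog.part_meronyms())  # PART-OF relations
```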

Word Sense Disambiguation

  • Chooses correct sense based on context

  • WordNet, application-specific dictionary, translations, etc.

  • Semantic Concordance – corpus with words labeled with sense

  • SemCor labels Brown corpus with WordNet

  • Can map from Wikipedia/Wikidata/Wiktionary

  • Evaluate with $F_{1} =2PR/(P+R)$ precision/recall based

  • Baseline is most-frequent sense

Embeddings: Contextual

  • ELMo or BERT often used for contextual embeddings

    • Embeddings depend on word's context

    • Different from context free embeddings like word2vec or GloVe

    • BERT trained by masking random words and predicting

  • For each token labeled with a specific sense, compute its BERT embedding; the sense embedding is the mean of these contextual embeddings

  • To predict a sense, compute the contextual embedding and take the nearest neighbor among the sense embeddings (see the sketch after this list)

  • A sense can be absent from the training data; in that case, look up other senses in its synset, hypernyms, etc., and average their embeddings!
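
A minimal numpy sketch of the nearest-neighbor procedure above; `contextual_embedding` is a hypothetical stand-in for a contextual encoder such as BERT, and the sense labels are made up:

```python
import numpy as np

def contextual_embedding(sentence, target):
    """Hypothetical stand-in for a contextual encoder (e.g. BERT) that
    returns the target word's vector in this sentence."""
    rng = np.random.default_rng(abs(hash((sentence, target))) % (2**32))
    return rng.normal(size=16)

def build_sense_embeddings(labeled_examples):
    """Sense embedding = mean of contextual embeddings of its labeled tokens."""
    by_sense = {}
    for sentence, target, sense in labeled_examples:
        by_sense.setdefault(sense, []).append(contextual_embedding(sentence, target))
    return {sense: np.mean(vecs, axis=0) for sense, vecs in by_sense.items()}

def disambiguate(sentence, target, sense_embeddings):
    """Pick the sense whose embedding is nearest (cosine) to the context vector."""
    v = contextual_embedding(sentence, target)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(sense_embeddings, key=lambda s: cos(v, sense_embeddings[s]))

examples = [("he sat on the river bank", "bank", "bank%shore"),
            ("she deposited money at the bank", "bank", "bank%institution")]
senses = build_sense_embeddings(examples)
print(disambiguate("fish swam near the bank", "bank", senses))
```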

Feature-Based

  • Supervised training of classifier

  • Features are words, embeddings, PoS around, etc.

Lesk algorithm

  • No model

  • Knowledge based

  • For $w$, choose the sense whose dictionary description (gloss) overlaps most with the context (see the sketch below)
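
A minimal sketch of simplified Lesk over NLTK's WordNet glosses and examples (assumes the WordNet data is available via `nltk.download('wordnet')`); stop-word removal is skipped for brevity:

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    """Choose the sense whose gloss + examples overlap most with the context."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "the boat ran aground on the muddy bank of the river")
print(sense, "-", sense.definition() if sense else "no sense found")
```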

Induction

  • Unsupervised

  • Induce set rather than using pre-defined

  • For each word $w$ in the vocabulary

    • For each instance of $w$ in the corpus

    • Compute a context vector

    • Cluster these vectors (each cluster is a sense of $w$)

    • The centroid/median vector of each cluster is the sense vector for sense $j$ of $w$

    • When processing new text, assign the nearest sense vector (see the sketch below)
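
A minimal sketch of the induction loop above using scikit-learn's KMeans; `context_vector` is a hypothetical stand-in for whatever context representation is actually used (neighboring words, contextual embeddings, etc.):

```python
import numpy as np
from sklearn.cluster import KMeans

def context_vector(sentence, target):
    """Hypothetical stand-in for a real context representation."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=16)

def induce_senses(word, corpus_sentences, n_senses=2):
    """Cluster context vectors of a word's occurrences; each centroid is a sense vector."""
    instances = [s for s in corpus_sentences if word in s.split()]
    X = np.stack([context_vector(s, word) for s in instances])
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_            # one sense vector per induced sense

def assign_sense(sentence, word, sense_vectors):
    """Label a new occurrence with the nearest induced sense vector."""
    v = context_vector(sentence, word)
    return int(np.argmin(np.linalg.norm(sense_vectors - v, axis=1)))

corpus = ["the bank approved the loan", "he fished from the bank",
          "the bank raised interest rates", "reeds grew along the bank"]
senses = induce_senses("bank", corpus)
print(assign_sense("she opened an account at the bank", "bank", senses))
```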