Notes

Notes, etc. of Samuel Flint

Introduction to Natural Language Processing

Lecture 1: Introduction

  • What is meant?

    • Computers process structured data (databases, spreadsheets, etc)

    • Most data is unstructured. Easy for people, hard for machines

    • Studied since the 40s

    • Progress in large part due to large datasets (corpus linguistics) and ML

  • Applications

    • Part of speech tagging, identifying what's what in a sentence

    • information extraction, identifying named entities (locations, organizations), relations (Lincoln is a city in Nebraska) or times

    • Sentiment analysis, uncovering emotion/feeling in text

    • Question answering, taking natural questions, finding and returning an answer

    • Summarization, retrieving/reporting most important info in a document

    • Machine translation

    • dialogue, engaging in conversations with real users to find preferences/instructions and replying naturally

  • Approaches

    • Primarily leveraging ML (uses large training sets to build models)

    • Models used to classify/generate text, discriminating and predicting labels or output resembling training

  • Challenges

    • English is pretty tough!

    • Syntactically ambiguous

    • Semantically ambiguous

  • Overview

    • Topics introduced in Lecture

    • Homeworks for practice

    • Project is open-ended

    • Topics

      • $N$-gram models

      • Na\"ive Bayes & Semantic Analysis

      • Logistic Regression

      • Representations/Embeddings

      • RNNs & Attention

      • Information Extraction

      • QA

      • Summarization

      • Machine Translation

      • POS tagging

      • Dialogue Systems

Lecture 2: $N$-Gram Language Models

  • Many ways to model text

  • Many are probabilistic, modeling probability distributions over sentences

    • Compare probabilities to make decisions

  • Models trained from data sets (corpora) by counting occurrences and normalizing

    • Models based on $N$-grams, or word sequences of length $N$

    • Smoothing is used to prevent 0 probabilities

  • Overall: models, evaluation, and smoothing

  • Let $W = \langle w_{1}, \ldots, w_{n} \rangle$ be sequence of words

  • Goal is computation of joint probability

  • Related: Compute conditional probability of next word given prior words

  • Probabilistic model allows us to compute one of these

  • Recall the definition of conditional probability (the product rule), $P(A,B) = P(A)P(B | A)$

  • Chain rule of probability, $P(w_{1}, \ldots, w_{n}) = \prod_{i }P(w_{i} | w_{1}, \ldots w_{i - 1})$

  • Estimation is from counting and dividing (this is the frequentist approach), $C(W)$ counts occurrences in corpus $\mathcal{C}$

  • Problematic for long sequences: a long sequence may never occur in the corpus, giving probabilities that are zero or undefined

  • Instead, simplify and use bigrams!

    • First order Markov assumption – the probability of the next event depends only on the event that just happened

    • Generalizes to $n$th order assumption for tuples of $n + 1$ words

    • Simplest case is the unigram model, where words are treated individually; this is a 0th-order Markov model, where a word's probability is just its count over the size of the corpus

  • Berkeley Restaurant Project corpus has 9332 sentences and a vocabulary of 1446 words

  • Sometimes things have non-existent bigrams

  • Probabilities then, can be calculated easily. Note, include start/end tokens

  • Probabilities can eventually cause underflow when multiplying!

    • Instead, add logs of probabilities

    • Since we don't generally care about the probability values themselves, only their comparison, and comparison is preserved under logs

  • Works as long as $N$ isn't too large

    • But loses long-range dependencies!

    • Mostly an okay thing and long-range dependencies can be captured with RNNs

    • Toolkits and datasets: SRILM, KenLM, Google Books $N$-grams

  • Evaluation

    • Training set used to compute probabilities

    • Goal is to train such that model generalizes well!

    • Estimate using a separate test set, which is ideally unbiased

    • Performance on this is what we expect to see in practice

    • Don't test on training data!!!!!!!

    • Data is often split around 80/20 for training/test (90/10 also)

  • Common way to evaluate on test set

    • Perplexity $PP(W)$ is the inverse probability normalized by the number of words $N$, $PP(W) = \sqrt[N]{1/P(w_{1}, \ldots, w_{N})}$; for bigrams, similarly, $PP(W) = \sqrt[N]{\prod_{i} 1/P(w_{i} | w_{i-1})}$

    • Minimizing perplexity on test set, maximizes test data's probability

  • Models can also be used to generate text!

    • Generate a random first word $w_{1}$ according to $P(w_{1} | \langle s \rangle)$

    • Generate bigram according to probability, and continue until we generate an end token.

  • Larger $n$ valued models are more coherent, and it's important to select an appropriate corpus!

  • Valid $n$ grams may not appear in training sets!

    • Smoothing allows prior information to be incorporated into the model before data is used to update it

    • Laplace Smoothing

      • If zero probabilities come from zero counts, then ensure no counts are zero!

      • When counting $n$-grams, initialize each counter to 1 rather than 0 (pseudocounts)

      • Probability estimates are based on the modified counts, giving the Laplace form $P_{Laplace}(w_{i}) = \frac{C(w_{i}) + 1}{N + V}$, where $N$ is the number of tokens in the corpus and $V$ is the vocabulary size

      • Amount added can be $k \neq 1$ depending on weight of the prior

    • Backoff: use the trigram if its count is non-zero, else back off to the bigram or lower-order model

    • Interpolation, mixes unigram, bigram and trigram models

      • Fix constant parameters $\lambda_{i}$ to weight each term where $\sum_{i} \lambda_{i} = 1$

      • Or have values depend on context

      • Tune $\lambda$ on independent hold-out set/use cross-validation

    • Kneser-Ney Smoothing

      • Most common method

      • Based on absolute discounting

      • Redistributes probability mass from common to less common, simply subtracting $d = 0.75$ works well, and then add interpolation on unigrams.

      • Extends by including a term which models probability that $w$ will be in a novel continuation, how many words will pair with $w$

      • Final formula kinda long (slide 27)

    • Stupid Backoff

      • Stupid as it doesn't care about renormalizing to a true probability distribution

      • Back off to shorter $n$-grams if counts = 0, and don't renormalize since it takes too much time
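
As a concrete illustration of the counting, Laplace smoothing, log-probability, and perplexity ideas above, here is a minimal bigram language model sketch in Python; the toy corpus and all names are my own, not from the lecture.

  from collections import Counter
  from math import log, exp

  # toy corpus with start/end tokens (assumption, for illustration)
  corpus = [["<s>", "i", "want", "chinese", "food", "</s>"],
            ["<s>", "i", "want", "english", "food", "</s>"]]

  unigrams = Counter(w for sent in corpus for w in sent)
  bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
  V = len(unigrams)  # vocabulary size

  def p_laplace(w_prev, w):
      """Add-one smoothed bigram probability P(w | w_prev)."""
      return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

  def log_prob(sent):
      """Sum of log probabilities (adding logs avoids underflow from multiplying)."""
      return sum(log(p_laplace(sent[i], sent[i + 1])) for i in range(len(sent) - 1))

  def perplexity(sent):
      n = len(sent) - 1  # number of predicted tokens
      return exp(-log_prob(sent) / n)

  print(perplexity(["<s>", "i", "want", "chinese", "food", "</s>"]))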

Lecture 3: Na\"ive Bayes & Sentiment Analysis

  • Classification is core area of ML & NLP

  • Many tasks, includes spam filtering & sentiment analysis

  • Simple generative model, naïve Bayes, used for classification

Discriminative vs Generative

  • Discriminative models take an instance as input, and predict a label

    • Real-valued labels are regressions

    • Finite-set labels are classifications

    • Training comes from a training set with true labels (supervised learning)

    • Then searches set of possible models optimizing for some objective (minimizing prediction error, etc.)

    • Many model types

      • Decision trees

      • ANNs (discussed later)

      • SVMs

      • nearest-neighbor

  • Generative models define probability distributions

    • Generate new instances resembling training data

    • HMMs, Bayesian networks, probabilistic models, naïve Bayes

    • Trained via unsupervised learning, that is, without labels; note that if labels are available, a generative model can be trained for each class

    • Can be used to discriminate by estimating the probability that a given instance was generated by each class

Classification

  • discriminative models classify text into finite set of categories

  • Subject classification, spam filtering, author identification

  • Given training set output a classifier $\gamma: \mathcal{D} \rightarrow \mathcal{C}$ predicting labels of new documents

  • Representing each document is important

    • Option: each document is a bag of words with counts associated with them

      • Simple model, loses position information

      • May need to drop some things

Na\"ive Bayes

  • Recall Bayes' Theorem: $P(c|d) = \frac{P(d|c)P(c)}{P(d)}$

    • The probability of the class given the document

    • $P(c)$ is the prior probability of label

    • $P(d)$ is the probability of the document

    • $P(c|d)$ is probability of $c$ given $d$

    • $P(d|c)$ is probability of $d$ given $c$

  • Product Rule $P(A \land B) = P(A | B)P(B) = P(B | A)P(A)$

  • Sum Rule $P(A \lor B) = P(A) + P(B) - P(A \land B)$

  • Total probability

    • If events are mutually exclusive, and sum of probabilities = 1,

    • $P(B) = \sum P(B|A_{i})P(A_{i})$

  • MAP/Maximum a Posteriori classification

    • Given document, our goal is to find the most likely label

    • Done using argmax

    • $\mathrm{argmax}_{c \in C} P(c|d)$, or, assuming a feature representation, $\mathrm{argmax}_{c \in C} P(x_{1}, \ldots, x_{n} | c)P(c)$

    • $P(c)$ is easily estimated, fraction of documents that are class $c$

    • Estimating $P(x_{1}, \ldots, x_{n} | c)$ is harder, as the number of combinations is exponential, and there won't be sufficient data

  • Make the Naive Bayes Assumption!

    • $P(x_{1}, x_{2}, \ldots, x_{n} | c) = \prod_{i} P(x_{i}|c)$

    • Assumes, of course, that feature values are conditionally independent given the class

    • Classifier is $c_{NB} = \mathrm{argmax}_{c \in C} P(c)\prod_{i} P(x_{i}|c)$

    • Values needed are polynomial!

    • Standard method, and incredibly practical

    • Works well when you have moderate/large training sets

    • and when the attributes that describe instances are conditionally independent given the class

    • Learn Algorithm (See slide 15)

      • Estimate $P(c)$, call $\hat{P}(c)$

      • For each word, $w_{i}$ in vocabulary, estimate, $\hat{P}(w_{i}|c) = \frac{count(w_{i},c)}{\sum_{w\in V} count(w,c)}$

    • Classify:

      • Compute above $c_{NB}$ using $\hat{P}$

    • Smoothing done using Laplace smoothing with Pseudocounts!
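
A minimal sketch of the bag-of-words naïve Bayes learn/classify procedure above, with add-one pseudocounts; the tiny labeled training set is an illustrative assumption.

  from collections import Counter, defaultdict
  from math import log

  # toy labeled documents (assumption, for illustration)
  train = [("pos", "great fun great plot"),
           ("pos", "fun film"),
           ("neg", "boring plot no fun")]

  class_docs = Counter(c for c, _ in train)
  word_counts = defaultdict(Counter)
  for c, doc in train:
      word_counts[c].update(doc.split())
  vocab = {w for c in word_counts for w in word_counts[c]}

  def nb_classify(doc):
      """argmax_c log P(c) + sum_i log P(w_i | c), with add-one smoothing."""
      scores = {}
      for c in class_docs:
          log_prior = log(class_docs[c] / len(train))
          denom = sum(word_counts[c].values()) + len(vocab)
          log_like = sum(log((word_counts[c][w] + 1) / denom)
                         for w in doc.split() if w in vocab)  # skip unknown words
          scores[c] = log_prior + log_like
      return max(scores, key=scores.get)

  print(nb_classify("fun plot"))  # -> 'pos'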

Sentiment Analysis

  • Replace count-based bag-of-words with binary indicators

  • Negations have to be handled

Lecture 4: Na\"ive Bayes & Sentiment Analysis, Continued

Evaluating Classifiers

  • Goal: generalization, and estimating performance

  • But, other goals possible

    • Estimate generalization performance of model/hypothesis

    • Tuning parameters

    • Comparing two algorithms on specific task

  • Can't answer with perfect certainty

    • An estimator must be chosen, its distribution determined, and a bound (interval) placed on it

    • Must work regardless of distribution of training/test data

  • Statistical variation related to specific application is significant

  • Plan experiments well, don't test on training data!!

    • It biases the performance estimator

    • And don't let duplicates into training/testing sets if using data augmentation

    • Holds for validation set used for early stopping/tuning parameters!

  • Confidence intervals are used

    • $error_{\mathcal{D}}(h)$ is the 0-1 loss of hypothesis $h$ on instances drawn according to $\mathcal{D}$. If the test set has $N$ examples drawn independently (of $h$)

    • 0-1 loss is the probability that $h$ makes mistake

    • and $N \geq 30$

    • Then, with 95% probability, the true error lies in $error_{\mathcal{V}}(h) \pm 1.96\sqrt{\frac{error_{\mathcal{V}}(h)(1 - error_{\mathcal{V}}(h))}{N}}$

    • So, if the hypothesis misclassifies 12 of the 40 examples, with 95% confidence, the error is in that range.

    • Confidence interval can be generalized by changing real-number coefficient

      • Based on binomial distributions, mean error of an unbiased estimator

      • Can approximate binomials by normal distributions

      • And so, then, based on standard deviations in normal distribution

      • Can be justified due to the Central Limit Theorem

    • Steps

      1. Pick Parameter to estimate

      2. Choose Estimator

      3. Determine Probability Distribution

      4. Find appropriate interval
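
A small sketch of the interval computation above, reproducing the 12-of-40 example with the 1.96 coefficient (my own arithmetic helper, not code from the course).

  from math import sqrt

  def error_ci(mistakes, n, z=1.96):
      """Approximate 95% CI for the true error from test-set error (binomial -> normal)."""
      err = mistakes / n
      half = z * sqrt(err * (1 - err) / n)
      return err, (err - half, err + half)

  print(error_ci(12, 40))  # error 0.30, interval roughly (0.158, 0.442)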

Comparing Learning Algorithms

  • Can't just compare error rates on single test set

  • Use instead $K$-fold cross validation and paired $t$-test

K-fold cross-validation

  • Partition the data set into $K$ equal-sized subsets $X_{1}, \ldots, X_{K}$, each of size at least 30

  • For $i$ from 1 to $K$

    • use $X_{i}$ for testing

    • Let the training set be everything except $X_{i}$

    • Train each algorithm

    • Let $p_{i}^{j}$ be algorithm $j$'s error on the test set

    • $p_{i} = p_{i}^{1} - p_{i}^{2}$

  • Error difference is estimated as $p = \frac{1}{K}\sum_{i} p_{i}$

  • Estimate confidence that true error is < 0.

  • Confidence that $L_{1}$ is better than $L_{2}$ on the task

  • One-sided test with confidence from $t$ distribution with $K - 1$ DOF

  • Then calculate, true difference of expected error is $p + t_{c,K-1} s_{p}$

    • $s_{p} = \sqrt{\frac{1}{K(K-1)} \sum_{i=1}^{K} (p_{i} - p)^{2}}$

    • Then, check if $p < - t_{c,K-1} s_{p}$

  • Must understand that showing one learning algorithm performs better at 95% confidence still leaves a 5% chance of being wrong

    • Over multiple such comparisons this compounds: e.g., with four tests the chance that at least one conclusion is wrong is roughly 18.5% ($1 - 0.95^{4}$)

    • Must account for this, or use a more appropriate test!
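
A sketch of the $K$-fold paired difference test described above; the per-fold error lists are placeholders, and the one-sided 95% critical value for $K - 1 = 9$ degrees of freedom is hard-coded rather than looked up.

  from math import sqrt

  # placeholder per-fold error rates for two learners (assumption, for illustration)
  errA = [0.12, 0.15, 0.11, 0.13, 0.14, 0.12, 0.16, 0.13, 0.12, 0.14]
  errB = [0.14, 0.16, 0.13, 0.15, 0.15, 0.14, 0.17, 0.15, 0.13, 0.16]

  K = len(errA)
  p = [a - b for a, b in zip(errA, errB)]          # p_i = p_i^1 - p_i^2
  p_bar = sum(p) / K
  s_p = sqrt(sum((pi - p_bar) ** 2 for pi in p) / (K * (K - 1)))

  t_crit = 1.833   # one-sided 95% critical value for K - 1 = 9 degrees of freedom
  if p_bar < -t_crit * s_p:
      print("A has significantly lower error than B at 95% confidence")
  else:
      print("difference not significant")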

More specific measures

  • Prediction error may not tell the whole story!

  • Imbalanced test data (e.g., mostly negatives) may mean that accuracy isn't informative

  • Confusion matrices can be used instead

    • True class vs predicted class

    • Cells for true positives, false positives, true negatives, false negatives

    • Generalizes to many classes

    • Normalized to fractions and represented as heat map

  • Precision-recall curves

    • Precision is fraction of retrieved that are positive

    • recall is fraction of positives retrieved

    • Can vary threshold to trade precision for recall

    • Compare curves based on containment

    • More suitable than ROC curves when there are large numbers of negatives

    • Uses $F_{\beta}$ to combine at a specific point

Lecture 5: Logistic Regression

  • Recall Na\"ive Bayes, which is generative

  • Logistic regression is discriminative

  • Computes a weighted sum of an instance's features and passes it through a sigmoid or softmax, giving the probability that $x$ belongs to class $c$

  • Output is probability, performance measured with cross-entropy, a loss function

  • Optimize using stochastic gradient descent

  • Logistic models built into neural nets for deep learning

Sigmoid

  • Each instance is a vector $x = [x_{1}, x_{2}, \ldots, x_{n}]$ of features

  • Supervised training, each instance has label, $y \in \{0, 1\}$

  • Output of model on input is $P(y = 1 | x)$

  • Weighted sum $z = wx+b$, where $w_{i}$ is feature weight, $b$ is the bias term

  • Final output is $P(y = 1 | x) = \sigma(z) = 1/(1 + e^{{-z}})$

  • Discrete valued prediction thresholded at 1/2

  • Called logistic/sigmoid function

  • Squashes $z$ into $[0, 1]$! (sometimes in older stuff called a "squashing function")

  • Assume classifier is trained with given values and biases. (5.1.1 from textbook)

    • Compute weighted sum, run through logistic function, and compare to threshold

Training

  • $w$ and $b$ must be discovered

  • Given labeled training corpus, map documents to feature vectors, and attach label

  • Initialize $w$ and $b$ randomly!

  • Evaluate each training vector predicting $P(1|x)$ and use loss function $L$ to compare prediction to true label

  • stochastic gradient descent used to update $w$ and $b$, reducing $L$ on $x$

  • Repeat

Loss Functions

  • Many available

  • Cross-entropy used for classification/prediction of probabilities

  • $\hat{y} = \sigma(wx + b)$ is the prediction, $y \in \{0, 1\}$ is true label

  • $L_{{CE}}(\hat{y}, y) = -(y \log \hat{y} + (1 - y)\log(1 - \hat{y}))$

  • Since $y$ is binary, one term is $0$ and $L_{CE}$ is negative of log of probability of correct class. Larger predicted probability, the smaller the loss!

Stochastic Gradient Descent

  • Goal is to find params, $(w, b)$ or, $\theta$ overall and minimize average loss on entire set!

  • $\theta^{*} = \mathrm{argmin}_{\theta}\frac{1}{m} \sum_{i = 1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}; \theta)$

  • Since $L_{CE}$ is continuously differentiable by construction, gradient descent can be applied

  • Consider a linear unit: output is weighted sum of inputs with no $\sigma$ applied (weighted sum, bias, no sigmoid!)

  • If there's only one feature, this is linear regression, that is, finding the straight line that best fits the training data

    • Let pass through origin, $b = 0$

    • Slope is just a single parameter, $w_{1}$

    • $J(w_{1}) = \sum_{i}L(\hat{y}^{{(i)}}, y^{{(i)}}; w_{1})$

    • Goal is to minimize $J(w_{1})$

    • Writing generally, can be plotted and calculated.

    • Take derivative & solve for 0

  • Calculated using $w^{{\prime}} = w - \eta \nabla J(w)$

    • $\eta$ is the learning rate

    • Gradient, $\nabla J(w)$, is partial derivatives $\left[ \frac{\delta J}{\delta w_{i}}\right]_{i=1}^{n}$

      • $\frac{\delta J}{\delta w_{i}}$ is a partial derivative is how much a change in $w_i$ changes $J$

    • $J$ is error, compute gradient of $J$ with respect to parameters, apply gradient descent

      • Cross-entropy loss function with respect to each weight (see textbook for the derivation)

      • $x_{j}(\sigma(wx+b) - y)$ is the partial derivative for single instance

      • mean of partials for multiple instances

      • Plug into formula to get weight update

      • Update is smaller if prediction is closer

      • The larger the magnitude of feature $x_{j}$, the more responsibility $w_{j}$ takes for the error

      • If $\hat{y} > y$, $wx + b$ needs to decrease

  • Standard GD, mean of vectors across entire set

    • Estimate of the expected gradient on a random instance

    • Can require huge memory

  • Stochastic gradient descent randomly partitions the training set into minibatches of size $m$

    • Compute average per minibatch

    • Update per minibatch

    • Each pass through all is an epoch

    • Saves memory

    • Gradient estimates have high variance, avoid getting stuck in local minima in non-convex situations
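
A compact numpy sketch of binary logistic regression trained with minibatch SGD using the gradient $x_{j}(\sigma(wx+b) - y)$ given above; the data, learning rate, and batch size are arbitrary illustrative choices.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 5))                       # toy features (assumption)
  y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # toy labels

  w, b = np.zeros(5), 0.0
  eta, m, epochs = 0.1, 32, 20                        # learning rate, minibatch size

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  for _ in range(epochs):
      order = rng.permutation(len(X))                 # shuffle, then form minibatches
      for start in range(0, len(X), m):
          idx = order[start:start + m]
          xb, yb = X[idx], y[idx]
          err = sigmoid(xb @ w + b) - yb              # (sigma(wx + b) - y) per instance
          w -= eta * (xb.T @ err) / len(idx)          # mean of x_j * err over the minibatch
          b -= eta * err.mean()

  pred = sigmoid(X @ w + b) > 0.5                     # threshold at 1/2
  print("training accuracy:", (pred == y).mean())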

Regularization

  • Is perfect fit the best model?

  • Recall, the goal is to generalize well, there may be noise in data afterall

  • Perfect fits tend not to generalize well

  • Training accuracy vs simplicity

  • Want both accuracy and simplicity

  • $h$ is a model; training error on $\mathcal{X}$, $error_{\mathcal{X}}(h)$, is the sum of the loss function over the dataset divided by its size

  • Generalization error, $error_{\mathcal{D}}(h) = \mathbb{E}_{x \sim \mathcal{D}}[L(h(x), y)]$

  • Model $h$ overfits if there's an alternative model $h^{\prime}$ such that $error_{\mathcal{X}}(h) < error_{\mathcal{X}}(h^{\prime})$ and $error_{\mathcal{D}}(h) > error_{\mathcal{D}}(h^{\prime})$

Regularizers

  • When many params, more likely to overfit

  • Training adds regularizer, $R(\theta)$ to argmin, weighted with $\alpha \geq 0$

  • L2: the sum of squared weights (squared L2 norm)

  • L1: the sum of absolute weight values (L1 norm)

Multiclass

  • Done with "Softmax", when $> 2$, one set of weights and one set bias for each class

  • Then, $z_{c} = w_{c} x + b_{c}$

  • Probability distribution calculated using softmax, not $\sigma$

    • $\mathrm{softmax}(z_{c}) = \frac{\exp(z_{c})}{\sum_{c^{\prime} \in C} \exp(z_{c^{\prime}})}$

    • Cross-entropy generalizes to the multiclass case, with $y$ a distribution over classes

    • Labeled with one-hot vector

    • When training, $\delta(-\log \hat{y}_{c^{*}})/\delta w_{i}$ for each weight, update with SGD
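
A small sketch of the softmax step for the multiclass case above, with one weight vector and one bias per class; the weights and input are placeholder values.

  import numpy as np

  def softmax(z):
      z = z - z.max()                  # subtract max for numerical stability
      e = np.exp(z)
      return e / e.sum()

  # one weight vector w_c and bias b_c per class (toy values, assumption)
  W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])   # 3 classes x 2 features
  b = np.array([0.1, 0.0, -0.1])
  x = np.array([1.0, 2.0])

  z = W @ x + b                        # z_c = w_c . x + b_c
  print(softmax(z))                    # probability distribution over the 3 classes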

Lecture 6: Vector Semantics and Embeddings

  • Vector representations of words

  • Success of tasks depends on document/word representations

  • Bag-of-words representation from ch4, engineered features from ch5

    • BOW, word corresponds to a dimension in a $|V|$-dimensional vector

  • Dense vector representation

    • embedded representation, represents words in numeric vectors of smaller (10, 50, maybe more) dimensions

    • Want similar words to cluster

    • Similar contexts imply similar meanings

    • Learn embedding by grouping words based on context

  • Consider: lexical, vector semantics; words and vectors; cosine similarity; TF-IDF; Word2Vec

Lexical Semantics

  • How do words relate?

  • Synonymy – similar meanings

  • Antonymy – opposing meanings

  • Similarity – based on what is similar, "car" and "bicycle"

  • Relatedness – not similar, relating to each other, "coffee", "cup"

  • Semantic field – covering some sort of semantic domain

Vector Semantics

  • Want to numerically represent words such that those that share some sort of lexical semantics cluster

  • Because regression models (and NNs) require numeric inputs!

  • Trained model that predicts positive when vanish appears should respond similarly to disappear

  • Representations grouping words by context

  • Two-dimensional projection of a 60-dim embedding (t-SNE)

Words & Vectors

  • Term-Document matrix, $|V| \times |D|$ matrix. Column $d$ is $d$'s BOW representation

  • Even bag of words representation clusters!

  • Rows of TD matrices can represent words

    • Context then is the set of documents that it appears in and how often

    • Similar words will have similar representation

  • Word-Word or Term-Context matrix can be used ($|V| \times |V|$)

    • $(i,j)$ hold the number of times $i$ and $j$ appear in the same context

      • Context can be document, fixed window, etc.

Cosine Similarity

  • Compute similarity of words/documents

  • Often, the cosine of the angle \[\frac{v \cdot w}{\Vert v\Vert \Vert w \Vert} = \frac{\sum_{i = 1}^{N} v_{i} w_{i}}{\sqrt{\sum v_{i}^{2}}\sqrt{\sum w_{i}^{2}}}\]

  • Normalize vectors by dividing by length and take dot product of unit vectors

    • 1 if identical, 0 if orthogonal
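
The cosine similarity formula above, directly in numpy; the vectors are placeholders.

  import numpy as np

  def cosine(v, w):
      """Cosine of the angle between v and w: 1 if same direction, 0 if orthogonal."""
      return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

  v = np.array([1.0, 2.0, 0.0])
  w = np.array([2.0, 4.0, 0.0])
  print(cosine(v, w))   # 1.0 (same direction)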

TF-IDF: Term Frequency-Inverse Document Frequency

  • Must be careful about context definition

  • Some words appear frequently in any corpus

  • Can inflate similarity measures!

  • Weights terms by frequency, reducing weight if appearing in many documents

  • Term frequency is the log of the count of $t$ in $d$, plus 1 to avoid taking the log of zero \[\mathrm{tf}_{t,d} = \log_{10}(\mathrm{count}(t, d) + 1)\]

  • Log used to differentiate between orders of magnitude \[\mathrm{idf}_{t} = \log_{10}\left( \frac{N}{\mathrm{df}_{t}}\right)\]

    • $N$ is the number of documents

    • $\mathrm{df}_{t}$ is the number containing $t$

  • Weight of $t, d$ is $w_{{t,d}} = (\mathrm{tf}_{{t,d}})(\mathrm{idf}_{t})$

  • Word-word matrix then weight by tf-idf

  • Vector for term computed as normal

  • normalize/compute cosine similarity
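
A sketch of tf-idf weighting as described above; the toy documents are my own, and note how a term appearing in every document gets weight zero.

  import math
  from collections import Counter

  docs = ["sweet sweet nurse love", "sweet sorrow", "how sweet is love"]  # toy docs (assumption)
  tokenized = [d.split() for d in docs]
  N = len(docs)
  vocab = sorted({w for d in tokenized for w in d})
  df = {t: sum(1 for d in tokenized if t in d) for t in vocab}

  def tfidf(t, doc_tokens):
      tf = math.log10(doc_tokens.count(t) + 1)       # tf = log10(count + 1)
      idf = math.log10(N / df[t])                    # idf = log10(N / df_t)
      return tf * idf

  for d in tokenized:
      print({t: round(tfidf(t, d), 3) for t in vocab if t in d})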

Word2Vec

  • From Mikolov et al.

  • vectors from tf-idf long, sparse

  • Dense, short vectors preferred

    • Fewer params to learn!

    • Captures synonymy better!

    • Turns out, it works better in practice!

Intuition

  • Given $w$, rather than count how often it appears near a word

  • Predict how likely to appear near it

  • Trainable using logistic regression

  • Supervised, since any word appearing near another is a positive example

  • Negative sampling also needed

  • Weights can be used as embeddings

Training

  • Given $t$

  • and context words $c$ as positive example

  • Sample negative examples from the unigram distribution, with counts raised to a power $\alpha < 1$ (e.g., 0.75) and renormalized, so rarer words are sampled somewhat more often

  • Train with logistic regression, random weights

  • Final weights are given embeddings

  • Then, given $t$ and $c$, predict the probability they appear near each other. $t$ and $c$ are represented as embeddings; similarity is the dot product, and the probability comes from the logistic (sigmoid) function

    • Probability of a negative pair is 1 minus the above

  • Given positive and negative pairs, maximize the sum of the log probabilities of the positive pairs plus the sum of the log probabilities that the negative pairs do not co-occur

    • Optimizing finding an embedding maximizing $c \cdot t$, and minimizing $n \cdot t$ for all $t$

  • $w$ can be both target and context. Training gives two embedding matrices, a target matrix and a context matrix

    • These are the regression parameters, row $w$ from $T$ is $d$-dim representation of word, column $w$ of $C$ is $d$-dim representation

    • $T$ is often final embedding

    • concatenation could be option
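
A sketch of a single skip-gram-with-negative-sampling update consistent with the description above (dot-product similarity, sigmoid probability, positive pairs pushed together and sampled negatives pushed apart); vocabulary size, dimension, and the sampled indices are placeholders.

  import numpy as np

  rng = np.random.default_rng(0)
  V, d = 1000, 50                       # vocabulary size, embedding dimension (assumptions)
  T = 0.1 * rng.normal(size=(V, d))     # target embeddings
  C = 0.1 * rng.normal(size=(V, d))     # context embeddings
  eta = 0.05

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def sgns_step(t_idx, c_idx, neg_idx):
      """One SGD step on a (target, context) pair with k sampled negative contexts."""
      t = T[t_idx]
      # positive pair: push sigma(c . t) toward 1
      g = sigmoid(C[c_idx] @ t) - 1.0
      grad_t = g * C[c_idx]
      C[c_idx] -= eta * g * t
      # negative pairs: push sigma(n . t) toward 0
      for n in neg_idx:
          g = sigmoid(C[n] @ t)
          grad_t += g * C[n]
          C[n] -= eta * g * t
      T[t_idx] -= eta * grad_t

  sgns_step(t_idx=3, c_idx=17, neg_idx=rng.integers(0, V, size=5))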

Visualization

  • High dimension representations are hard to visualize

  • Hierarchically clustered dendrograms are an option

  • t-SNE projects to 2 dimensional space

Semantics

  • Embeddings can allow semantic arithmetic

    • $(king - man) + woman \approx queen$

  • Adjective to comparative to superlative

  • Country-capital common distance

  • Can reveal biases!

Miscellaneous

  • word2vec applied to Materials Science paper abstracts

    • Embeddings clustered in ways matching groupings in the periodic table

    • Predicted properties of materials before scientists discovered them

  • .*2vec exists and extends beyond text

    • node2vec embeds representation of graphs

    • Context must be defined for embedding

Presentations: Project Ideas

Deleting

  • Movie Critic Analysis

    • A form of sentiment analysis, but perhaps on a continuum

    • Training a model to discriminate between professional and amateur critics

    • IMDb and Stanford Sentiment Treebank datasets

  • Music Analysis

    • Why are some songs more popular than others

    • Why things related to ranking happen

    • Sentiment analysis on lyrics

      • predictive from sentiment, etc.

    • Datasets from Billboard charts

  • Professor Ratings

    • Trends, correlation with language

    • RateMyProfessor datasets, but most focus on correlation between numerical scores

Scooby Data Doo

  • Intent Classification

    • Assigns document intent

    • Useful for businesses to provide service, chatbots, automated assistants

  • Toxic Comment Classification

    • More interactions online, but people are kinda toxic

    • Understanding interaction is beneficial in making the internet a safer place

    • Filtering social networking sites

    • Understand how hate speech etc., spread through network sites

Google Translate 2.0

  • Sentiment Analysis

    • Marketing use

    • Employee satisfaction

    • real-time information

    • public opinion

    • spam detection

  • Dialog Generation

    • Goal-oriented

    • Conversational systems

    • Often a mix of the two

    • Customer Service

    • Chatbots

    • Blended approach

    • Pattern recognition, state machines, Markov processes, vector space, neural networks

  • Grammar Correction

    • Spelling, grammar, punctuation errors

    • Sequence tagging

    • CNNs

    • transformer-models

Proof By Intimidation

  • Taxonomy Translation

    • Experts use special, domain ontologies for efficient communication, jargon

    • Developing models that can be used to translate from experts to non-expert

    • For a domain, learn an ontology for experts; then a general, for laymen; find connections between the two

  • Common Sense

    • Reasoning and Validation in language

    • Uses understanding and generation

    • High accuracy, only text-to-text

    • Applicable in many areas: taxonomic reasoning, recommendation systems, chatbots, MT

Latural Nanguage

  • Speech processing

    • Recognizing spoken language

  • Semantic Analysis

    • Understanding meaning

  • Lemmatization

    • De-inflecting words, returning to original form

  • Text Classification

    • Categorization

  • Summarization

    • Simplify document while communicating core message

    • Not always generative

    • Removing redundant information

    • Abstracting/extracting

  • Sentiment Analysis

    • Exactly what's been previously covered

    • Aspect-based analysis, looks for multiple indicators/sentiments

Homo Bayesians

  • Data to text generation

    • auto-correct/text completion

    • Various sorts of neural models, BOW, RNNs w/ Attention, etc

    • Text to SQL, etc

  • Dialogue State Tracking

    • Component of dialogue system

    • Various ways of handling state

    • Belief Tracking

Bricc Squad

  • POS tagging

    • Solves other NLP problems

    • Helps with language learning

    • Meaning between homophones

    • Lexical

    • Rule-based

    • Probabilistic

    • Deep Learning

  • Missing Elements

    • Filling things in

    • Missing elements in foreign languages

    • Fused head identification (find a phrase missing its noun; rule-based or ML)

    • Fused head resolution (finding in context)

    • Bridging anaphora

    • Bridging anaphora resolution

Lecture 7: Neural Networks and Neural Language Models

  • Consider human neural nets

    • $10^{10}$ neurons

    • $10^{-3}$ s switching time for neurons vs. $10^{-10}$ s in silicon

    • Connections per neuron, $10^{4}$ to $10^{5}$

    • Scene recognition is incredibly short

    • 100 inference steps, which doesn't seem like enough

    • But computation is massively parallel

  • Goal is to simulate this

    • using neuron-like switching

    • weighted interconnections among units

    • highly parallel, distributed

    • Want to tune weights automatically

    • Many layers learn to extract features, rather than manually creating these features

    • Note: major differences between ANNs for ML and ANNs for modeling biology

  • Used

    • high dimensional input, discrete or real-valued

    • output can be discrete or real-valued

    • Output is vector

    • Data may be noisy

    • Target function form is unknown

    • Readability is unimportant, and long training is acceptable

  • History

    • Linear units, perceptron algorithms (1940)

      • couldn't handle data not linearly separable

      • Couldn't train multi-layer networks, but aware of usefulness

    • Multilayer networks trained with Backpropagation (1980)

      • Replaced with SVMs, Boosting in 1990s

    • Deep Architectures (2000s)

      • Better hardware, software support for deep networks

      • Uses backprop with larger datasets, algorithm improvements

        • improve performance considerably

      • Impressive applications

    • The Inevitable — Skynet, TBD

Basic Units

  • Takes weighted sum of input, $z$

  • Passes through non-linear activation function $f$

  • If $f$ is logistic, it's the same as logistic regression function

  • Single nodes are limited in what they represent

  • Linear Threshold Unit, the perceptron

    • Activation thresholds weighted sum at 0

    • Represents things like $\land$ well

    • Other functions aren't representable!

    • Because they're not linearly separable, XOR

    • Would be a network of units

The XOR Problem

  • XOR, intersection of two separators

    • $g_{1}(x) = 1x_{1} + 1x_{2} - 1/2$

    • $g_{2}(x) = 1x_{1} + 1x_{2} - 3/2$

    • So, $z_{i} = 0$ if $g_{i}(x) < 0$, and $z_{i} = 1$ otherwise

    • Now, $g(z) = 1z_{1} - 2z_{2} - 1/2$

  • Remaps vectors $x$ to $z$ so that they are separable in the new space (worked through in a sketch at the end of this section)

    • Input layer, output layer

    • hidden layer is $g_{1}$, $g_{2}$

    • Two-layer perceptron/feed-forward neural network

    • Non-linear activation in hidden output

    • Many dozens of hidden layers, several output nodes

  • Activation functions: $z$ is weighted sum

    • Linear unit

    • Sigmoid/Logistic

    • Softmax (generalizes sigmoid)

    • Tanh

  • Hidden nodes

    • Rectified Linear Unit, $\max(0, z)$

      • Good default

      • Variants to handle $<0$ differently
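
A direct check of the two-layer XOR construction earlier in this section, using the given weights for $g_{1}$, $g_{2}$, and $g$ with a hard threshold as the non-linearity; purely illustrative.

  def step(v):
      return 1 if v >= 0 else 0

  def xor_net(x1, x2):
      # hidden layer: z_i = step(g_i(x))
      z1 = step(1 * x1 + 1 * x2 - 0.5)     # OR-like separator g_1
      z2 = step(1 * x1 + 1 * x2 - 1.5)     # AND-like separator g_2
      # output layer: g(z) = 1*z1 - 2*z2 - 0.5
      return step(1 * z1 - 2 * z2 - 0.5)

  for a in (0, 1):
      for b in (0, 1):
          print(a, b, "->", xor_net(a, b))   # reproduces the XOR truth table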

Backprop

  • Training multiple layers means tuning parameters in all

  • Uses gradient descent minimizing a loss function

  • Cross-entropy is a common loss function

    • $L_{{CE}}(\hat{y},y) = -\sum_{{c \in C}}y_{{c}}\log\hat{y}_{{c}} = -\log\hat{y}_{{c^{*}}}$

  • Must compute gradient of loss function for each param, even several layers from output

  • Backprop feeds forward inputs to outputs, propagates back error via chain rule for derivatives

    • Simply, modularly decomposed

Computation Graph

  • Given complicated function, want to know partial derivative with respect to parameters

  • $f$ is represented via modular fashion

  • Decompose gradient into basic op

    • $\frac{\delta f}{\delta f} = 1$

    • Then take partials with respect to the edges leading in: $\delta f/\delta a = 1$, because the partial derivative of a sum with respect to each term is 1

    • Then, with respect to $x_0$ is $1w_0 = 3$ (via chain rule)

    • For $x = [1.0, 4.0]$, $\nabla_{w} f = [1.0, 4.0]$

  • Then, for $x = [1.0, 4.0]$ for $f(x) = \sigma(x)$, $\nabla_{w} f = [0.1966, 0.7866]$ (once it's done all the way through)

  • Modular, once we have a formula for gradient, we can apply anywhere in larger graph

  • Generalize to $J(w) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_{i}^{t} - y_{i}^{t})^{2}$

    • $w_{{5*}}$ and $w_{{6*}}$ tie to output units

    • Gradient/weight updates as before, then

      • $w^{{\prime}}_{{53}} = w_{{53}} - \eta\frac{\delta J}{\delta w_{{53}}}$

  • Multivariate chain rule says sum path from $J$ to $w_{{42}}$ – looks kinda ugly

    • Computationally simple! Compute for specific input

    • Modular form means once things are computed, plug values in and compute for earlier layers
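
A tiny hand-rolled computation-graph example in the spirit of this section: a forward pass through $f(x) = \sigma(w \cdot x + b)$ followed by a backward pass applying the chain rule node by node. The values are arbitrary and this is not the exact slide example.

  import math

  # forward pass: a = w1*x1, c = w2*x2, z = a + c + b, f = sigmoid(z)
  w1, w2, b = 3.0, 1.0, -1.0
  x1, x2 = 1.0, 4.0
  a = w1 * x1
  c = w2 * x2
  z = a + c + b
  f = 1.0 / (1.0 + math.exp(-z))

  # backward pass: start from df/df = 1 and apply the chain rule edge by edge
  df_dz = f * (1 - f)           # derivative of the sigmoid node
  df_da = df_dz * 1             # z = a + c + b, so partials w.r.t. a, c, b are 1
  df_dc = df_dz * 1
  df_db = df_dz * 1
  df_dw1 = df_da * x1           # a = w1*x1
  df_dw2 = df_dc * x2           # c = w2*x2
  df_dx1 = df_da * w1
  df_dx2 = df_dc * w2

  print("gradient w.r.t. (w1, w2, b):", (df_dw1, df_dw2, df_db))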

Backprop algorithm

  • Submit input $x$, fed forward to outputs

  • Compute network loss

  • Prop error to compute loss gradient with respect to each weight, update weights

  • Automatic in TensorFlow, automatic differentiation based on computation graph

Putting Things Together

Hidden Layers

  • How many?

  • Deep networks build potentially useful representation via composition

    • Layer is vector-valued, non-linear

  • Performance improvement isn't only from more complex networks

  • Increasing layers increases chance of overfitting, must have lots of data with deep networks, needs longer training

  • Boolean functions representable with two layers

    • Bounded, continuous function represented with arbitrarily small error with two layers

    • Function with arbitrarily small error in three layers

    • Only an existence proof (no construction given)

    • May require exponentially many nodes in a layer

    • Might not be able to find correct weights

    • Overfitting and need for regularization noted

Regularization

  • Prevents overfitting

  • Add penalty term $R(\Theta)$, e.g., the sum of the magnitudes of the parameters

  • Other methods:

    • early stopping – when performance worsens on a validation set

    • Dropout – ignore random subset of edges

    • Data augmentation – copies of random instances with noise

    • Inject noise into hidden layers

    • Batch normalization of output before input to next layer

  • Be careful not to underfit either (first fit or overfit, then regularize)

Initialization!

  • Early on, random numbers near 0, $\mathcal{N}(0, 1)$

    • Nearly linear for logistic function, so GD works better

    • In deep networks the variance grows layer by layer, leading to poor optimization due to vanishing gradients

  • Glorot initialization per layer: if a layer has $n_{in}$ inputs and $n_{out}$ outputs, initialize uniformly over $[-r, r]$ or from $\mathcal{N}(0, \sigma)$

    • $r = a \sqrt{\frac{6}{n_{in} + n_{out}}}$

    • $\sigma = a \sqrt{\frac{2}{n_{in} + n_{out}}}$

    • $a = 1$ for logistic, $a = 4$ for tanh, $a = \sqrt{2}$ for ReLu

Optimizers

  • Variants on gradient descent

  • Momentum optimization

    • Momentum term $\beta$ keeps updates moving in same direction

    • Update looks like $m \leftarrow \beta m + \eta\nabla J(w)$, then $w^{\prime} = w - m$

    • Moves through small local minima to better ones and along a flat

  • AdaGrad

    • Standard descends too quickly, then crawls too slowly

    • $w^{{\prime}} = w - \eta\nabla J(w) \oslash \sqrt{s + \epsilon}$

    • $s = s + \nabla J(w) \otimes \nabla J(w)$

    • circled ops are elementwise

    • $s$ accumulates the squares of the gradient, and the learning rate is scaled separately for each dimension

  • RMSProp

    • Exponentially decays old gradients to address this, adds $\beta$ to do so.

  • Adam

    • Combines Momentum and RMSProp

    • Uses an iteration counter to bias-correct the moment estimates, which start near zero
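
A sketch of the momentum and AdaGrad updates as written above, with elementwise operations; the gradient function and hyperparameters are placeholder assumptions.

  import numpy as np

  def grad_J(w):
      """Placeholder gradient of a simple quadratic loss J(w) = 0.5 * ||w||^2 (assumption)."""
      return w

  w_mom = np.array([1.0, -2.0])
  m = np.zeros(2)
  w_ada = np.array([1.0, -2.0])
  s = np.zeros(2)
  eta, beta, eps = 0.1, 0.9, 1e-8

  for _ in range(100):
      # momentum: accumulate a velocity term and step along it
      m = beta * m + eta * grad_J(w_mom)
      w_mom = w_mom - m
      # AdaGrad: accumulate squared gradients, scale the learning rate per dimension
      g = grad_J(w_ada)
      s = s + g * g
      w_ada = w_ada - eta * g / np.sqrt(s + eps)

  print(w_mom, w_ada)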

Language Models

  • Recall language model can take input sequence, and predict probability over next words

  • Feed-forward does the same, takes input sequence, outputs probabilities

  • word2vec embeddings used

  • Pre-trained (faster) or trained with network (may be better)

  • Can fine-tune a pre-trained embedding to save time

  • Fixed word sequence input

    • Embedding learned with the model (or pre-trained, as above)

    • Softmax output compared to the target word

    • Trained with cross-entropy loss

Summary

  • Neural units take weighted sum, and applies non-linear activation

  • Multilayer networks used because single unit is limited

  • hidden layers allow learning of own representations – nonlinear feature extraction

  • Stochastic GD used for training

    • Calculated using back prop

  • Neural models predict word after sequence of words

    • Can use pre-trained embeddings or learn their own.

Lecture 8: Sequence Processing with Recurrent Networks

  • Earlier required fixed-size input

  • RNNs work on sequences, like text, biological sequences, video, audio, etc

  • Can create novel output, text in a specific style, music, etc

Basics

  • Recurrent cells have connections going backward as well as forward

    • At $t$, neuron receives input $x_{t}$ and its output $y_{{t - 1}}$

  • Recurrent layer can be built, where each node gets the vector $x_{t}$ and the vector $y_{{t - 1}}$ (that is, the whole layer's output)

  • Output depends on past, that is, memory/state

    • State at $t$ is $h_{t} = g(h_{t - 1}, x_{t})$, and output is $y_{t} = f(h_{t})$

    • $h_{t}$ then stores important information about input

    • Forward pass – evaluate forward by unrolling in time
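
A minimal numpy sketch of unrolling a simple recurrent layer forward in time, matching $h_{t} = g(h_{t-1}, x_{t})$ and $y_{t} = f(h_{t})$ above; weights and inputs are random placeholders.

  import numpy as np

  rng = np.random.default_rng(0)
  d_in, d_h, T_steps = 4, 3, 5
  Wx = rng.normal(size=(d_h, d_in)) * 0.5   # input-to-hidden weights
  Wh = rng.normal(size=(d_h, d_h)) * 0.5    # hidden-to-hidden (recurrent) weights
  Wy = rng.normal(size=(2, d_h)) * 0.5      # hidden-to-output weights
  b, by = np.zeros(d_h), np.zeros(2)

  xs = rng.normal(size=(T_steps, d_in))     # an input sequence
  h = np.zeros(d_h)                         # initial state
  for x_t in xs:
      h = np.tanh(Wx @ x_t + Wh @ h + b)    # h_t = g(h_{t-1}, x_t)
      y_t = Wy @ h + by                     # y_t = f(h_t)
      print(y_t)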

Training

  • uses backprop through time

  • Unroll through time, use backprop on expanded graph

  • Forward pass batch of sequences through network yields output

  • Output sequence evaluated using cost

  • Gradients propagated backward through unrolled network and params

  • Sequence loss, comparing two sequences

    • weighted average of cross entropy across sequence, which can target certain parts of sequence

  • BPTT means gradient flows farther, can "vanish" or "explode"

    • Can happen with any network, RNNs very susceptible

    • Clipping gradients to the range $[-1, +1]$ mitigates explosion

    • Can also do batch normalization

Applications

Recurrent NL Model

  • Can use recurrent architecture to model language

  • Can predict the next word in a sequence given the current word & previous hidden state

  • Train by feeding in a sequence of words, output a softmax over the next word, train with cross-entropy

Generating Sentences

  • Feed in a start, generate output, output fed into network as next input, continue until end or length limit

    • Hidden state fed in with random/all zeros

Labeling

  • Given sequence of words, categorize each

  • Softmax/argmax over classes, POS tagging, named entity recognition

  • Can also classify an entire sequence, called a sequence to vector model

    • ignore all output except last

    • Compute loss on final output and apply BPTT

Stacked & Bidirectional RNNs

  • Can stack RNN layers on each other

  • bidirectional – process separately from end towards start

    • run in opposite directions, outputs combine in some way (concat, add, etc)

    • Can be used for sequence classification

LSTM & Gated Units

  • Long-term memory – vanishing & exploding gradients can be problematic

  • Can suffer from long training times with long sequences

  • First inputs have diminishing impact

  • Add "long-term" memory

  • Accumulates information, and can decide to forget

LSTM

  • Vectors, $h_{t}$ is short-term, $c_{t}$ is long-term state

  • At $t$, some memories from $c_{{t - 1}}$ can be forgotten in forget gate, and new ones from input gate are added

  • The result is sent on as $c_{t}$; $h_{t}$ and $y_{t}$ come from processing the long-term state through the output gate

  • $g$ combines input with previous state

  • $f, i, o$ are gate controllers, vectors

  • $f$ has size $n$, with elements close to 0 or 1

  • $i$ controls remembering $g$

  • $o$ controls what long term goes out and to hidden

  • The network learns what to remember based on the input and the last hidden state

  • A peephole connection can be added, letting the previous long-term state affect the gate controllers
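
A numpy sketch of one LSTM step following the gate description above ($f$, $i$, $o$ controllers and candidate $g$ producing $c_{t}$ and $h_{t}$); all weights are random placeholders and no peephole connections are included.

  import numpy as np

  rng = np.random.default_rng(1)
  d_in, d_h = 4, 3
  # one weight matrix and bias per gate/candidate, acting on [x_t, h_{t-1}]
  Wf, Wi, Wo, Wg = (rng.normal(size=(d_h, d_in + d_h)) * 0.5 for _ in range(4))
  bf = bi = bo = bg = np.zeros(d_h)

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev):
      v = np.concatenate([x_t, h_prev])
      f = sigmoid(Wf @ v + bf)          # forget gate: what to drop from c_{t-1}
      i = sigmoid(Wi @ v + bi)          # input gate: how much of g to remember
      o = sigmoid(Wo @ v + bo)          # output gate: what long-term state to expose
      g = np.tanh(Wg @ v + bg)          # candidate combining input and previous state
      c_t = f * c_prev + i * g          # new long-term state
      h_t = o * np.tanh(c_t)            # new short-term state (also the output y_t)
      return h_t, c_t

  h, c = np.zeros(d_h), np.zeros(d_h)
  h, c = lstm_step(rng.normal(size=d_in), h, c)
  print(h, c)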

GRU

  • Simplified version of LSTM

  • Merges long-term and short-term states

  • Merges forget and input controls

Words, Subwords, Characters

  • Character level modeling sometimes used

    • Can be helpful to have more detailed information

    • Can use other RNNs to process at finer level and combine with word-level embeddings

  • Can be done using a bi-RNN over characters, concatenating its outputs, giving a character-level word embedding

Lecture 9: Encoder-Decoders, Attention and Contextual Embeddings

  • Text generation of model generalized/applied (Machine Translation)

  • Encoder-decoder architecture

    • Encoder takes input, produces embedded representation, context

    • Decoder generates output from context

  • Generalized further by going from context vector to sequence contexts and attention mechanism

Text Generation

  • NLM can generate text sequence

  • Must feed start symbol

    • $y_{{t}} = f(h_{t})$, $h_{t} = g(h_{t - 1}, x_{t})$, $x_{t} = y_{t - 1}$ (this is an autoregressive model)

  • Can generalize from a prefix (NLM completes existing sequence)

    • Prefix input triggers hidden state vectors, outputs from prefix ignored

    • hidden state captures context

  • Can be used for MT

    • NLM trained on bitexts – sentence concatenated with translation

    • Feed sentence 1 then generate translation

Encoder-Decoder Network

  • Previous model generalizes to encoder-decoder

  • Encoder input generates states

  • Context vector is a function of hidden states, the embedded representations

  • Decoder takes $c$ and creates new states and outputs

  • Encoder can be any RNN, stacked biLSTM, etc (same with decoder)

    • $c$ is, e.g., the final-step, top-layer encoder hidden states concatenated

    • Initialise decoder's state = c

    • Decoder state is function depending on output, previous state, context

Attention

  • Context vectors passed from encoder to decoder are embedded representation of input sequence

    • $c$ represents the entire input

    • Can't ensure context vector is sufficiently large to represent context!

    • Can use attention mechanism

      • context becomes sequence of vectors

      • Decoder focuses on relevant subset

  • Static context vector becomes sequence of context vectors

    • Uses a score function, $score(h_{i - 1}^{d}, h_{j}^{e})$, giving the relevance of encoder state $j$ to the current decoder state; often a dot product or a bilinear form with weights that can be learned

    • Concatenate scores over j into relevance vector

    • Normalize the scores with a softmax, then take the weighted sum over $j$ of the encoder states

  • Architecture

    • Compute context vector at each step

  • Attention $\alpha_{{ij}}$ shows probability that output $y_{i}$ is aligned to a given input $x_{j}$
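
A sketch of attention for one decoder step, as described above: score each encoder state against the previous decoder state, softmax the scores, and take the weighted sum as the context vector. Shapes are placeholders, and a plain dot product stands in for a learned bilinear score.

  import numpy as np

  rng = np.random.default_rng(2)
  enc_states = rng.normal(size=(6, 8))      # h^e_j for j = 1..6 (placeholder encoder states)
  dec_state = rng.normal(size=8)            # h^d_{i-1} (previous decoder state)

  scores = enc_states @ dec_state           # score(h^d_{i-1}, h^e_j) as dot products
  alpha = np.exp(scores - scores.max())
  alpha = alpha / alpha.sum()               # softmax over encoder positions
  context = alpha @ enc_states              # c_i = sum_j alpha_ij * h^e_j

  print(alpha)      # attention weights: where the decoder "looks"
  print(context)    # context vector fed into decoder step i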

Lecture 9: Information Extraction

  • nat lang is unstructured, difficult to analyze automatically

  • IE identifies important information and puts it into structured form

    • named entity recognition (identify type of nouns)

    • relation extraction (ident. semantic relations among entities)

    • event extraction (ident events and organize temporally)

    • template filling (analyze docs, fill out standard forms)

Named Entity Recognition

  • Named entities are anything referred to by a proper name

  • Requires identifying what text is the entity (using a span of text), and then classifying it

    • Can classify as people, organizations, locations, geo-political entities, facilities, etc.

  • IO/IOB tagging

    • IO uses inside/outside tags; IOB adds a beginning tag

Neural Method

  • Character level embeddings, GloVe word embeddings

  • BIDI LSTM outputs concatenated and fed into softmax

  • Output is distribution over IOB labels

  • Ignores dependencies

    • Conditional random field used to constrain outputs

    • Probabilistic graphical model, edges denote influence

  • $x$ is input, $y$ is possible predictions,

    • $s(x, y) = \sum_{i} A_{y_{i},y_{i + 1}} + \sum_{i} P_{i,y_{i}}$

    • $P$ is the softmax probability, $A$ holds transition scores from tag $i$ to tag $j$

    • Then, calculate softmax over scores

Evaluation

  • Recall, Precision, $F$ Measure used

  • Be careful with interpreting results: what may initially appear as one error can be multiple

Relation Extraction

  • Comes after NER

  • Relations depend on domain

  • Classes exist for each entity

  • Then must learn to classify and connect relations

  • Ordered tuples, many different forms of notation

  • Relations depend on domain, and many have specific models

Pattern-Based

  • Uses static patterns to recognize relations

  • uses templates, and can be very precise, because it can be easily customized for a domain, but may lack in recall

Supervised

  • Given a trained classifier for each

  • Enumerate all entity pairs and find those that are related

  • Classify relation of related pairs

  • Classifier can be whatever works for the domain

  • What features?

    • Headwords/concatenations

    • Bags-of-words, bigrams

    • NER types

    • Syntactic models

Semi-supervised

  • Labeled data is scarce, use seed patterns/seed tuples

  • Build a bootstrap relation set

  • Then use similarity to find new, high-confidence tuples, for a given relation

  • Extract patterns using these tuples

  • Weight patterns with confidence based on known tuples

Distant Supervision

  • Use a database like DBPedia, acquires large number of seeds

  • Identify matching sentences, extract rich feature set

Extracting Times

  • Similar to NER, can differentiate between absolute, relative and durations

  • Rule-based, sequence labeling

  • Normalized with respect to specific date for calculation/comparison

  • ISO standard provides XML tags

Event Extraction

  • Ident events in text, assignable to points in time

  • IOB tagging via supervised learning

  • Tends to be built on verbs

  • Features depend on many things

  • Timing information can provide temporal ordering, using normal relation extraction

  • Time-bank corpus can be used

Template Filling

  • Some situations/events have standard information

  • Templates represent these, with slots to fill

  • Recognize templates, then extract and fill slots.

Lecture 10: Question Answering

  • Factoid questions – answers are simple facts in short text expressions

  • IR-Based – search for span of text to return answer

  • Knowledge-based builds a semantic representation and uses deduction to answer from a KB

  • Hybrid uses a combination of the two

Information-Retrieval Based

  • Given question, find short snippet from a collection

  • Process question for IR to analyze answers

    • Identify relevant documents

    • Extract passages

    • Filter to identify relevant passages

  • Must also do answer-type detection, decides what the answer will look like in the end

  • Question Processing

    • Identifies a query, and other information

    • Identifies focus

    • Identifies question type

    • Extracts tokens to submit

      • query is tf-idf and cosine similarity, can be enhanced with bigrams

      • If documents are limited, query expansion can be used (adding terms with a thesaurus, varied suffixes)

    • Answer types look at Named Entity types, filter IR on that

      • Can use taxonomy of types (WordNet)

      • Classifier trained, uses embeddings of words, etc

  • Passages come from ranked set

    • identify relevant paragraphs/sentences

    • partition documents into passages

    • NER used for filtering

    • Rank with learned model

  • Answers extracted with Span labeling

    • Return matching named entity if answer type is specified

    • Problem when answer type is unavailable/non-existent (consider definition questions)

      • Use question keywords matching

      • words not in query

      • Length of sequence of keywords in candidate

      • hand-crafted patterns matching candidate answers

    • Neural Methods

      • DNN, uses SQuAD; computes embeddings for questions and embeddings for tokens

      • Compare embeddings, uses reading comprehension tasks

      • SQuAD dataset

      • Question $q$, passage $p$; $p_{s}(i)$ is the probability that $p_{i}$ starts the answer span, $p_{e}(i)$ the probability that $p_{i}$ ends it

      • Similarity scores used to predict start/end of answer span

      • Bi-LSTMs with attention, for question and passage; similarity scores used

      • $E$ is glove embedding

      • Input sequence of embeddings to get states

      • Combine into a query vector, being a weighted softmax sum, with $b$ and $w$ being learned weights

      • Passage embedded to vector sequence compared to $Q$

        • Computed similarly, using GloVe embeddings and POS/NER features

      • Align attention between question and passage

Knowledge-Based

  • Answers in structured DB, no need for IR

  • Must be able to create a query

  • Requires Semantic Parsing to form SQL or predicate Calculus

  • Can be done using hand-written features or learned

  • Supervised

    • Dependency parsing, has things like tmod, nsubj

    • Uses labeled examples (question-query pairs)

    • learn general rules from inductive logic programming

  • Semi-supervised

    • distant supervision/open information extraction

    • REVERB is an Open extractor

    • Extracts triples of strings and aligns to things like FreeBase (or other structured relations)

    • Match subject and argument to relations

    • Then learn mapping between natural language and FreeBase

DeepQA

  • IBM Watson!

  • Clinical Decision Support

  • Uses both methods

  • Many candidates proposed and scored by evidence

  • Stages

    • Question Processing

    • Answer generation

    • Answer Scoring

    • Confidence merging/ranking

  • Parsing, NER, relation extraction

  • Focus, answer type, question classification

  • Generate answers from structure/text sources

  • Queries using extracted relations

  • Eliminates stopwords, upweights terms from focus

  • Generates score vector from type, time/space relations, supporting evidence

  • Merge based on equivalence/combine scoring features

Evaluation

  • Mean Reciprocal Rank

    • Assume questions answers provided

    • If system under test returns list of answers, score is reciprocal of rank of first correct

    • Rank is rank of highest correct answer

  • Exact Match – fraction of predictions that exactly match a gold answer

  • $F_{1}$ score – harmonic mean of precision and recall

  • Datasets like SQuAD, TREC QA, TriviaQA, WikiQA
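
Small sketches of mean reciprocal rank and a token-overlap $F_{1}$, in the spirit of the measures above; the example answers are placeholders and real evaluations (e.g., SQuAD) normalize text more carefully.

  def mrr(ranked_answers_per_q, gold_per_q):
      """Mean reciprocal rank: 1/rank of the first correct answer per question (0 if none)."""
      total = 0.0
      for ranked, gold in zip(ranked_answers_per_q, gold_per_q):
          rr = 0.0
          for rank, ans in enumerate(ranked, start=1):
              if ans in gold:
                  rr = 1.0 / rank
                  break
          total += rr
      return total / len(gold_per_q)

  def span_f1(pred_tokens, gold_tokens):
      """Token-overlap F1 (unique tokens) between a predicted and a gold answer span."""
      common = len(set(pred_tokens) & set(gold_tokens))
      if common == 0:
          return 0.0
      p = common / len(pred_tokens)
      r = common / len(gold_tokens)
      return 2 * p * r / (p + r)

  print(mrr([["a", "b", "c"], ["x", "y"]], [{"b"}, {"z"}]))   # (1/2 + 0) / 2 = 0.25
  print(span_f1("the eiffel tower".split(), "eiffel tower".split()))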

Lecture 11: Summarization

  • Similar to QA, but distills important information

    • Outlines

    • Extracts

    • Abstracts

    • Summary of threads

  • Single/Multiple documents

  • Generically, or focused on query

  • Steps

    1. Content Selection – Chose text to use

    2. Information Ordering – Decide how to order

    3. Sentence Realization – turn into readable sentences

Single-Document

  • Single Document

  • Headline/outline brief extract

Content Selection

  • Binary classifier (important vs unimportant)

  • Unsupervised using salience

    • TF-IDF weight of term in document (tf: log of count + 1; idf: log of number of docs over document frequency)

    • Log likelihood ratio, $\lambda(i)$: the difference in how likely the term is in the input documents vs. in a background corpus (explicitly probabilistic); if $-2\log(\lambda(i)) > 10$ (or another empirically chosen threshold), the term is significant

  • Supervised

    • Training set with human-created summary labels

    • Extracts: label sentence with 1 if in summary

    • Extract features and train classifier

    • But this requires that the summaries be extracts, or that source and summaries be aligned to find the best matches

      • Longest common subsequence, edit distance, hidden Markov models can be used for this

  • Features can be labeled with things like

    • position where sentence appears

    • Cue phrases

    • Informativeness, topic signature

    • Sentence length

    • Cohesion – relatedness

Information Ordering & Sentence Realization

  • Single document preserves order of sentences in order of the document

  • Simplify/compress extracted sentences removing less relevant to summary

  • Rule-based or Machine-Learned

Multi-Document

  • A group of related documents

  • News stories on an event, etc.

Content Selection

  • Unsupervised approaches best (little data compared to single-document)

  • Similar methods

  • But, must be aware of redundancy across multiple documents

  • Can add a redundancy penalty term to each sentence, based on, e.g., cosine similarity or alignment with already-selected content

  • Clustering related sentences and choosing center is also possible

Information ordering

  • Can't sequence in order of appearance in a document

  • News articles or documents with dates can sort chronologically

  • Must maintain coherence

    • Adjacent sentences: $s_{1}$ causes $s_{2}$

    • Lexical chains: synonyms/hypernyms

    • Consistency: pronoun reference to same noun consistent and unambiguous

  • For each sentence, pairwise compatibility is a graph problem

    • Sentences are nodes, edge weights are compatibilities

    • Permutation maximizing sum of compatibilities

    • Form of TSP – intractable, heuristics exist

Sentence Realization

  • Still might need to improve coherence

  • Coreference resolution, apply some rewrite rules

Focused

  • Summary in response to particular queries

  • QA but for non-factoid QA

  • Many documents potentially

  • Overlap with query, cosine similarity

  • Specialized systems for query type

  • Information Extraction to identify important information

Evaluation

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) N-gram overlap between generated summaries and ground-truths

  • Gross equation (slide 20)

  • For each bigram, how many times did generated summary use it?

  • Normalized with total bigrams in reference

  • A recall measure

  • Rouge-L measures longest-common subsequence

  • Rouge-SU measures skip bigrams (additional words between pairs)

  • The pyramid method scores summary content units (units of meaning), weighting the more relevant ones higher
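
A sketch of ROUGE-2 as a recall measure, in line with the description above: count the reference bigrams the generated summary reproduces (clipped), normalized by the total reference bigrams; the exact slide formula may differ.

  from collections import Counter

  def bigrams(tokens):
      return Counter(zip(tokens, tokens[1:]))

  def rouge2_recall(candidate, reference):
      cand, ref = bigrams(candidate.split()), bigrams(reference.split())
      overlap = sum(min(cand[bg], ref[bg]) for bg in ref)   # clipped bigram matches
      return overlap / sum(ref.values())                    # normalized by reference bigrams

  print(rouge2_recall("the cat sat on the mat", "the cat lay on the mat"))   # 0.6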

Lecture 12: Machine Translation

  • challenging, complex

  • Must be quick, compact, and often deals with low data transmission

  • Many variations despite fundamental similarities

  • Can't translate word-for-word, often need to add, align, delete

  • Different kinds

    • Rough, basic ideas, web search, tourism

    • Human post-editing for high-volume computer-aided human translation (CAHT)

    • Fully automatic high-quality

Classical Machine Translation

  • Direct – word-by-word through input text

  • Transfer – parse, translate structure, generate output

  • Interlingua – parse input, translate to intermediate representation, generate output

  • Vauquois Triangle captures approaches

Direct

  • Morphological analysis, getting word forms

  • Dictionary translates individually

  • Reorder by rules

  • Morphological generation, ensures correct form

  • Ignores structure, idiom, etc

Transfer

  • Alter structure to conform

  • Leverages contrastive knowledge, differences between two specific languages

  • Steps

    1. Analysis

    2. Transfer

      • Syntactic, fixes syntax

      • Lexical, meaning from input to output

    3. Generation

Interlingua

  • Language-independent form

  • Meaning is in a standard form

  • Semantic analyzer maps inputs to this

  • Representation could be predicate calculus, semantic decomp, event-based representations, etc

Statistical Machine Translation

  • A "noise channel model"

    • Probabilistic, outputs most probable translation

    • Given sentence $F$, $P(E|F)$ is probability output sentence is correct translation

    • Search for argmax $E$, by Bayes's rule

    • Probability of sentence $P(E)$ is from language model

    • $P(F|E)$ is from translation model

    • A decoder is used to output $E$

    • $F$ is a noisy translated $E$, and the goal is to recover $E$

Models

  • Phrase Based

    • Give $E$ and $F$, could go word by word, instead, use phrases

  • Group words into phrases; $I$ is the number of phrases in the English sentence, $J$ the number of output phrases

  • Reorder

  • $\phi(f|e)$ is translation probability

  • $d(a_{i} - b_{i-1}) = \alpha^{|a_{i} - b_{i-1} - 1|}$, where $\alpha$ between 0 and 1 is the distortion parameter, penalizing how far phrases move

  • Translation probability is the product over phrases of $\phi(f_{i}|e_{i})\, d(a_{i} - b_{i-1})$

    • Must train $\phi$!

    • Can be done with count/normalize method

    • Can count if training corpus includes translations and phrase alignments

    • Word alignments between inputs and outputs could also be used as phrase mappings

    • HMMs are generative models for sequences

    • Can use the Viterbi algorithm to find the most likely state sequence, and the Forward algorithm to compute a sequence's probability

    • Alignment Problem with $I$ states for each word, emission distribution is over output words

    • Most likely path generating sequence yields alignment

    • Symmetrization used for multi-word phrase alignment

    • All states are bidirectionally connected, including self-loops, each with a transition probability

    • Transition probabilities are based on a non-negative function $c$, on a jump width $i - i^{{\prime}}$

      • Neighboring words in input tend to neighbor in output

      • State-sequences not known

      • The Baum-Welch (EM) algorithm is used to train HMMs when state sequences are unknown

        • Starts with arbitrary values

        • Estimate expected counts

        • Use expected to re-estimate

        • Repeat until convergence
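
A minimal sketch of scoring one phrase segmentation with $\phi$ and the distortion $d$ in log space, as in the product above; the phrase table and alignment positions are toy values of my own:

```python
import math

ALPHA = 0.5  # distortion parameter, 0 < alpha < 1

def distortion(a_i, b_prev):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|."""
    return ALPHA ** abs(a_i - b_prev - 1)

def phrase_score(aligned_phrases, phi):
    """Sum of log phi(f|e) + log d(...) over aligned phrases.

    aligned_phrases: list of (f_phrase, e_phrase, start, end) in output order,
    where start/end are the source-side positions covered by the phrase."""
    log_p, b_prev = 0.0, 0  # b_0 = 0 by convention
    for f, e, start, end in aligned_phrases:
        log_p += math.log(phi[(f, e)]) + math.log(distortion(start, b_prev))
        b_prev = end
    return log_p

# Toy phrase table with hypothetical probabilities
phi = {("maria", "mary"): 0.9, ("no dio una bofetada", "did not slap"): 0.4}
print(phrase_score([("maria", "mary", 1, 1),
                    ("no dio una bofetada", "did not slap", 2, 5)], phi))
```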

Decoder

  • Takes input $F$ and outputs the translation $E$ that maximizes the score

  • Searches through candidate space

  • Can't use gradient descent, as the search space is not continuous

  • Exponential number of candidates, so beam search/stack decoding is used for efficiency (see the sketch after this list)

  • Initialize the stack with the null (empty) hypothesis

    • Pop the best hypothesis

    • If complete, return it; else

    • For each expansion, score and push it onto the stack

  • Cost is a combination of the probability so far and an approximate cost of completion
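
A minimal sketch of stack decoding on a toy, monotone word-by-word problem: pop the best partial hypothesis, return it if complete, otherwise expand, score, and push. The translation options and the optimistic completion-cost estimate are toy assumptions of my own:

```python
import heapq, math

# Toy translation options per source word: (translation, log prob). Hypothetical values.
OPTIONS = {
    "la":    [("the", math.log(0.7)), ("it", math.log(0.1))],
    "casa":  [("house", math.log(0.8)), ("home", math.log(0.2))],
    "verde": [("green", math.log(0.9))],
}

def future_cost(source, pos):
    """Optimistic estimate of completion: best log prob for each uncovered word."""
    return sum(max(lp for _, lp in OPTIONS[w]) for w in source[pos:])

def stack_decode(source):
    # Each entry: (-(score so far + future cost), next source position, partial output)
    heap = [(-future_cost(source, 0), 0, [])]      # start from the null hypothesis
    while heap:
        neg_total, pos, output = heapq.heappop(heap)   # pop the best hypothesis
        if pos == len(source):                         # complete: return it
            return " ".join(output)
        score = -neg_total - future_cost(source, pos)  # recover the score so far
        for word, lp in OPTIONS[source[pos]]:          # expand, score, push
            new_score = score + lp
            heapq.heappush(heap, (-(new_score + future_cost(source, pos + 1)),
                                  pos + 1, output + [word]))

print(stack_decode(["la", "casa", "verde"]))  # "the house green"
```

A real decoder also handles phrase segmentation and reordering; this sketch only shows the search loop.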

Evaluation

  • Human-based: people evaluate based on fluency, fidelity, how much is necessary to "fix" output

  • BLEU

    • Given output and reference, measure closeness

    • Weighted average of $N$-gram overlaps with reference translations

    • Modified precision: each $N$-gram's count is clipped by its maximum count in the reference

    • $N$-grams for $N$ from 1 to 4

    • A brevity penalty is also applied, penalizing outputs shorter than the reference (see the sketch after this list)

    • Easy to calculate, widely used

    • Doesn't consider meaning

    • Doesn't handle syntax

    • Not as good as human judgement
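
A minimal single-reference sketch of BLEU's pieces as described above (clipped modified precision for $N = 1..4$, geometric mean, brevity penalty); real implementations handle multiple references and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(zip(*[tokens[i:] for i in range(n)]))

def modified_precision(candidate, reference, n):
    """Candidate n-gram counts clipped by their counts in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:            # avoid log(0) in this toy version
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

print(bleu("the cat sat on the mat today".split(),
           "the cat sat on the mat".split()))
```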

Lecture 13: Coreference Resolution

  • Identify noun phrases referring to same entity, that is, those that corefer to the same referent

  • Can be nested, important for MT for correct pronouns

  • Based on a discourse model built incrementally from entities in the text and their relationships; a first mention evokes an entity, later mentions access it

  • Can also map to the real-world, entity linking to ontology

Background

  • Can refer

    • indefinite noun phrases

    • definite noun phrases

    • pronouns

    • demonstratives

    • non-stated pronouns

    • names

  • Information status

    • new/old

    • New could be new or old to hearer (if part of common knowledge)

    • Old were introduced previously

    • Some are inferred, based on known properties

  • Non referring expressions exist

    • Appositives

    • Predicatives

    • Expletives

    • Generics

Tasks and datasets

  • Detect and cluster mentions

  • OntoNotes and ARRAU are common datasets

Mention Detection

  • Identify spans of text for each mention: retrieve candidates and filter, considering all $N$-grams of length $\leq 10$ (see the sketch below)

  • Filter extracted candidates based on noun phrases, possessives, named entities, etc.
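
A minimal sketch of candidate-mention enumeration as above (all spans of at most 10 tokens); the filter is only stubbed here, where a real system would keep noun phrases, named entities, possessives, etc.:

```python
def candidate_spans(tokens, max_len=10):
    """All contiguous spans of length <= max_len, as (start, end) index pairs."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans

def keep_span(tokens, start, end):
    """Stub filter: in practice, keep NPs, named entities, possessives, etc."""
    return True

tokens = "Victoria Chen , CFO of Megabucks Banking , saw her pay jump".split()
mentions = [(s, e) for s, e in candidate_spans(tokens) if keep_span(tokens, s, e)]
print(len(mentions), mentions[:3])
```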

Architectures/Classifiers

  • Entity-based represents each entity in a discourse model

  • Mention-based consider mentions independently

  • Models can also use ranking to compare antecedents

  • Mention-pair takes pair of mentions as input, outputs binary classification of coreference

    • hand-coded, or neural network

    • Each pair used for training

    • For training, pair each mention with its antecedent to create positive examples, and with some $k$ intervening non-coreferent mentions to create negative examples

    • Cluster mentions by linking each mention to its best-scoring pair and taking the transitive closure (see the sketch at the end of this section)

  • Mention ranking requires integrating resolution with anaphoricity detection by adding a null antecedent

    • Estimate probabilities over candidate antecedents

    • At test time, link each mention to its top-ranked antecedent and take the transitive closure

  • Entity-based

    • Links mentions to discourse entities/mention clusters

    • Can use mention ranking approach, adding features

    • Or trains RNN to process clusters and combine hidden states

  • Handcoded features

    • Words/embeddings

    • Role

    • Length

    • Matching strings, etc
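
A minimal sketch of building entity clusters from pairwise decisions by taking the transitive closure (via union-find); `coreferent` is a hypothetical stand-in for the trained mention-pair classifier:

```python
def cluster_mentions(mentions, coreferent):
    """Union-find over mentions; link the pairs the classifier accepts, then
    the connected components are the transitive closure (entity clusters)."""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, m1 in enumerate(mentions):
        for m2 in mentions[i + 1:]:
            if coreferent(m1, m2):
                parent[find(m2)] = find(m1)   # union the two clusters

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# Toy pairwise decisions standing in for a trained classifier
links = {("Victoria Chen", "she"), ("she", "her"), ("Megabucks", "the company")}
coref = lambda a, b: (a, b) in links or (b, a) in links
print(cluster_mentions(["Victoria Chen", "she", "her", "Megabucks", "the company"], coref))
```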

Neural Models

  • Span is contiguous sequence of $\leq 10$ words

  • For a span, estimate the probability that each previous span is its antecedent

    • $P(y_{i}) = \frac{\exp(s(i,y_{i}))}{\sum_{y^{\prime} \in Y(i)}\exp(s(i,y^{\prime}))}$

    • $s(i,j) = m(i) + m(j) + c(i,j)$

    • $m$ scores whether a span is a mention, $c$ scores whether span $j$ is an antecedent of span $i$

      • $m$ is a feed-forward neural network

      • similar for $c$

  • Feed-forward NNs take input of vector representation (bi-LSTM w/ attention output)

  • Train by maximizing the sum of probabilities of antecedent pairs in the ground truth (see the sketch after this list)

    • Minimizing cross-entropy
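
A minimal numpy sketch of the scoring above, $s(i,j) = m(i) + m(j) + c(i,j)$, with a softmax over each span's candidate antecedents plus a dummy antecedent scored 0; the linear scorers here are stand-ins for the real feed-forward networks and bi-LSTM span representations:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                          # dimensionality of the span representations
W_m = rng.normal(size=D)       # stand-in for the mention-scoring FFNN m(.)
W_c = rng.normal(size=(D, D))  # stand-in for the pairwise FFNN c(., .)

def m(g):            # mention score of a span representation g
    return float(W_m @ g)

def c(g_i, g_j):     # compatibility score of span i with candidate antecedent j
    return float(g_i @ W_c @ g_j)

def antecedent_distribution(spans, i):
    """P(y_i = j) over candidate antecedents j < i, plus a dummy antecedent
    (score fixed at 0) meaning 'no antecedent'."""
    scores = [0.0]  # dummy antecedent
    for j in range(i):
        scores.append(m(spans[i]) + m(spans[j]) + c(spans[i], spans[j]))
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

spans = rng.normal(size=(4, D))           # pretend bi-LSTM span representations
print(antecedent_distribution(spans, 3))  # distribution over [dummy, span0, span1, span2]
```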

Evaluation

  • Precision and recall and combination, $F$-measure

  • $H$ is hypotheses, $R$ is references

    • Precision: $|H \cap R|/|H|$

    • Recall: $|H \cap R|/|R|$

    • $F$-measure: $2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$ (see the sketch after this list)

    • MUC $F$-measure: $H$ and $R$ are links (pairs of mentions)

    • $B^{3}$: $H$ and $R$ are individual mentions
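
A minimal sketch of the link-based precision/recall/$F$ formulas above, treating $H$ and $R$ as sets of mention pairs (the official MUC scorer counts links over partitions, so this is only the textbook formula):

```python
def prf(hypothesis_links, reference_links):
    """Precision, recall, F over sets of coreference links (mention pairs)."""
    h, r = set(hypothesis_links), set(reference_links)
    overlap = len(h & r)
    precision = overlap / len(h) if h else 0.0
    recall = overlap / len(r) if r else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

H = {("Victoria Chen", "she"), ("she", "her"), ("Megabucks", "it")}
R = {("Victoria Chen", "she"), ("she", "her"), ("her pay", "it")}
print(prf(H, R))   # (2/3, 2/3, 2/3)
```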

Entity Linking

  • Connected to coreference resolution; links mentions to real-world entities in an ontology

  • Detect mentions, disambiguate with supervised learning

  • They can help each other

Winograd Schema Problems

  • Kind of a sentence completion

  • Want to connect ambiguous references

  • Integrating world knowledge can assist in these issues

  • The Winograd Schema Challenge includes coreference problems that people solve easily with world knowledge but that resisted automated methods

  • Somewhat solved with huge models

Gender Bias

  • Biases exist

  • Which pronoun connects more frequently with certain professions

  • Biased datasets

  • Combining a gender-swapped dataset with the original can help reduce bias

  • WinoBias, GAP datasets available

Lecture 14: Word Senses and WordNet

  • Dealing with ambiguity is hard

  • Words can have multiple senses (polysemous words), depending on context

  • WordNet is database of senses and relations

  • Word Sense Disambiguation determines the sense of a polysemous word based on context

Word Senses

  • Discrete representation of aspect of meaning

  • Discrete valued senses distinct from continuous embedding

  • Relationships between senses exist and can enhance semantics

  • Criteria for distinct senses: independent truth conditions, different syntactic behavior, independent sense relations, or antagonistic meanings

  • Zeugma (conjoining two uses of a word in one construction) is used as a test, e.g., with "serve"

Relations Between Senses

  • Synonyms

  • Antonyms

  • Taxonomic, IS-A

  • Meronymy, PART-OF

  • Structured relationships, author/works-of-author

WordNet

  • 150k+ words, nouns, verbs, adjectives, adverbs

  • Each has definition, synonyms, usage examples, etc

  • Synonym sets (synsets) are sets of senses; WordNet is organized around senses, not words

  • Supersenses, categories of senses

  • Sense relations

    • Hypernyms, hyponyms

    • Instance hyper-/hyponyms

    • Meronyms/part meronyms

  • Can form a graph structure with labels on edges (see the sketch below)
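
A minimal sketch of browsing synsets and sense relations with NLTK's WordNet interface (assumes `nltk` is installed and the WordNet data has been fetched with `nltk.download('wordnet')`):

```python
from nltk.corpus import wordnet as wn

# All senses (synsets) of the word "bass"
for syn in wn.synsets("bass"):
    print(syn.name(), "-", syn.definition())

dog = wn.synset("dog.n.01")
print("Synonyms:", [lemma.name() for lemma in dog.lemmas()])
print("Hypernyms:", dog.hypernyms())          # IS-A parents
print("Hyponyms:", dog.hyponyms()[:3])        # more specific kinds of dog
print("Part meronyms:", dog.part_meronyms())  # PART-OF relations
```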

Word Sense Disambiguation

  • Chooses correct sense based on context

  • WordNet, application-specific dictionary, translations, etc.

  • Semantic Concordance – corpus with words labeled with sense

  • SemCor labels Brown corpus with WordNet

  • Can map from Wikipedia/Wikidata/Wiktionary

  • Evaluate with $F_{1} =2PR/(P+R)$ precision/recall based

  • Baseline is most-frequent sense

Embeddings: Contextual

  • ELMo or BERT often used for contextual embeddings

    • Embeddings depend on word's context

    • Different from context free embeddings like word2vec or GloVe

    • BERT trained by masking random words and predicting

  • For each token labeled with a specific sense, compute its BERT embedding; the sense embedding is the mean of these contextual embeddings

  • To predict a sense, compute the contextual embedding and take the nearest neighbor among the sense embeddings (see the sketch after this list)

  • A sense can be absent from the training data; in that case, look up other senses in its synset, hypernyms, etc., and average their embeddings!
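
A minimal numpy sketch of the nearest-neighbor procedure above; `contextual_embedding` is a hypothetical stand-in for a contextual encoder such as BERT, and the sense labels are made up:

```python
import numpy as np

def contextual_embedding(sentence, target):
    """Hypothetical stand-in for a contextual encoder (e.g. BERT) that
    returns the target word's vector in this sentence."""
    rng = np.random.default_rng(abs(hash((sentence, target))) % (2**32))
    return rng.normal(size=16)

def build_sense_embeddings(labeled_examples):
    """Sense embedding = mean of contextual embeddings of its labeled tokens."""
    by_sense = {}
    for sentence, target, sense in labeled_examples:
        by_sense.setdefault(sense, []).append(contextual_embedding(sentence, target))
    return {sense: np.mean(vecs, axis=0) for sense, vecs in by_sense.items()}

def disambiguate(sentence, target, sense_embeddings):
    """Pick the sense whose embedding is nearest (cosine) to the context vector."""
    v = contextual_embedding(sentence, target)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(sense_embeddings, key=lambda s: cos(v, sense_embeddings[s]))

examples = [("he sat on the river bank", "bank", "bank%shore"),
            ("she deposited money at the bank", "bank", "bank%institution")]
senses = build_sense_embeddings(examples)
print(disambiguate("fish swam near the bank", "bank", senses))
```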

Feature-Based

  • Supervised training of classifier

  • Features are words, embeddings, PoS around, etc.

Lesk algorithm

  • No model

  • Knowledge based

  • For $w$, choose the sense whose dictionary description (gloss) overlaps most with the context (see the sketch below)
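
A minimal sketch of simplified Lesk over NLTK's WordNet glosses and examples (assumes the WordNet data is available via `nltk.download('wordnet')`); stop-word removal is skipped for brevity:

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    """Choose the sense whose gloss + examples overlap most with the context."""
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "the boat ran aground on the muddy bank of the river")
print(sense, "-", sense.definition() if sense else "no sense found")
```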

Induction

  • Unsupervised

  • Induce set rather than using pre-defined

  • For each word $w$ in the vocabulary

    • For each instance of $w$ in the corpus

    • Compute a context vector

    • Cluster these vectors (each cluster is a sense of $w$)

    • The centroid/median vector of each cluster is the sense vector for sense $j$ of $w$

    • When processing new text, assign the nearest sense vector (see the sketch below)
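
A minimal sketch of the induction loop above using scikit-learn's KMeans; `context_vector` is a hypothetical stand-in for whatever context representation is actually used (neighboring words, contextual embeddings, etc.):

```python
import numpy as np
from sklearn.cluster import KMeans

def context_vector(sentence, target):
    """Hypothetical stand-in for a real context representation."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=16)

def induce_senses(word, corpus_sentences, n_senses=2):
    """Cluster context vectors of a word's occurrences; each centroid is a sense vector."""
    instances = [s for s in corpus_sentences if word in s.split()]
    X = np.stack([context_vector(s, word) for s in instances])
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit(X)
    return km.cluster_centers_            # one sense vector per induced sense

def assign_sense(sentence, word, sense_vectors):
    """Label a new occurrence with the nearest induced sense vector."""
    v = context_vector(sentence, word)
    return int(np.argmin(np.linalg.norm(sense_vectors - v, axis=1)))

corpus = ["the bank approved the loan", "he fished from the bank",
          "the bank raised interest rates", "reeds grew along the bank"]
senses = induce_senses("bank", corpus)
print(assign_sense("she opened an account at the bank", "bank", senses))
```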