Morphological Smoothing and Extrapolation of Word Embeddings

1 Morphological Smoothing and Extrapolation of Word Embeddings. Ryan Cotterell, Hinrich Schütze, Jason Eisner.

2 Morphology matters! Morphologically rich is not the exception, it's the rule! Facts from WALS: 85% of all languages make use of affixation; 80% mark verb tense through morphology; 65% mark grammatical case through morphology. ACL 2016 Proceedings: roughly 3% of papers involve morphology.
I'm going to start this talk by making a small point about computational morphology. Personally, I think computational morphology is an exciting area of NLP, and I've always been a bit disappointed that more people don't work on it. Morphology is the rule, not the exception! What I mean by this is that if we want NLP systems that work for the majority of the world's languages, this is a problem we need to solve. To show this, I extracted some simple statistics from the WALS database of typological properties: 85% of the world's languages make use of affixation in some form, 80% mark verb tense through morphology, and two thirds mark grammatical case. That's a lot of languages. By way of comparison, I compiled my own statistics from the NAACL proceedings: roughly 3% of the papers (6 of 181, as far as I could tell) dealt with morphology.

3 Words in the Lexicon are Related! Word embeddings are already good at some of this: "running" vs. "sprinting" (similarity), "running" vs. "shoes" (relatedness), "running" vs. "ran" (morphology: PAST TENSE).
Words exhibit various relations, and these relations come in various flavors: similar words, related words, and morphologically related words. Ideally, we would want embeddings to capture all of these notions. A method like word2vec tells us that "run" is similar to "sprint", but not that "ran" is the past tense of "run". This work takes a method that does the first two well and adds the third. Goal: put morphology into word embeddings.

4 Inflectional Morphology is Highly-Structured. Example paradigm for the lemma RUN: run (PRES), runs (3RD PRES SG), ran (PAST TENSE), running (GERUND).
This work focuses on the relations between words in an inflectional paradigm. In comparison to the other relations we just mentioned, morphologically related words display a certain amount of structure in their meaning: across different words, we expect the same basic relations to hold. Just as we can predict the word forms of unknown words, we can predict their meanings as well.

5 The same structure for the lemma SPRINT: sprint (PRES), sprints (3RD PRES SG), sprinted (PAST TENSE), sprinting (GERUND).

6 Same Structure Across Paradigms! Even for the made-up lemma WUG: wug (PRES), wugs (3RD PRES SG), wugged (PAST TENSE), wugging (GERUND).

7 Research Question: How do we exploit structured morphological knowledge in models of word embeddings?

8 A Morphological Paradigm – Strings
Tense slots (columns) and their suffixes: Pres /Ø/, 3P Pres Sg /s/, Past /ed/, Gerund /ing/. Verb stems (rows):

                     Pres /Ø/    3P Pres Sg /s/   Past /ed/     Gerund /ing/
  RUN    /run/       [run]       [runs]           [ran]         [running]
  SPRINT /sprint/    [sprint]    [sprints]        [sprinted]    [sprinting]
  WUG    /wug/       [wug]       [wugs]           [wugged]      [wugging]
  CODE   /code/                  [codes]                        [coding]
  LOVE   /love/      [love]                       [loved]
  BAT    /bat/                   [bats]                         [bating]
  PLAY   /play/                                   [played]

So let's take a look at structured morphological information. Consider this table of partial English verb paradigms.

9 A Morphological Paradigm – Strings. (The same partial paradigm table as on the previous slide, repeated as a build step.)

10 Why is "sprinting" written like that? The underlying morphs /sprint/ and /ing/ are concatenated to sprint#ing, and a (stochastic) orthography step yields the surface form "sprinting".
This is a Bayesian network: each ellipse is a random variable, and we show a possible value that was chosen given the values of its parents. Some of these variables are stochastic; the concatenation happens to be deterministic, and the orthography step is stochastic again. See "Modeling word forms using latent underlying morphs and phonology", Cotterell et al., TACL 2015.

11 Why is "running" written like that? The underlying morphs /run/ and /ing/ are concatenated to run#ing, and the (stochastic) orthography step yields "running". Orthographic change: doubled n!
Again a Bayesian network: each ellipse is a random variable whose value was chosen given the values of its parents; the concatenation is deterministic, the orthography step is stochastic. See Cotterell et al., TACL 2015.
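
The real string model of Cotterell et al. (2015) uses stochastic finite-state machinery; the snippet below is only a toy, deterministic sketch of the concatenate-then-apply-orthography pipeline, with a single hand-written doubling rule, just to make the idea concrete.

```python
import re

def concatenate(morphs):
    """Join underlying morphs with a boundary symbol, e.g. /run/ + /ing/ -> 'run#ing'."""
    return "#".join(morphs)

def orthography(underlying):
    """Toy deterministic rule: double a final consonant before '#ing' when it directly
    follows a vowel (run#ing -> running), otherwise just drop the boundary."""
    m = re.match(r"^(.*[aeiou])([bcdfghlmnprt])#ing$", underlying)
    if m:
        return m.group(1) + m.group(2) * 2 + "ing"
    return underlying.replace("#", "")

print(orthography(concatenate(["run", "ing"])))      # running  (doubled n)
print(orthography(concatenate(["sprint", "ing"])))   # sprinting
```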

12 A Morphological Paradigm – Strings
Knowing the paradigm structure lets us predict the forms in the previously empty cells. Prediction!

                     Pres /Ø/    3P Pres Sg /s/   Past /ed/     Gerund /ing/
  RUN    /run/       [run]       [runs]           [ran]         [running]
  SPRINT /sprint/    [sprint]    [sprints]        [sprinted]    [sprinting]
  WUG    /wug/       [wug]       [wugs]           [wugged]      [wugging]
  CODE   /code/      [code]      [codes]          [coded]       [coding]
  LOVE   /love/      [love]      [loves]          [loved]       [loving]
  BAT    /bat/       [bat]       [bats]           [bated]       [bating]
  PLAY   /play/      [play]      [plays]          [played]      [playing]

13 Matrix Completion: Collaborative Filtering. [Figure: a users-by-movies ratings matrix with some entries observed (for example -37, 29, 19, 29; -36, 67, 77, 22; -24, 61, 74, 12; -79, -41; -52, -39) and the rest missing.]
Now I want to observe that, formally speaking, this is exactly the same problem as another ML problem: some users rated some movies, and you want to predict how they would rate the other movies.

14 Matrix Completion: Collaborative Filtering. [Figure: the same ratings matrix, now with a 3-dimensional latent vector attached to each user (row) and each movie (column).]
One way to solve this is to assign each user and each movie a latent vector in 3-space, so that a user's rating of a movie, when it is known, is predicted by the inner product of their vectors: a user row vector times a movie column vector gives a rating.

15 Matrix Completion: Collaborative Filtering. Example: the vectors [1, -4, 3] and [-5, 2, 1] have dot product -10; adding Gaussian noise gives the observed rating -11.
Here's a graphical model representation. It accounts for the difference between the observed data and the prediction with Gaussian noise, which allows for a bit of wiggle room.

16 Matrix Completion: Collaborative Filtering. [Figure: the factorized matrix, with the held-out ratings now filled in by predicted values such as 59, -80, 6, and 46.] Prediction!
So now we can take more inner products to predict the held-out ratings. Notice that all we're doing is approximating the original ratings by the entries of a rank-3 matrix: the product of the m x 3 user matrix with the 3 x n movie matrix is an m x n matrix of rank 3.
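
A minimal sketch of this kind of low-rank matrix completion, using alternating least squares. The rank of 3 follows the slides and the ratings are the numbers shown there; the placement of the missing cells and the regularizer `lam` are illustrative assumptions.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are movies; np.nan marks unrated cells.
R = np.array([[-37., 29., 19., 29.],
              [-36., 67., 77., 22.],
              [-24., 61., 74., 12.],
              [-79., np.nan, -41., np.nan],
              [np.nan, -52., np.nan, -39.]])
observed = ~np.isnan(R)

rank, lam = 3, 0.1                        # rank 3 as in the talk; lam is an assumed regularizer
rng = np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], rank))   # latent user vectors
V = rng.normal(size=(R.shape[1], rank))   # latent movie vectors

# Alternating least squares: fix one factor, solve a small ridge regression for the other.
for _ in range(50):
    for i in range(R.shape[0]):
        cols = observed[i]
        A = V[cols].T @ V[cols] + lam * np.eye(rank)
        U[i] = np.linalg.solve(A, V[cols].T @ R[i, cols])
    for j in range(R.shape[1]):
        rows = observed[:, j]
        A = U[rows].T @ U[rows] + lam * np.eye(rank)
        V[j] = np.linalg.solve(A, U[rows].T @ R[rows, j])

# Every cell, including the held-out ones, is now predicted by an inner product.
print(np.round(U @ V.T, 1))
```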

17 Morphological Paradigm – Vectors. New in this work: the same paradigm table of verb stems (RUN, SPRINT, WUG, CODE, LOVE, BAT, PLAY) by tense slots, but each cell now holds a word2vec embedding rather than a string.
In our last paper, we showed how to extrapolate a cell's form string given latent strings for its row and column. In this work, we show how to extrapolate a cell's meaning vector given latent meaning vectors for its row and column.

18 Things word2vec doesn't know… Words with the same stem, like "running" and "ran", are related. Words with the same inflection, like "running" and "sprinting", are related. Our goal: put this information into word embeddings, for improved embeddings!

19 Things word2vec doesn't know… Words with the same stem, like "running" and "ran", are related. Words with the same inflection, like "running" and "sprinting", are related. Our goal: put this information into word embeddings, for great good!

20 Morphological Paradigm – Vectors. [Figure: the paradigm table skeleton, stems RUN, SPRINT, WUG, CODE, LOVE, BAT, PLAY as rows and slots Pres, 3P Pres Sg, Past, Gerund as columns, to be filled with vectors.]
So let's take a look at the same structured morphological information, now with embeddings in the cells.

21 Morphological Paradigm – Vectors. [Build: a smaller sub-table with stems RUN and LOVE and slots Pres and Past, whose cells hold observed word vectors.]

22 Morphological Paradigm – Vectors. [Build continues on the RUN/LOVE by Pres/Past sub-table.]

23 Morphological Paradigm – Vectors. [Build continues.]

24 Morphological Paradigm – Vectors. [The RUN/LOVE by Pres/Past sub-table, with a missing cell to fill in.]
Extrapolation is an extreme form of smoothing: it is what you do when you have observed zero tokens of a form. The filled-in matrix is close to what you would get by vector addition, and we should smooth everything jointly.

25 Why does "running" mean that? Add the latent RUN and GERUND vectors, then add Gaussian noise, to get the embedding of "running".
This is a Bayesian network: each ellipse is a random variable, and we show a possible value that was chosen given the values of its parents; the addition is deterministic, the noise step is stochastic.

26 Why does "sprinting" mean that? Add the latent SPRINT and GERUND vectors, then add Gaussian noise, to get the embedding of "sprinting".
Same Bayesian network structure as on the previous slide.

27 Morphological Paradigm – Vectors. Same Offset! [In the sub-table, the offset between the Pres vector and the Past vector is the same for every stem.]

28 Morphological Paradigm – Vectors. Linear! [The full table of stems by slots; the cell embeddings behave linearly in the stem and slot vectors.] (Build not finished yet.)

29 Morphological Paradigm – Vectors. Predict! [The same table; the latent stem and slot vectors let us predict the embeddings in the empty cells.]

30 Additive Model. The embedding of "running" is modeled as RUN + GERUND, and the embedding of "ran" as RUN + PAST.
Indeed, this is fundamentally a question about compositionality. Our goal is to learn a function that takes a lemma and a slot as input and returns a vector encoding the meaning, just as we earlier modeled a form's string given its cell in the paradigm.

31 Relation to Vector Offset Method. 1) ran = RUN + PAST. 2) running = RUN + GERUND, sprinting = SPRINT + GERUND, sprinted = SPRINT + PAST. 3) Therefore running - sprinting + sprinted ≈ ran: the familiar vector offset (analogy) method falls out of the additive model.
Again, our goal is to learn a function that takes a lemma and a slot as input and returns a vector encoding the meaning, analogous to how we model a form given its cell in the paradigm.
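
A minimal sketch of the additive composition and of the offset identity above. The random vectors are stand-ins for the latent morpheme vectors, and the noise terms of the full model are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200

# Stand-ins for the latent morpheme vectors; in the model these are latent variables.
RUN, SPRINT = rng.normal(size=dim), rng.normal(size=dim)
GERUND, PAST = rng.normal(size=dim), rng.normal(size=dim)

# Additive model: a word's meaning vector is lemma + slot (noise omitted).
running = RUN + GERUND
ran = RUN + PAST
sprinting = SPRINT + GERUND
sprinted = SPRINT + PAST

# The vector offset method falls out:
#   running - sprinting = (RUN + GERUND) - (SPRINT + GERUND) = RUN - SPRINT
#   ran - sprinted      = (RUN + PAST)   - (SPRINT + PAST)   = RUN - SPRINT
assert np.allclose(running - sprinting + sprinted, ran)
```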

32 Generating a Type Vector in the Lexicon. Step 1: sample the morpheme vectors (e.g. RUN and GERUND) from a prior. Step 2: sample the true word vector around their sum. Step 3: sample the observed word2vec vector around the true vector.
There are two kinds of Gaussian noise to explain. The noise in Step 2 explains non-compositionality; the noise in Step 3 explains the noise introduced during word2vec training. The ratio of the two variances decides how much smoothing there is. The covariance of the second Gaussian is inversely proportional to the number of times word2vec has seen the word, which results in more smoothing for rarer words.
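
A sketch of this three-step generative story in code. The scale parameters are assumed values; the only count-dependence stated in the talk is that the observation covariance is inversely proportional to the word's corpus count, which is what the last step encodes.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 200
sigma_prior = 1.0   # prior scale for morpheme vectors (assumed value)
sigma_word = 0.5    # Step-2 noise, allowing non-compositional meaning (assumed value)
sigma_obs = 1.0     # base word2vec observation noise (assumed value)

# Step 1: sample the latent morpheme vectors from the prior.
v_RUN = rng.normal(scale=sigma_prior, size=dim)
v_GERUND = rng.normal(scale=sigma_prior, size=dim)

# Step 2: sample the true word vector around the additive composition.
v_running_true = v_RUN + v_GERUND + rng.normal(scale=sigma_word, size=dim)

# Step 3: sample the observed word2vec vector around the true vector. The observation
# variance shrinks with the word's corpus count, so rare words get noisier observations
# and are therefore smoothed more strongly toward the composition.
count_running = 50
v_running_obs = v_running_true + rng.normal(scale=sigma_obs / np.sqrt(count_running), size=dim)
```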

33 Generating a Type Vector in the Lexicon. [Build: the same generative picture for "running", highlighting the two Gaussian noise steps.]

34 Directed Graphical Model. The latent RUN, SPRINT, GERUND, and PAST vectors jointly generate the observed embeddings of "running", "ran", "sprinting", and "sprinted".
Words do not occur in isolation! Indeed, they form a larger network, and we reconstruct them jointly.

35 Smoothing and Extrapolation. All word embeddings are noisy: optimization during training is incomplete, and for many words we have only observed a few tokens. Our model smooths all of the word embeddings jointly, based on morphological information. Note: extrapolation is extreme smoothing, for when you have never seen the word!

36 Smoothing and Extrapolation. The model jointly smooths all word embeddings.

37 Gaussian Graphical Model. All conditionals (the probability of a child given its parents) are Gaussian distributed, so exact inference is always tractable (through matrix inversion). This is a general framework for reasoning over word embeddings, not limited to morphology. As a post-processing model it is lightning fast: about 10 seconds for 1-best joint inference of the embeddings.
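
To make the "exact inference by solving a linear system" point concrete, here is a minimal sketch under an assumed parameterization (each observed embedding = lemma vector + slot vector + isotropic Gaussian noise, with a simple Gaussian prior playing the role of the regularizer `lam`); it is not the paper's exact model.

```python
import numpy as np

# Tiny example: two lemmas, two slots, four observed words.
words = [("RUN", "GERUND"), ("RUN", "PAST"), ("SPRINT", "GERUND"), ("SPRINT", "PAST")]
lemmas, slots = ["RUN", "SPRINT"], ["GERUND", "PAST"]
latents = lemmas + slots

dim = 4                                          # tiny dimensionality for the sketch
rng = np.random.default_rng(2)
obs = {w: rng.normal(size=dim) for w in words}   # stand-in word2vec embeddings

# Design matrix: each observed word selects one lemma node and one slot node as parents.
X = np.zeros((len(words), len(latents)))
for i, (lem, slo) in enumerate(words):
    X[i, latents.index(lem)] = 1.0
    X[i, latents.index(slo)] = 1.0

Y = np.stack([obs[w] for w in words])      # (num_words, dim)
lam = 0.1                                  # stands in for the Gaussian prior's precision

# Because the model is linear-Gaussian, the 1-best (MAP) latent vectors solve a
# ridge-style linear system, shared across all embedding dimensions at once.
A = X.T @ X + lam * np.eye(len(latents))
latent_vecs = np.linalg.solve(A, X.T @ Y)  # (num_latents, dim)

# Smoothed embedding of "running" = inferred RUN vector + inferred GERUND vector.
smoothed_running = latent_vecs[latents.index("RUN")] + latent_vecs[latents.index("GERUND")]
```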

38 Where did we get the graph structure? Example lexicon entries: run = RUN, infinitive; runs = RUN, 3rd pres. sing.; running = RUN, gerund; ran = RUN, past; sprints, sprinted, sprinting = SPRINT.
Answer: from morphological lexicons. This also works fine for words with more than 2 morphemes.

39 Where did we get the graph structure? [Figure: the lexicon entries on the left; on the right, the resulting graphical model, with latent nodes RUN, SPRINT, GERUND, and PAST as parents of the observed embeddings of "running", "ran", "sprinting", and "sprinted".]
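
A small sketch of turning lexicon entries like those above into the graph structure: each observed word form gets its lemma node and its slot node as parents. The dictionary format here is just an assumed stand-in for a morphological lexicon.

```python
# Assumed stand-in format for a morphological lexicon: surface form -> (lemma, slot).
lexicon = {
    "run": ("RUN", "infinitive"),
    "runs": ("RUN", "3rd pres. sing."),
    "running": ("RUN", "gerund"),
    "ran": ("RUN", "past"),
    "sprints": ("SPRINT", "3rd pres. sing."),
    "sprinted": ("SPRINT", "past"),
    "sprinting": ("SPRINT", "gerund"),
}

# Each observed word form gets two parents in the directed graphical model:
# its lemma node and its slot node, e.g. ("RUN", "running") and ("gerund", "running").
edges = [(lemma, word) for word, (lemma, slot) in lexicon.items()]
edges += [(slot, word) for word, (lemma, slot) in lexicon.items()]
```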

40 Why you should care! You aren't going to see all the words: there are too many, thanks to Zipf's law. But we know some words must exist: every English verb has a gerund, even if you didn't see it in a corpus. Can we guess its meaning? This gives open-vocabulary word embeddings, and it is simple to implement, to train, and to extend!

41 Why use our model? We share statistical strength among morphologically related words. This improves embedding quality, especially for rare words. We can impute embeddings for unknown words. And it is simple to implement, to train, and to extend!

42 How do we learn the embeddings? We learn the model parameters with Viterbi EM. E-step: simple coordinate descent (about 10 seconds). M-step: update the covariance matrix. See the paper for more details!
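
A rough sketch of what such a Viterbi-EM loop could look like, called with dictionaries mapping each observed word to its embedding, lemma, and slot. The closed-form coordinate updates and the single isotropic variance are simplifying assumptions; the paper's actual updates differ in detail.

```python
import numpy as np

def viterbi_em(obs, lemma_of, slot_of, dim, iters=10, inner=5):
    """obs: word -> observed embedding; lemma_of / slot_of: word -> lemma / slot label."""
    lemmas = sorted({lemma_of[w] for w in obs})
    slots = sorted({slot_of[w] for w in obs})
    L = {l: np.zeros(dim) for l in lemmas}   # latent lemma vectors
    S = {s: np.zeros(dim) for s in slots}    # latent slot vectors
    var = 1.0                                # single isotropic noise variance (simplification)

    for _ in range(iters):
        # E-step: coordinate descent toward the 1-best latent vectors; with everything
        # Gaussian, each update is a closed-form average of residuals.
        for _ in range(inner):
            for l in lemmas:
                ws = [w for w in obs if lemma_of[w] == l]
                L[l] = np.mean([obs[w] - S[slot_of[w]] for w in ws], axis=0)
            for s in slots:
                ws = [w for w in obs if slot_of[w] == s]
                S[s] = np.mean([obs[w] - L[lemma_of[w]] for w in ws], axis=0)
        # M-step: re-estimate the noise variance from the reconstruction residuals.
        resid = [obs[w] - L[lemma_of[w]] - S[slot_of[w]] for w in obs]
        var = float(np.mean([r @ r for r in resid]) / dim)

    return L, S, var
```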

43 Training Set-up. We experimented on 5 languages: Czech, English, German, Spanish, and Turkish, with varying degrees of morphology: English → German → Spanish → Czech → Turkish. Initial embeddings are trained on tokenized Wikipedia text with skip-gram with negative sampling, 200 dimensions.
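
For concreteness, a sketch of training such initial embeddings with gensim's skip-gram with negative sampling. The talk does not name a toolkit; the window size, negative-sample count, and the toy corpus here are assumptions (argument names follow gensim 4.x).

```python
from gensim.models import Word2Vec

# Toy stand-in corpus: an iterable of tokenized sentences (the real one is tokenized Wikipedia).
sentences = [["the", "dog", "was", "running"], ["she", "sprinted", "home", "quickly"]]

# Skip-gram (sg=1) with negative sampling, 200-dimensional vectors.
model = Word2Vec(sentences, vector_size=200, sg=1, negative=10, window=5, min_count=1)

vec_running = model.wv["running"]   # the initial embedding that the model will later smooth
```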

44 Experiment 1: Vector Prediction. Task: hold out a vector, e.g. the embedding of "ran", and predict it from the rest of the graphical model (RUN, SPRINT, GERUND, PAST; "running", "sprinting", "sprinted").

45 Experiment 1: Vector Prediction. We score a prediction by choosing the closest vector in the space under cosine distance. Baseline: standard analogies, which also predict forms, e.g. predicting "ran" from "running", "sprinting", and "sprinted". All details are in the paper!
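
A minimal sketch of this evaluation step: build the analogy-style prediction and return its nearest vocabulary word under cosine similarity. The embedding table here is random stand-in data, and real evaluations usually exclude the query words themselves.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_word(target_vec, embeddings):
    """Return the vocabulary item whose embedding is closest to target_vec under cosine."""
    return max(embeddings, key=lambda w: cosine(target_vec, embeddings[w]))

# Stand-in embedding table; the analogy baseline predicts the held-out "ran" as
# running - sprinting + sprinted, then takes the nearest word in the space.
rng = np.random.default_rng(4)
E = {w: rng.normal(size=200) for w in ["ran", "running", "sprinting", "sprinted", "shoes"]}
guess = predict_word(E["running"] - E["sprinting"] + E["sprinted"], E)
```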

46 Experiment 2: Perplexity. How perplexed is skip-gram on held-out data? This is just like standard language model evaluation. Question: do our smoothed and extrapolated word embeddings improve prediction?

47 Experiment 2: Perplexity. [Plot: perplexity in bits against the number of observed tokens, for unsmoothed vs. smoothed embeddings.] Take-away: smoothing helps, and it helps more when fewer tokens have been observed. See the paper for more details.

48 Experiment 3: Word Similarity. Task: Spearman's ρ (x 100) between human judgements and the cosine between vectors. Similarity is about lemmata, not inflected forms, so we use the latent lemma embedding!
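
A sketch of this evaluation with scipy. The word pairs, judgements, and random embeddings are stand-in data; in the real experiment the vectors would be the inferred lemma embeddings.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Stand-in lemma embeddings and human similarity judgements on word pairs.
E = {w: rng.normal(size=200) for w in ["run", "sprint", "shoe", "banana"]}
pairs = [("run", "sprint", 8.5), ("run", "shoe", 4.0), ("run", "banana", 0.5)]

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

human = [h for _, _, h in pairs]
model = [cosine(E[a], E[b]) for a, b, _ in pairs]

rho, _ = spearmanr(human, model)
print(round(rho * 100, 1))   # the talk reports Spearman's rho x 100
```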

49 Directed Graphical Model. Lemmata (RUN, SPRINT) and slots (GERUND, PAST) are latent; the word embeddings of "running", "ran", "sprinting", and "sprinted" are observed.
Words do not occur in isolation! Indeed, they form a larger network.

50 Experiment 3: Word Similarity. (Results slide for the word-similarity task: Spearman's ρ x 100 between human judgements and cosine similarities, using the latent lemma embeddings.)

51 Future Work. Integrate morphological information with character-level models! Research questions: Are character-level models enough, or do we need structured morphological information? Can morphology help character-level neural networks? One direction is to put the morphology directly into the training objective.

52 Thanks for your attention! Fin.

53 How do we compose words? Given the latent RUN and GERUND vectors, how do we build the embedding of "running"?
This boils down to a question about compositionality. Our goal is to learn a function that takes a lemma and a slot as input and returns a vector encoding the meaning, analogous to how we model a form's string given its cell in the paradigm.

54 Keep it simple: Addition! The embedding of "running" is RUN + GERUND.
Again, this is fundamentally about compositionality: a function from a lemma and a slot to a meaning vector, analogous to how we model a form given its cell in the paradigm.