Understanding Word and Sentence Embeddings
Learn about the fundamental building block behind large language models, word and sentence embeddings, and how to represent words and sentences in a way that neural networks can work with.
Santiago
Here is a quick introduction to the fundamental building block behind large language models: word and sentence embeddings.
The Internet is mainly text.
For centuries we've captured most of our knowledge using words, but there's one problem:
Neural networks hate text.
Turning words into numbers is more complex than you think.
The simplest approach is to use consecutive numbers to represent each word in our vocabulary:
• King → 1
• Queen → 2
• Prince → 3
• Princess → 4
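In code, this scheme is nothing more than a lookup table. Here is a minimal Python sketch; the four-word vocabulary and the variable names are just for illustration:

```python
# Assign consecutive integers to each word in a tiny example vocabulary.
vocabulary = ["King", "Queen", "Prince", "Princess"]
word_to_id = {word: index + 1 for index, word in enumerate(vocabulary)}

print(word_to_id)           # {'King': 1, 'Queen': 2, 'Prince': 3, 'Princess': 4}
print(word_to_id["Queen"])  # 2
```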
Unfortunately, neural networks tend to see what's not there.
Is a Princess four times as important as a King? Of course not.
Here is a better representation:
• King → [1, 0, 0, 0]
• Queen → [0, 1, 0, 0]
• Prince → [0, 0, 1, 0]
• Princess → [0, 0, 0, 1]
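Here is the same idea as a short Python sketch; the one_hot helper is purely illustrative, not from any particular library:

```python
vocabulary = ["King", "Queen", "Prince", "Princess"]

def one_hot(word, vocabulary):
    # A vector with a single 1 in the word's position and 0 everywhere else.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("King", vocabulary))      # [1, 0, 0, 0]
print(one_hot("Princess", vocabulary))  # [0, 0, 0, 1]
```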
We call this particular representation "one-hot encoding."
One-hot encoding fixes the problem of networks misinterpreting ordinal values.
But the Oxford English Dictionary says there are 171,476 words in current use, so each one-hot vector would need 171,476 components, almost all of them zero. We need a smarter way to create our vectors.
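A quick back-of-the-envelope calculation shows why that doesn't scale; the float32 assumption below is only there to make the numbers concrete:

```python
vocab_size = 171_476  # words in current use, per the Oxford English Dictionary

# One float32 slot per word, with a single non-zero entry per vector.
bytes_per_vector = vocab_size * 4
print(f"{bytes_per_vector / 1e6:.1f} MB per one-hot vector")  # ~0.7 MB

# A table holding one such vector for every word in the vocabulary.
table_bytes = vocab_size * bytes_per_vector
print(f"{table_bytes / 1e9:.1f} GB for the full table")       # ~117.6 GB
```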
Here is where the idea of "word embeddings" enters the picture.
We know that King and Queen are related just like Prince and Princess.
Word embeddings have a simple characteristic:
Related words should be close to each other, while words with different meanings should lie far away.
Can we create a better representation using this idea?
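"Close" and "far away" are usually made precise with a similarity measure. Cosine similarity is one common choice; the sketch below uses made-up vectors, so only the measure itself is the point:

```python
import math

def cosine_similarity(a, b):
    # Close to 1.0 when the vectors point in the same direction; lower otherwise.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king, queen, pizza = [3.0, 1.0], [3.0, 2.0], [-2.0, 0.5]  # made-up vectors
print(cosine_similarity(king, queen))  # ~0.96: related words end up close
print(cosine_similarity(king, pizza))  # ~-0.84: unrelated words end up far apart
```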
I placed the 4 words in a two-dimensional chart.
I created this chart by hand, but an actual application would use a neural network to find the best representation.
Something critical becomes apparent:
King and Queen are close to each other, just like the words Prince and Princess are.
This encoding captures a crucial characteristic of our language: related concepts stay together!
And this is just the beginning.
Moving on the horizontal axis from left to right, we go from masculine (King/Prince) to feminine (Queen/Princess).
And if we move on the vertical axis, we go from a Prince to a King and from a Princess to a Queen.
Our embedding encodes the concepts of "gender" and "age"!
We can derive the new vector encodings from the coordinates of our chart, writing the "age" coordinate first and the "gender" coordinate second:
• King → [3, 1]
• Queen → [3, 2]
• Prince → [1, 1]
• Princess → [1, 2]
The first component represents "age." The second component represents "gender."
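With these exact vectors we can check the "related concepts stay together" property numerically. The distance helper below is illustrative, not from any library:

```python
import math

embeddings = {
    "King":     [3, 1],
    "Queen":    [3, 2],
    "Prince":   [1, 1],
    "Princess": [1, 2],
}

def distance(a, b):
    # Euclidean distance between two 2-D embeddings.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(embeddings["King"], embeddings["Queen"]))     # 1.0   -> related, close
print(distance(embeddings["King"], embeddings["Princess"]))  # ~2.24 -> less related, farther

# The "gender" direction is the same whether we go King -> Queen or Prince -> Princess.
print([q - k for q, k in zip(embeddings["Queen"], embeddings["King"])])           # [0, 1]
print([p2 - p1 for p2, p1 in zip(embeddings["Princess"], embeddings["Prince"])])  # [0, 1]
```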
I used two dimensions for this example because we only have four words, but using more would allow us to represent other practical concepts besides gender and age.
For instance, GPT-3 uses 12,288 dimensions to encode its vocabulary.
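Under the hood, an embedding layer is just a matrix with one learned row per token. Here is a minimal PyTorch sketch of the shapes involved; the 1,000-token vocabulary is an arbitrary assumption to keep the example small, while 12,288 matches the GPT-3 figure above:

```python
import torch
import torch.nn as nn

vocab_size = 1_000       # assumed vocabulary size, kept small for the example
embedding_dim = 12_288   # dimensions per token, matching the GPT-3 figure

# One learned row of 12,288 numbers for every entry in the vocabulary.
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([17, 42, 7])  # three token ids from the vocabulary
vectors = embedding(token_ids)         # look up one row per token
print(vectors.shape)                   # torch.Size([3, 12288])
```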
Modern models embed tokens, chunks of text of arbitrary size, not just whole words.
This is certainly one of those ideas that have changed the field.
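As a concrete illustration, here is a minimal sketch using the tiktoken library (an assumption on my part; any subword tokenizer would do) to show how a sentence is split into token ids before anything gets embedded:

```python
import tiktoken

# Split a sentence into tokens: some are whole words, others are pieces of words.
encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("Embeddings turn words into numbers.")

print(token_ids)                                  # a list of integer token ids
print([encoding.decode([t]) for t in token_ids])  # the text chunk behind each id
```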
I originally wrote this story for my newsletter subscribers.
You can receive these right in your inbox by subscribing here: https://t.co/WrAVnRGoNM