Understanding Word and Sentence Embeddings
Learn about the fundamental building block behind large language models, word and sentence embeddings, and how to represent words and sentences in a way that neural networks can work with.
Santiago
Here is a quick introduction to the fundamental building block behind large language models: word and sentence embeddings.
The Internet is mainly text.
For centuries we've captured most of our knowledge using words, but there's one problem:
Neural networks hate text.
Turning words into numbers is more complex than you think.
The simplest approach is to use consecutive numbers to represent each word in our vocabulary:
• King → 1
• Queen → 2
• Prince → 3
• Princess → 4
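In code, this scheme is nothing more than a lookup table. Here is a minimal Python sketch; the four-word vocabulary and the variable names are just for illustration:

```python
# Assign consecutive integers to each word in a tiny example vocabulary.
vocabulary = ["King", "Queen", "Prince", "Princess"]
word_to_id = {word: index + 1 for index, word in enumerate(vocabulary)}

print(word_to_id)           # {'King': 1, 'Queen': 2, 'Prince': 3, 'Princess': 4}
print(word_to_id["Queen"])  # 2
```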
Unfortunately, neural networks tend to see what's not there.
Is a Princess four times as important as a King? Of course not.
Here is a better representation:
• King → [1, 0, 0, 0]
• Queen → [0, 1, 0, 0]
• Prince → [0, 0, 1, 0]
• Princess → [0, 0, 0, 1]
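Here is the same idea as a short Python sketch; the one_hot helper is purely illustrative, not from any particular library:

```python
vocabulary = ["King", "Queen", "Prince", "Princess"]

def one_hot(word, vocabulary):
    # A vector with a single 1 in the word's position and 0 everywhere else.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("King", vocabulary))      # [1, 0, 0, 0]
print(one_hot("Princess", vocabulary))  # [0, 0, 0, 1]
```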
We call this particular representation "one-hot encoding."
One-hot encoding fixes the problem of networks misinterpreting ordinal values.
But the Oxford English Dictionary says there are 171,476 words in current use, so each one-hot vector would need 171,476 components, almost all of them zero. We need a smarter way to create our vectors.
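A quick back-of-the-envelope calculation shows why that doesn't scale; the float32 assumption below is only there to make the numbers concrete:

```python
vocab_size = 171_476  # words in current use, per the Oxford English Dictionary

# One float32 slot per word, with a single non-zero entry per vector.
bytes_per_vector = vocab_size * 4
print(f"{bytes_per_vector / 1e6:.1f} MB per one-hot vector")  # ~0.7 MB

# A table holding one such vector for every word in the vocabulary.
table_bytes = vocab_size * bytes_per_vector
print(f"{table_bytes / 1e9:.1f} GB for the full table")       # ~117.6 GB
```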
Here is where the idea of "word embeddings" enters the picture.
We know that King and Queen are related just like Prince and Princess.
Word embeddings have a simple characteristic:
Related words should be close to each other, while words with different meanings should lie far away.
Can we create a better representation using this idea?
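"Close" and "far away" are usually made precise with a similarity measure. Cosine similarity is one common choice; the sketch below uses made-up vectors, so only the measure itself is the point:

```python
import math

def cosine_similarity(a, b):
    # Close to 1.0 when the vectors point in the same direction; lower otherwise.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king, queen, pizza = [3.0, 1.0], [3.0, 2.0], [-2.0, 0.5]  # made-up vectors
print(cosine_similarity(king, queen))  # ~0.96: related words end up close
print(cosine_similarity(king, pizza))  # ~-0.84: unrelated words end up far apart
```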
I placed the 4 words in a two-dimensional chart.
I created this chart by hand, but an actual application would use a neural network to find the best representation.
Something critical becomes apparent:
King and Queen are close to each other, just like the words Prince and Princess are.
This encoding captures a crucial characteristic of our language: related concepts stay together!
And this is just the beginning.
Moving on the horizontal axis from left to right, we go from masculine (King/Prince) to feminine (Queen/Princess).
And if we move on the vertical axis, we go from a Prince to a King and from a Princess to a Queen.
Our embedding encodes the concepts of "gender" and "age"!
We can derive the new vector encodings from the coordinates of our chart, writing the "age" coordinate first and the "gender" coordinate second:
• King → [3, 1]
• Queen → [3, 2]
• Prince → [1, 1]
• Princess → [1, 2]
The first component represents "age." The second component represents "gender."
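With these exact vectors we can check the "related concepts stay together" property numerically. The distance helper below is illustrative, not from any library:

```python
import math

embeddings = {
    "King":     [3, 1],
    "Queen":    [3, 2],
    "Prince":   [1, 1],
    "Princess": [1, 2],
}

def distance(a, b):
    # Euclidean distance between two 2-D embeddings.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(embeddings["King"], embeddings["Queen"]))     # 1.0   -> related, close
print(distance(embeddings["King"], embeddings["Princess"]))  # ~2.24 -> less related, farther

# The "gender" direction is the same whether we go King -> Queen or Prince -> Princess.
print([q - k for q, k in zip(embeddings["Queen"], embeddings["King"])])           # [0, 1]
print([p2 - p1 for p2, p1 in zip(embeddings["Princess"], embeddings["Prince"])])  # [0, 1]
```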
I used two dimensions for this example because we only have four words, but using more would allow us to represent other practical concepts besides gender and age.
For instance, GPT-3 uses 12,288 dimensions to encode its vocabulary.
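Under the hood, an embedding layer is just a matrix with one learned row per token. Here is a minimal PyTorch sketch of the shapes involved; the 1,000-token vocabulary is an arbitrary assumption to keep the example small, while 12,288 matches the GPT-3 figure above:

```python
import torch
import torch.nn as nn

vocab_size = 1_000       # assumed vocabulary size, kept small for the example
embedding_dim = 12_288   # dimensions per token, matching the GPT-3 figure

# One learned row of 12,288 numbers for every entry in the vocabulary.
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([17, 42, 7])  # three token ids from the vocabulary
vectors = embedding(token_ids)         # look up one row per token
print(vectors.shape)                   # torch.Size([3, 12288])
```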
Modern models embed tokens, chunks of text of arbitrary size, not just whole words.
This is certainly one of those ideas that have changed the field.
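As a concrete illustration, here is a minimal sketch using the tiktoken library (an assumption on my part; any subword tokenizer would do) to show how a sentence is split into token ids before anything gets embedded:

```python
import tiktoken

# Split a sentence into tokens: some are whole words, others are pieces of words.
encoding = tiktoken.get_encoding("cl100k_base")
token_ids = encoding.encode("Embeddings turn words into numbers.")

print(token_ids)                                  # a list of integer token ids
print([encoding.decode([t]) for t in token_ids])  # the text chunk behind each id
```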
I originally wrote this story for my newsletter subscribers.
You can receive these right in your inbox by subscribing here: https://t.co/WrAVnRGoNM