Peek under the hood of GPT-3 in under 3 minutes.
So, you’ve seen some amazing GPT-3 demos on Twitter (if not, where’ve you been?). This mega machine learning model, created by OpenAI, can write its own op-eds, poems, articles, and even working code:
This is mind blowing.— Sharif Shameem (@sharifshameem) July 13, 2020
With GPT-3, I built a layout generator where you just describe any layout you want, and it generates the JSX code for you.
W H A T pic.twitter.com/w8JkrZO4lk
=GPT3()... the spreadsheet function to rule them all.— Paul Katsen (@pavtalk) July 21, 2020
Impressed with how well it pattern matches from a few examples.
The same function looked up state populations, peoples' twitter usernames and employers, and did some math. pic.twitter.com/W8FgVAov2f
If you want to try out GPT-3 today, you’ll need to apply to be whitelisted by OpenAI. But the applications of this model seem endless: you could conceivably use it to query a SQL database in plain English, automatically comment code, automatically generate code, write trendy article headlines, write viral Tweets, and a whole lot more.
But what’s going on under the hood of this incredible model? Here’s a (brief) look inside.
GPT-3 is a neural-network-powered language model. A language model is a model that estimates how likely a given piece of text is to exist in the world. For example, a language model can label the sentence “I take my dog for a walk” as more probable to exist (e.g. on the Internet) than the sentence “I take my banana for a walk.” This is true for sentences as well as phrases and, more generally, any sequence of characters.
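To make the dog-versus-banana comparison concrete, here is a minimal sketch of a language model scoring two sentences. It is a toy bigram counter, not a neural network and nothing like GPT-3 internally; the four-sentence corpus, the `log_prob` helper, and the add-one smoothing are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for "the Internet" (illustrative assumption).
corpus = [
    "i take my dog for a walk",
    "i take my dog to the park",
    "i eat my banana for breakfast",
    "we take the dog for a walk",
]

def tokenize(sentence):
    # Lowercase, split on whitespace, and add start/end markers.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

unigram_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    tokens = tokenize(line)
    unigram_counts.update(tokens[:-1])             # how often each word starts a pair
    bigram_counts.update(zip(tokens, tokens[1:]))  # how often each word pair occurs

vocab = {tok for line in corpus for tok in tokenize(line)}

def log_prob(sentence, alpha=1.0):
    """Log-probability of a sentence under an add-alpha smoothed bigram model."""
    tokens = tokenize(sentence)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = unigram_counts[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total

dog = log_prob("I take my dog for a walk")
banana = log_prob("I take my banana for a walk")
print(dog > banana)  # True: the dog sentence scores as more probable
```

The banana sentence still gets a nonzero probability, just a lower one, because "my banana" and "banana for" each appear only once in this toy corpus.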
Like most language models, GPT-3 is trained on an unlabeled text dataset (in this case, the training data includes Common Crawl and Wikipedia, among others), which is an elegant setup because nobody has to label anything by hand. The model reads a stretch of text and must learn to predict the next word using only the words that came before it as context. (Filling in randomly removed words using the words on both sides is how BERT is trained, not GPT-3.) It’s a simple training task that results in a powerful and generalizable model.
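For GPT-style models specifically, the training target is the next word given the words that precede it, which is why no human labeling is needed: the text supplies its own answers. The sketch below shows how one raw sentence turns into training examples. The function name `next_token_examples`, the whitespace tokenization, and the `context_size` parameter are illustrative assumptions; real models work on subword tokens and much longer contexts.

```python
def next_token_examples(text, context_size=4):
    """Turn raw text into (context, target) pairs for next-word prediction."""
    tokens = text.split()  # whitespace split; real models use subword tokens
    examples = []
    for i in range(1, len(tokens)):
        # The context is only the words BEFORE position i, never after it.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples

examples = next_token_examples("I take my dog for a walk")
for context, target in examples:
    print(" ".join(context), "->", target)
```

A seven-word sentence yields six examples, from "I -> take" up to "my dog for a -> walk".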
The GPT-3 model architecture itself is a transformer-based neural network. This architecture was introduced in 2017 and quickly became popular, and it is the basis for the popular NLP model BERT and GPT-3’s predecessor, GPT-2. From an architecture perspective, GPT-3 is not actually very novel! So what makes it so special and magical?
IT’S REALLY BIG. I mean really big. With 175 billion parameters, it was the largest language model ever created at the time of its release (roughly an order of magnitude larger than its nearest competitor!), and it was trained on one of the largest text datasets ever used for a language model. This, it appears, is the main reason GPT-3 is so impressively “smart” and human-sounding.
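To get a feel for what 175 billion parameters means in practice, here is a back-of-the-envelope sketch of how much space the weights alone would take up. The bytes-per-parameter figures (2 for 16-bit floats, 4 for 32-bit floats) are assumptions about storage format, not something OpenAI has specified here, and the helper name `weights_size_gb` is made up for this example.

```python
PARAMETERS = 175_000_000_000  # parameter count quoted above

def weights_size_gb(num_parameters, bytes_per_parameter):
    """Size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

fp16_gb = weights_size_gb(PARAMETERS, 2)  # assuming 16-bit floats
fp32_gb = weights_size_gb(PARAMETERS, 4)  # assuming 32-bit floats
print(fp16_gb, fp32_gb)  # 350.0 700.0
```

Under those assumptions the weights come to hundreds of gigabytes, far more than a single consumer GPU can hold in memory.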
But here’s the really magical part. As a result of its humongous size, GPT-3 can do what no other model can do (well): perform specific tasks without any special tuning. You can ask GPT-3 to be a translator, a programmer, a poet, or a famous author, and it can do it with its user (you) providing fewer than 10 training examples. Damn.
This is what makes GPT-3 so exciting to machine learning practitioners. Other language models (like BERT) require an elaborate fine-tuning step before they can handle a specific task (like translation, summarization, spam detection, etc.). To teach such a model translation, for example, you have to go out and gather a large training dataset of French-English sentence pairs (on the order of thousands or tens of thousands of examples), which can be cumbersome or sometimes impossible, depending on the task. With GPT-3, you don’t need to do that fine-tuning step. This is the heart of it. This is what gets people excited about GPT-3: custom language tasks with little or no training data.
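Here is a rough sketch of what "providing a handful of examples" can look like: instead of retraining the model, you pack the examples into the text you hand it and let it continue the pattern. The prompt layout, the `build_few_shot_prompt` helper, and the example pairs are all assumptions for illustration; this only builds the text and does not call any GPT-3 API.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot translation prompt as plain text."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    # Leave the last answer blank so the model's continuation fills it in.
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("Good morning", "Bonjour"), ("Thank you", "Merci")],
    "Where is the library?",
)
print(prompt)
```

Three examples stand in for the thousands a fine-tuning dataset would need, and switching tasks means changing the text of the prompt, not the model.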
Today, GPT-3 is in private beta, but boy can I not wait to get my hands on it.