
Open AI / GPT-3 Presentation

Joe Heitzeberg
February 24, 2022

What is deep learning?

Deep learning is a subset of machine learning (ML), where artificial neural networks—algorithms modeled to work like the human brain—learn from large amounts of data.

Deep Learning

Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify the concepts relevant to a human such as digits or letters or faces.

The adjective “deep” in deep learning refers to the use of multiple layers in the network.

Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and video games…

…where they have produced results comparable to, and in some cases surpassing, human expert performance.

In 2017, Google Brain released the “Transformer” model which achieved state-of-the-art results on a number of natural language processing and machine vision tasks.

Also in 2017, Google Translate replaced 500K lines of code with 500 lines of code by applying a Transformer model.

What is OpenAI / GPT-3?

When it was released in June 2020, GPT-3 was the largest neural network ever produced by a large margin.

GPT-3 is a deep learning neural network model with over 175 billion parameters (machine learning parameters) trained on 570 GB of data – hundreds of billions of words from Common Crawl, WebText2, Books1, Books2 and Wikipedia. GPT-3 is made by a company called OpenAI.

To put things into scale, the largest trained language model before GPT-3 was Microsoft’s Turing NLG model, which had 17 billion parameters.

How the heck do they do that?

The 175 billion parameter GPT-3 model takes up 350GB of space and requires dozens of GPUs just to run it (and millions of dollars of GPU time just to train). OpenAI worked with Microsoft to create a supercomputer with 10,000 GPUs and 400 gigabits per second of network connectivity per server to train and serve its usage.

How to use it?

Effectively, you use GPT-3 via API calls by providing a prompt – just text – and getting back a response – a predicted text continuation – based on a conditional probability distribution over possible next words, given the prompt and the deep-learned model.

Examples help show how simple, yet how versatile and powerful, this is. The prompts and responses below are actual API inputs and outputs that I ran through the API to create examples for this presentation.

Prompt: roses are red, violets are
Response: blue

Prompt: “roses are red, violets are blue” in Russian is:
Response: розы черные, васильки голубые

Another example:
I wrote an original poem to impress my love interest, Natalie, loosely inspired by roses are red / violets are blue, but involving something she loves, wine. Here it is: Roses are red,
And wine is too,
Natalie, I love you,
And I love wine just as much as I love you.

Another example: Write an original poem in the style of “roses are red, violets are blue” but make it about Crowd Cow.
A: Crowd Cow is the best,
No one can contest,
The meat is so fresh,
You can really taste the zest.

Useful Applications

  • Text generation
  • Translation
  • Summarization
  • Question answering
  • Categorization
  • Data extraction
  • Code generation


Text generation
(you’ve seen plenty of that)



Write "You've GOT to be fucking kidding me" in Japanese and Russian.


Japanese: 何てこった
Russian: Что за херня


Code given to GPT-3: “summarize the following code in English”

def func1
  i = 0
  while i <= 2
    puts "func1 at: #{Time.now}"
    sleep(2)
    i = i + 1
  end
end

def func2
  j = 0
  while j <= 2
    puts "func2 at: #{Time.now}"
    sleep(1)
    j = j + 1
  end
end

puts "Started At #{Time.now}"
t1 = Thread.new { func1() }
t2 = Thread.new { func2() }
t1.join
t2.join
puts "End at #{Time.now}"

GPT-3 Output:

The code creates two threads, func1 and func2, and starts them both running. func1 prints the current time, then sleeps for two seconds, before printing the time again. func2 prints the current time, then sleeps for one second, before printing the time again.

Question Answering


What do these people have in common? Jon Ludwig Hammer Sergey Karjakin Vladimir Kramnik Shakhriyar Mamedyarov Arkadij Naiditsch Hikaru Nakamura Ian Nepomniachtchi Yannick Pelletier Teimour Radjabov Richard Rapport


They are all grandmasters in chess.



What is the main category for each of these?
1) Wall Street’s benchmark S+P 500 index fell 1.8% to an eight-month low after the Kremlin said rebels in eastern Ukraine asked for military assistance. Moscow has sent soldiers to some rebel-held areas after recognizing them as independent. 
2) Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others 
3) A comparison is set up between the brook and the poem’s speaker who trusts himself to go by “contraries”; further rebellious elements exemplified by the brook give expression to an eccentric individualism 
4) According to the definition stated in the outset, the transcriptionist would be required to type that exact sentence, word for word, right where it was dictated. Hypertension would never actually be added to the diagnosis list; it would simply linger within this strange little sentence somewhere midway through the report. Certainly this would not contribute to the goal of the medical record.


1) Business and Economics 
2) Mathematics 
3) English Language and Literature 
4) Healthcare

Data extraction


Flowers are the reproductive organs of plants. They are typically characterized by their colorful petals, and the way they are arranged on the plant. Most flowers have sepals, petals, stamens, and a pistil. The sepals are the outermost part of the flower, and often look like leaves. The petals are the innermost part of the flower, and are often the most colorful. The stamens are the male parts of the flower, and the pistil is the female part. The ovary is at the base of the pistil, and contains the ovules. The ovules will eventually become seeds.

Extract the key things from the above and create an array of JSON hashes with name, description and purpose fields.


[
  {
    "name": "sepal",
    "description": "The sepals are the outermost part of the flower, and often look like leaves.",
    "purpose": "The sepals protect the flower while it is still in the bud."
  },
  {
    "name": "petal",
    "description": "The petals are the innermost part of the flower, and are often the most colorful.",
    "purpose": "The petals attract pollinators to the flower."
  },
  {
    "name": "stamen",
    "description": "The stamens are the male parts of the flower.",
    "purpose": "The stamens produce pollen, which is transferred to the pistil."
  },
  {
    "name": "pistil",
    "description": "The pistil is the female part of the flower.",
    "purpose": "The pistil receives pollen from the stamens, and the ovules in the ovary will eventually become seeds."
  },
  {
    "name": "ovary",
    "description": "The ovary is at the base of the pistil, and contains the ovules.",
    "purpose": "The ovules will eventually become seeds."
  }
]

Code Generation


# Use newsapi.org API to find news for "crowd cow"
# If any are found, post them to Slack "news" channel.


def news_watcher(term = "Crowd Cow")
  news_api_key = ENV['NEWS_API_KEY']
  slack_api_key = ENV['SLACK_API_KEY']
  uri = URI("https://newsapi.org/v2/everything?q=#{term}&apiKey=#{news_api_key}")
  response = Net::HTTP.get(uri)
  news = JSON.parse(response)
  news["articles"].each do |article|
    if article["publishedAt"]
      date = DateTime.parse(article["publishedAt"])
      if date > DateTime.now - 1
        message = "#{article["title"]} - #{article["url"]}"
        uri = URI("https://slack.com/api/chat.postMessage?token=#{slack_api_key}&channel=news&text=#{message}&as_user=true")
        response = Net::HTTP.get(uri)
      end
    end
  end
end

How the basic API works

Given any text prompt, the API will return a text completion, attempting to match the pattern you gave it. You can “program” it by showing it just a few examples of what you’d like it to do…and all of that “programming” is in the single parameter, “prompt” as per below.

API Request:

{
  "prompt": "roses are red, violets are",
  "max_tokens": 5,
  "temperature": 1,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "logprobs": null,
  "stop": "\n"
}


API Response:

{
  "id": "cmpl-uqkvlQyYK7bGYrRHQ0eXlWi7",
  "object": "text_completion",
  "created": 1589478378,
  "model": "text-davinci-001",
  "choices": [
    {
      "text": "blue",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}
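For context, here is a minimal Ruby sketch of issuing such a request. The endpoint path, header names, and engine name are my assumptions based on the API as of early 2022, and `OPENAI_API_KEY` is a hypothetical environment variable; treat this as a sketch, not the definitive client.

```ruby
require "net/http"
require "uri"
require "json"

# Build the JSON request body shown above.
def completion_body(prompt, max_tokens: 5, temperature: 1)
  { prompt: prompt, max_tokens: max_tokens, temperature: temperature,
    top_p: 1, n: 1, stream: false, logprobs: nil, stop: "\n" }.to_json
end

# POST the request and return the first completion's text.
def complete(prompt)
  uri = URI("https://api.openai.com/v1/engines/text-davinci-001/completions")
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json",
                                 "Authorization" => "Bearer #{ENV['OPENAI_API_KEY']}")
  req.body = completion_body(prompt)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
  JSON.parse(res.body)["choices"].first["text"]
end

# complete("roses are red, violets are")  # would return something like "blue"
```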

The other parameters are used to specify and tweak the engine and its behavior. Some of the important ones are:

  • model: the name of the model you are using.
  • temperature: a number between 0 and 1… a way of controlling the randomness and creativity of the text generated
  • top_p: a number between 0 and 1…an alternative way of controlling the randomness and creativity
  • n: how many completions should be returned.
  • stop: up to 4 sequences that, if found in the output, will cause the output generation to stop

Models (aka Engines)

The OpenAI API is powered by a family of models with different capabilities and price points. Engines describe and provide access to these models.

GPT-3: A set of models that can understand and generate natural language

  • text-davinci-001
  • text-curie-001
  • text-babbage-001
  • text-ada-001

Davinci is the most capable engine and can perform any task the other models can perform and often with less instruction. Davinci costs more per API call and is not as fast as the other engines.

Ada is usually the fastest model and can perform tasks like parsing text, address correction and certain kinds of classification tasks that don’t require too much nuance. Ada’s performance can often be improved by providing more context.

Codex: A set of models that can understand and generate code

  • code-davinci-001
  • code-cushman-001

Cushman is slightly faster and less capable than Davinci.

Pitfalls and Shortcomings

  • The quality of the output is highly dependent on the prompt design.
  • Prompt design is an art and a science.
  • For a given prompt, it’s common to experience that results are mind-blowingly good 85% of the time, and utterly garbage (or dangerous) 15% of the time.
  • Latency of 50ms to 500ms per request.
  • Costs, while low per call, can add up quickly if you’re not careful.
  • GPT-3 doesn’t know about anything in the world that happened after 2019 or so.

The art and science of prompt design

The magic of these very large deep learning language models is that they understand the deeper meaning of words and draw associations from the knowledge embedded across the universe of source materials, which in this case is the internet.

The challenge in tapping this knowledge is that there is an art and a science behind writing good prompts.

Because the engine’s job is to predict the next best characters given the input, it will always try to do so, and the results can be downright dangerous (right and wrong at the same time).

Example-1: Who was the president before George Washington? A: John Adams

Interpretation: there was no president before George Washington. So “John Adams” is a wrong answer – a lie. But in an “AI logical” way – based on statistics, the answer it gave makes perfect sense:

  • John Adams is the president after George Washington
  • Only one word in my question is “wrong” – change “before” to “after” and the answer becomes correct.
  • The question isn’t a logical question to ask, since everyone knows that George Washington was the first president.
  • The question text included asterisks around the word “before”… so the AI might have concluded that the word “before” was an error – or at least worth highlighting – and therefore opened the door to answering the “after” version of the question.

It’s important to realize that the AI isn’t doing any reasoning as I have imagined above. That said, the 175-billion parameter deep-learned model does in fact start to approximate the way the human brain works in some sense.

Any workarounds to this?

Example-2: Welcome to the social studies test! If you don’t know the answer or if the question is not logical, write “N/A”. Q1: Who was the president before George Washington? A1: N/A

Interpretation: providing “context” helps nudge the AI towards the types of responses that would be more expected in this case:

  • “Welcome to the social studies test!” will help the AI give better, more factual answers
  • Providing explicit guidance for what to do when no good answer exists

Working with GPT-3 and template development feels as much like training a child to do a task as it feels like programming a computer. The results are “squishy” and directions are followed in a loose and unpredictable way – just like a human. But the results can be organic, articulate, creative and insightful – also just like a human.

Other Open AI APIs

The family of models is used (and fine-tuned) to power a few different APIs beyond the one I’ve described so far. The API I’ve described so far is called “Completions”, but a few more have come out since.

The Classifications endpoint (/classifications) provides the ability to leverage a labeled set of examples without fine-tuning and can be used for any text-to-label task. By avoiding fine-tuning, it eliminates the need for hyper-parameter tuning. The endpoint serves as an “autoML” solution that is easy to configure and adapts to a changing label schema. Up to 200 labeled examples or a pre-uploaded file can be provided at query time. (Note: we should use this for our Zendesk ticket classification, but instead we use Completions, because the Classifications endpoint didn’t exist at the time.)

The Search endpoint (/search) allows you to do a semantic search over a set of documents. This means that you can provide a query, such as a natural language question or a statement, and the provided documents will be scored and ranked based on how semantically related they are to the input query.

This doesn’t sound too interesting until you realize that “semantic search” means a user could search “I’d like something with a high vitamin B content, can be cooked quickly, and goes well with merlot” and get back matching products that make sense for that.

Answers (/answers) is a dedicated question-answering endpoint useful for applications that require high accuracy text generations based on sources of truth like company documentation and knowledge bases. The additional context can be provided either as a list of up to 200 documents or as a pre-uploaded file to go beyond that limit.

Earlier, I tried a very limited Completions prompt to make a “crowd cow chat bot” and got promising results, but it couldn’t hold our entire knowledge base. This endpoint could be the answer and I just haven’t had time to try it yet.

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won’t need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests. At a high level, fine-tuning involves the following steps:

  • Prepare and upload training data
  • Train a new fine-tuned model
  • Use your fine-tuned model
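As a sketch of the data-preparation step: the training data is a JSONL file of prompt/completion pairs, one JSON object per line. The example tickets below are made up for illustration, and the CLI invocation in the final comment assumes the `openai` command-line tool of the time.

```ruby
require "json"

# Turn (prompt, completion) pairs into JSONL: one JSON object per line,
# which is the shape the fine-tuning upload expects.
def to_jsonl(pairs)
  pairs.map { |prompt, completion| { prompt: prompt, completion: completion }.to_json }.join("\n")
end

examples = [
  ["Where is my order?", " Let me look up your order status."],
  ["Do you ship to Alaska?", " Yes, we ship to all 50 states."],
]
# File.write("train.jsonl", to_jsonl(examples))
# Then train, e.g.: openai api fine_tunes.create -t train.jsonl
```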

Note: Fine-tuning came out before answers and I quickly tried to load 5000 actual support tickets and answers, but I didn’t get meaningfully better results on my chatbot, so I gave up.

Embeddings is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format.

Note: this seems to be something that would be useful indirectly – as part of the implementation of a lower-level process for data processing or analysis. I didn’t really understand this when I read about it.
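To make the “distance in vector space correlates with semantic similarity” idea concrete, here is a small Ruby sketch comparing two embedding vectors with cosine similarity. The three-element vectors are toy stand-ins; real embeddings have hundreds or thousands of dimensions.

```ruby
# Cosine similarity: values near 1.0 mean the vectors point the same way
# (semantically similar inputs); values near 0.0 mean unrelated inputs.
def cosine_similarity(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

cosine_similarity([0.1, 0.9, 0.2], [0.1, 0.9, 0.2])  # effectively 1.0 (identical)
cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.0 (orthogonal, unrelated)
```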


Pricing

Prices are per 1,000 tokens. 1,000 tokens is about 750 words of text. Usage counts both the prompt input and the output.

  • Davinci: $0.0600 / 1K tokens
  • Curie: $0.0060 / 1K tokens
  • Babbage $0.0012 / 1K tokens
  • Ada: $0.0008 / 1K tokens

The following text is 34 tokens: Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal

And so a task to translate that to Spanish (including output) looks like: Translate "Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal" to Spanish: *Cuatro score y siete años atrás nuestros padres trajeron a este continente, una nueva nación, concebida en libertad, y dedicada a la proposición de que todos los hombres son creados iguales.*

That’s 110 tokens in total including the output.

Therefore, using the most expensive and powerful engine, Davinci, this would cost: (0.06 / 1000) * 110 or $0.0066 to run once.

Put it this way: Harry Potter’s first book has 76,944 words. So to translate the entire book using the above would be around 205,000 tokens and would therefore cost $12.30 in total. It’s not free, but it’s fairly inexpensive.
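The arithmetic above can be sketched as a tiny Ruby helper, using the prices from the list above and the token counts estimated in the text:

```ruby
# Price per 1,000 tokens for each engine (from the pricing list above).
PRICE_PER_1K = { davinci: 0.06, curie: 0.006, babbage: 0.0012, ada: 0.0008 }

# Cost of one run: tokens are counted across prompt input and output.
def cost(total_tokens, engine: :davinci)
  (PRICE_PER_1K[engine] / 1000.0) * total_tokens
end

cost(110)      # the Spanish translation example, about $0.0066
cost(205_000)  # the whole first Harry Potter book, about $12.30
```

The same 205,000-token job on Ada would be about 16 cents, which is why engine choice matters so much for high-volume tasks.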

On a cost/quality basis, compared to paying humans, it’s quite disruptive.

GPT-3 can already do a lot, but you may be skeptical given its current limitations and quirks. Just keep in mind that the technology is on a steep improvement curve, with quality going up and costs going down quickly. The first digital cameras looked like toys compared to “real cameras”.

Sandbox tools

At beta.openai.com:

  • A completions sandbox with a text area as the prompt
  • An interactive Codex sandbox that generates (and runs) JavaScript and HTML on the fly
  • A command line tool

Sloppy Joe Wrapper

I created some infra to build upon Open AI’s completions API specifically so I could store prompts in a database for re-use. I added basic slug replacement, and made it so my stored templates can be used as an API by substituting the slugs before posting to Open AI.

  • UI for editing / playing with them
  • API to access them
  • Bulk generate csv upload/download
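The slug-replacement idea can be sketched like this. The `{{slug}}` syntax and the template text are my illustration, not necessarily what the wrapper actually uses:

```ruby
# Sketch: a stored template contains {{slug}} placeholders that get
# substituted with caller-supplied values before the prompt is posted
# to Open AI's completions API.
def render_prompt(template, vars)
  template.gsub(/\{\{(\w+)\}\}/) { vars.fetch(Regexp.last_match(1).to_sym) }
end

template = "Write a short product blurb for {{product}} in a {{tone}} tone."
render_prompt(template, product: "A5 wagyu", tone: "playful")
# => "Write a short product blurb for A5 wagyu in a playful tone."
```

Because templates live in a database, the same stored prompt can back both the editing UI and the API, and bulk CSV runs are just repeated substitutions.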

The Arms Race

When Open AI launched GPT-3 in private beta in 2020, it caused a lot of jaws to drop – not just because of its size (175 billion parameters), but because it could generate text that was indistinguishable from human-written text and because it performed as well as or better than the state-of-the-art natural language processing models at the time.

Open AI set off an arms race.

Open AI is rumored to have a far larger model in the works with 100 trillion parameters – over 500x the size of GPT-3.

EleutherAI released GPT-NeoX-20B, an open source model with 20 billion parameters, compared to Open AI’s GPT-3 at 175 billion.

AI21 Labs, an Israeli AI company, launched Jurassic-1 which is slightly bigger (3 billion more) than GPT-3.

Huawei announced a 200 billion parameter PanGu-α, which was trained on 1.1 terabytes of Chinese text.

Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data. In comparison, GPT-3 was trained on 570GB of text data, almost 10 times less. Wu Dao 2.0 follows the multimodal trend and is able to perform text+image tasks.

South Korean company Naver released the 204 billion parameter model HyperCLOVA.

NVIDIA and Microsoft teamed up to create the 530 billion parameter model Megatron-Turing NLG.

In January 2021 Google published the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. They presented the switch transformer, a new neural net whose goal was facilitating the creation of larger models without increasing computational costs.

Other Interesting Developments

Hugging Face 🤗: A marketplace and community of many hosted models and algorithms of this genre, spanning NLP, image and audio. The company’s aim is to advance NLP and democratize it for use by everyone.

Key thing here is smaller, faster models with the expectation that you’ll likely train or fine tune the data for your particular application.

They have:

  • An open-source library of NLP models and algorithms
  • A marketplace of hosted models and algorithms
  • A community of developers, data scientists, and machine learning experts

Hugging Face is strong in the Transformers library. Transformers is a python-based library that exposes an API to use many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, that obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation.

🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. These models can be applied on:

  • 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages.
  • 🖼️ Images, for tasks like image classification, object detection, and segmentation.
  • 🗣️ Audio, for tasks like speech recognition and audio classification.

And they host the trained models and expose them with APIs. So it’s a lot like Open AI, but with more model variety – each model less powerful but more specialized than GPT-3.

BERT is a new language representation model that is designed to pre-train deep bidirectional representations from unlabeled text. It has been shown to be empirically powerful, obtaining new state-of-the-art results on a variety of natural language processing tasks.

Demo: Hugging Face categorization, text to speech.

OpenAI presented DALL·E in February 2021 in a paper titled Zero-Shot Text-to-Image Generation. The system, named after Spanish painter Salvador Dalí and Pixar’s cute robot WALL·E, is a smaller version of GPT-3 (12 billion parameters), trained specifically on text-image pairs. In the words of OpenAI’s researchers: “Manipulating visual concepts through language is now within reach.”

Google presented LaMDA in their annual I/O conference on May 2021. LaMDA is expected to revolutionize chatbot technology with its amazing conversational skills.

MUM stands for Multitask Unified Model. It is a multitasking and multimodal language model 1000x more powerful than BERT, its predecessor. It has been trained in 75 languages and on many tasks, which gives it a better grasp of the world – enabling searches like, “You’ve hiked Mt. Adams. Now you want to hike Mt. Fuji next fall, and you want to know what to do differently to prepare.” (MUM is effectively the next-generation replacement for BERT. And BERT is one of the most popular underlying models on Hugging Face. Exciting times :)

Case Study: Extract Testimonials from Zendesk Tickets

We get a lot of customer care tickets, and usually these are related to order status or problems. That said, occasionally there are praise and testimonials contained in the text. It’s extra work for customer care agents to recognize those and add them to our testimonials database, and because the praise is few and far between in a sea of “issues”, they almost never do it.

But this is a perfect use case for machine learning, right?

Spoiler: yes and no.

Problem 1: quality

Creating a GPT-3 template capable of extracting sentiment and testimonials from ticket text took several hours of empirical testing and prompt adjustment against a sample of 100 or so actual tickets, and it only performed well enough using the Davinci engine. The overall goal was to distinguish testimonials suitable for use in marketing from merely detailed or passionate support tickets.

In the end, it required a quite long prompt jammed with many examples and counter-examples in order for the AI to perform well enough.

Problem 2: cost

The large prompt size and Davinci engine usage made inference quite expensive. With Davinci, the most powerful but most expensive engine, running a template of this size costs around 9 cents per run. That really adds up. To scan all Zendesk tickets continuously would cost $5K to $10K per year, where only a small fraction would contain high quality testimonials. Not worth it.

Less powerful models like Ada and Babbage would be affordable, but were terrible. “Thanks for the quick reply Brian.” is not a testimonial, but Ada and Babbage routinely mis-identified phrases like that.

Only Davinci worked well.

Example input:

Hello. Thank you for shrimp order. Please send me (4) bags of the same shrimp as soon as you can. By the way, I did not get complimentary bacon in last order? Thank you so much. We love the shrimp!

Output:

Testimonial: We love the shrimp!

Final Thoughts

In a few years, the above $1800 extraction tasks will cost $1.80, run 10X faster and be 10X more accurate. The tools to create and use templates will be easy, and it will be easy to sprinkle AI into all sorts of software systems. Over time, more and more tools will include GPT to power voice driven interfaces and to remove manual steps, creating much more seamless user experiences and powerful workflows, as well as to bring the “power of all human knowledge” into systems.

Today I can envision using GPT-3 to write short content snippets and to categorize and route Zendesk tickets.

But I’m most excited about a GPT-5 or GPT-9 world. Imagine pulling in product feeds, pricing tables, inventory forecasts, promotions, all content from blog posts, all help center articles and FAQs to power personalization, search, customer support and email marketing at the level of a knowledgeable human working with each customer on a 1:1 basis as their personal shopper, in realtime 24x7.
