In this instalment of his series on advanced topics in Generative AI, expert Ravi Das explores a cutting-edge solution known as the DALL-E 2 algorithm, which converts human language directly into an image that is returned as the AI Output.

One of the cutting-edge capabilities that has emerged in Generative AI, and in ChatGPT in particular, is the ability to take a query submitted by the end user in normal human language and convert it directly into an image as the Output. The specific algorithm that drives this functionality is “DALL-E 2,” which was also developed by OpenAI. How this process works is the focal point of this article.
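Before looking at the internals, it helps to see what this capability looks like from the end user’s side. The short example below is a minimal sketch that calls DALL-E 2 through the OpenAI Images API; it assumes the OpenAI Python SDK (version 1.x) is installed and that an OPENAI_API_KEY environment variable is set, and the prompt and image size are purely illustrative.

    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    # Submit a plain-language query; the service returns a generated image.
    response = client.images.generate(
        model="dall-e-2",
        prompt="a watercolor painting of a lighthouse at sunset",
        n=1,              # number of images to generate
        size="512x512",   # one of the sizes DALL-E 2 accepts
    )

    print(response.data[0].url)  # URL where the generated image can be retrieved

Under the hood, the process unfolds in the following four stages: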

    1. The Inputs: This is where DALL-E 2 takes either an audio or a textual description of the image that is to be created. That description is then transferred into the Generative AI Model.
    2. The Encoding: This is where the query that has been submitted (either via text or audio) is further processed by the DALL-E 2 algorithm. At this point, it makes use of a specialized kind of Neural Network called “Contrastive Language-Image Pre-training,” also referred to technically as “CLIP.” From here, the input provided by the end user becomes a mathematical, vector-based representation. The goal at this step is to capture as much of the “semantic meaning” of the input as possible.
    3. The Conversion: At this point in the process, the vector-based representations that have been produced by CLIP are directed into yet another algorithm, which is called the “Prior.” This can be a Diffusion Model (or an Autoregressive one, depending upon the requirements that have been set for the Generative AI Model), and it is deemed to be the first stage at which the submitted input actually starts to be converted into an image. To do this, the Prior makes use of a statistically based Probabilistic Model.
    4. The Generation: At this stage, after going through the required number of iterations in the previous step, the output produced by the Prior is transmitted over to the “Diffusion Decoder.” This is where the mathematical vectors computed in the previous step are converted into recognizable images, which can in the end be used as the Output that satisfies the query submitted to the Generative AI Model. (A simplified sketch of this four-stage pipeline follows this list.)
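To tie the four stages together, here is a purely conceptual sketch of the pipeline in Python. None of these functions exist in any OpenAI library; they are placeholders (with NumPy arrays standing in for the real neural networks) whose only purpose is to show how the data flows from text to embedding to image.

    import numpy as np

    rng = np.random.default_rng(42)

    def clip_text_encoder(prompt: str) -> np.ndarray:
        # Stand-in for CLIP (step 2): maps the text prompt to a fixed-length
        # vector that is meant to capture its semantic meaning.
        return rng.standard_normal(512)

    def prior(text_embedding: np.ndarray) -> np.ndarray:
        # Stand-in for the Prior (step 3), whether diffusion or autoregressive:
        # maps the CLIP text embedding to a CLIP image embedding.
        return text_embedding + 0.1 * rng.standard_normal(text_embedding.shape)

    def diffusion_decoder(image_embedding: np.ndarray) -> np.ndarray:
        # Stand-in for the Diffusion Decoder (step 4): turns the image
        # embedding into pixels. Here it simply returns a random 64x64 RGB image.
        return rng.random((64, 64, 3))

    prompt = "a photorealistic red fox reading a newspaper in a cafe"
    text_emb = clip_text_encoder(prompt)   # steps 1-2: input and encoding
    image_emb = prior(text_emb)            # step 3: conversion
    image = diffusion_decoder(image_emb)   # step 4: generation
    print(image.shape)                     # (64, 64, 3)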

The Mechanics of the CLIP

For those who are technically oriented in the world of Generative AI, a common question is: “How is the CLIP actually trained, so that it is useful to a Generative AI Model?” The answer: the training consists of two major processes, which are known as the Top Part and the Bottom Part.

The Top Part: In this particular phase, as its name implies, it is the top part of the process that CLIP handles first (assuming that an actual image has been submitted as an input to the Generative AI Model). At this point, the following happens:

    1. The CLIP breaks down the image into certain “Shared Spots” and further examines any relevant metadata that relates to the image.
    2. That metadata is then statistically “Joined” together, which allows any information it contains to be easily shared across the model.
    3. Ultimately, it is this statistical Joining that allows the Generative AI Model (such as ChatGPT) to fully understand the relationship between any text (whether written or spoken) and the images that have been submitted to it (see the sketch that follows this list).
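The “statistical joining” described above is, in the published CLIP work, implemented as a contrastive objective: embeddings of matching image/text pairs are pulled together in a shared space, while mismatched pairs are pushed apart. The sketch below shows that objective in PyTorch; the embeddings are random toy data, and the temperature value is a common default rather than a setting quoted from DALL-E 2.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalize both sets of embeddings so the dot product is cosine similarity.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Pairwise similarity matrix; matching image/text pairs sit on the diagonal.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0))
        loss_image_to_text = F.cross_entropy(logits, targets)
        loss_text_to_image = F.cross_entropy(logits.t(), targets)
        return (loss_image_to_text + loss_text_to_image) / 2

    # Toy batch: 4 images and their 4 matching captions, each embedded in 512 dimensions.
    image_emb = torch.randn(4, 512)
    text_emb = torch.randn(4, 512)
    print(clip_contrastive_loss(image_emb, text_emb).item())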

The Bottom Part: In this particular phase, as its name implies, it is the bottom part of the process that CLIP handles second (assuming that an actual image has been submitted as an input to the Generative AI Model). At this point, the following happens:

    1. The actual image generation happens in this phase, once again using the appropriate Diffusion Model that has been selected for this particular task.
    2. Any text inputs are also submitted to the DALL-E 2 algorithm.
    3. The above-mentioned text inputs are further encoded by making use of the “CLIP Encoder,” which creates high-quality, mathematical representations of the text input. These are technically known as the “CLIP Text Embeddings.”
    4. These Embeddings are then processed through the Prior algorithm (which, once again, can be an Autoregressive or a Diffusion based Model). The Prior generates the “CLIP Image Embeddings,” which mathematically correlate the Visual Context with the corresponding Textual Context (a simplified sketch of this step follows this list).
    5. At this last stage, the CLIP Image Embeddings are decoded by the appropriate “Diffusion Decoder.” It is here that the final images are created to satisfy the query that has been submitted to the Generative AI Model.
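As an illustration of step 4 above, the sketch below shows the general shape of a diffusion-style Prior: it starts from pure noise in the embedding space and repeatedly “denoises” it, conditioned on the CLIP Text Embedding, until an approximate CLIP Image Embedding emerges. The denoise_step function here is a hand-written placeholder, not the trained network that DALL-E 2 actually uses, and the step count is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    def denoise_step(x, text_embedding, step):
        # Placeholder "denoiser": nudges the noisy vector toward the text
        # embedding. In a real Prior this would be a trained neural network
        # conditioned on both the text embedding and the timestep.
        return x + 0.1 * (text_embedding - x)

    def diffusion_prior(text_embedding, steps=50):
        # Start from pure noise in the CLIP embedding space and iteratively
        # denoise it, conditioned on the text embedding.
        x = rng.standard_normal(text_embedding.shape)
        for step in reversed(range(steps)):
            x = denoise_step(x, text_embedding, step)
        return x  # approximate CLIP Image Embedding

    text_emb = rng.standard_normal(512)    # stand-in for a CLIP Text Embedding
    image_emb = diffusion_prior(text_emb)  # ready for the Diffusion Decoder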

To summarize, it takes both processes, the Top Part and the Bottom Part, working together to fully create an image. This process is illustrated in the diagram below:

Up Next: The Stable Diffusion Model (aka Latent Diffusion Model)

The fifth and final article in this series will unpack the inner workings of the Stable Diffusion Model, also known as the Latent Diffusion Model (LDM), and will explain the distinct advantages of using it in Generative AI.

