In the rapidly evolving media landscape, Israeli startup Cloudinary is setting the pace. Thousands of websites rely on it to handle media processing, and in recent months it has been competing head-to-head with the giants in offering Generative AI for media. From image optimization to pioneering Generative AI, the bootstrapped unicorn leads the way, now harnessing Large Language Models (LLMs) and offering category-defining media generation.
Today, I sat down with Tal Lev-Ami, Cloudinary's co-founder and CTO, a visionary on the frontlines of this revolution, to explore the exciting world of Generative AI, where creativity and technology blend into one. Cloudinary's automated, AI-driven media management empowers brands to deliver dynamic digital experiences at scale. Lev-Ami's journey with Cloudinary began with a commitment to excellence and an early conviction about AI's potential.
As our conversation unfolded, we explored Generative AI's intricacies, enterprise integration challenges, and the future of digital visual media.
How long have you been using Generative AI? What were the first things you used it for?
We've been using AI for many, many years. From early on, we had all sorts of algorithms to find the optimal quality, the optimal crop, and its placement. Generative AI is enabling a new generation of capabilities. It basically covers two areas: one is the generation of visual media and the other is large language models. Both of them sit under this envelope called Generative AI, but they're two separate things that come from separate advances.
On the visual side, the first generative feature we had was style transfer, where you take a source image that you want to update and match it against another image that contains the style that you want it to have - whether it is that of a particular artist or your brand guidelines or something like that. The AI then generates a derivative image that matches the content in the original image to the desired style. We've had that for around five years now.
Every time new capabilities appear in the industry, the state of the art moves, and we try to keep pace and do even a bit more. One important aspect is that we aim to serve organizations, in many cases at the enterprise level, and they need something reliable. So we focus on identifying and prioritizing the technology innovations that are already good enough to use at the enterprise level, the ones any brand can deploy while trusting the results.
About a year ago, GPT-4 and Stable Diffusion emerged, each a significant leap in its respective field. So we made a concerted effort across the company to determine exactly what we should do in the field of Generative AI and how we could best build on our legacy of AI expertise to provide enterprise-ready solutions for our customers. In the last year or so, we've launched multiple Generative AI features that brands are using to accelerate the value of their visual assets.
Tell us about the new features that you've added using Generative AI.
We have multiple features. On the visual side of things, one of them is called Generative Fill. The idea is that when you needed a different aspect ratio of an image in the past, you either had to lose some of the pixels or add black or white padding around it. Today, you can ask the algorithm to expand the canvas and put something outside the original frame that fully matches the context of the current image.
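For readers curious what this looks like in practice, here is a minimal sketch of invoking a feature like this through Cloudinary's URL-based transformation conventions. The cloud name ("demo"), asset name ("sample.jpg"), and the exact parameter spellings are illustrative assumptions drawn from Cloudinary's public documentation, not from the interview itself:

```typescript
// Minimal sketch: building a Generative Fill delivery URL by hand.
// "demo" and "sample.jpg" are placeholder cloud/asset names; the
// c_pad, ar_16:9, and b_gen_fill parameters follow Cloudinary's
// documented URL conventions but should be verified against the docs.
const cloudName = "demo";
const publicId = "sample.jpg";

// Pad the canvas to 16:9 and let generative AI paint the added area
// so the new pixels match the context of the original image.
const transformation = "c_pad,ar_16:9,b_gen_fill";

const url = `https://res.cloudinary.com/${cloudName}/image/upload/${transformation}/${publicId}`;
console.log(url);
```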
We also have a feature we call Generative Erase, where a user can tell the AI the name of an object they don't want in the image. The AI instantly erases it and replaces it with a background consistent with the context of the environment featured in that image. Similarly, we have a Generative Replace feature, which replaces a chosen object within an image with another item or subject, whatever the user asks for in their prompt.
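A rough sketch of how these two effects are typically expressed as URL parameters follows; the effect names (e_gen_remove, e_gen_replace) and prompts are assumptions based on Cloudinary's published transformation reference, included purely for illustration:

```typescript
// Sketch of Generative Erase and Generative Replace as URL effects.
// All names here are illustrative; check Cloudinary's docs for the
// authoritative parameter spellings.
const base = "https://res.cloudinary.com/demo/image/upload";

// Erase: name the unwanted object and the AI fills in the background.
const eraseUrl = `${base}/e_gen_remove:prompt_stop%20sign/sample.jpg`;

// Replace: swap one named object for another described in the prompt.
const replaceUrl = `${base}/e_gen_replace:from_shirt;to_leather%20jacket/sample.jpg`;

console.log(eraseUrl, replaceUrl);
```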
We also have Generative Recolor, which lets a user take an object in a given color and replace that color with one they choose, and which respects detailed requests for concepts such as shading, shadow, and light. We have Generative Restore, which takes an image that was overcompressed, a bit blurry, or otherwise lower quality and makes it pristine. We have Upscale, which takes a smaller image and enlarges it, inventing new details that were not in the original, to provide high-quality assets that can be published in any context. We also have Background Removal, which is not exactly generative, but is based on similar concepts and is something we've offered for some time.
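In the same spirit, here is a hedged sketch of how these remaining effects map onto single URL parameters. Again, the effect names mirror Cloudinary's documented conventions as I understand them and should be treated as assumptions:

```typescript
// Sketch of Recolor, Restore, Upscale, and Background Removal as
// one-parameter URL transformations. All identifiers are illustrative.
const base = "https://res.cloudinary.com/demo/image/upload";

// Recolor a named object to a chosen color, preserving shading and light.
const recolor = `${base}/e_gen_recolor:prompt_sweater;to-color_tomato/sample.jpg`;

// Restore an over-compressed or blurry image.
const restore = `${base}/e_gen_restore/sample.jpg`;

// Upscale a small image, synthesizing plausible new detail.
const upscale = `${base}/e_upscale/sample.jpg`;

// Background removal (related, but not strictly generative).
const noBackground = `${base}/e_background_removal/sample.png`;

console.log(recolor, restore, upscale, noBackground);
```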
These are the things we're doing on the visual side. There is more in the pipeline on the textual side of things, which is a different category. We're a very visual company. It's all about the images, the video, the 3D objects, and less about the text. So we wanted to see what we could do with LLMs, this incredible technology, that would be useful to our customers.
To do this, we’ve already released a tool that allows users to perform image transformations based on a complicated set of written instructions on how to manipulate the image. The user can save the request and apply it to any other image as well. They can do it all in the URL, since all of these instructions are compressed into a single URL.
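To make the "compressed into a single URL" idea concrete, here is a small sketch of how several manipulation steps can chain into one delivery URL, with each slash-separated segment applied in order. The cloud name, asset name, and overlay ID below are hypothetical:

```typescript
// Sketch of chaining multiple transformation steps into a single URL.
// Each segment is applied in sequence; names are placeholders.
const steps = [
  "c_fill,g_auto,ar_1:1,w_800",          // crop to a square, auto-focusing on the subject
  "l_logo_watermark,g_north_west,o_60",  // overlay a watermark, top-left, 60% opacity
  "f_auto,q_auto",                       // pick the best format and quality per viewer
].join("/");

const url = `https://res.cloudinary.com/demo/image/upload/${steps}/product-shot.jpg`;
console.log(url);
```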
It’s possible to do very complicated things with it, but it requires some technical knowledge. To make it accessible to everyone, we built a chat interface that connects with our documentation and allows you to ask the AI in natural language to do things like crop the image to focus on a specific object, add a watermark in the top left corner, resize it to the correct size, and so on. The key is to make the interface friendlier to people who haven't invested the time in reading the entire documentation and getting up to speed on how to do it themselves.
One important thing that we didn't release as a feature, because it's really a separate product, is what we call Final Touch, which lets you do virtual photoshoots. You upload an image of an object, and it will isolate that object and let you easily place it in various scenes that it can generate. You can tell it what type of theme and style you want. It's really cool, very visual, and very easy to use. A lot of what we do is designed to empower developers, but this one is something anybody can play with.
There is a lot more we’re working on, and this is a rapidly changing space where we’re excited about the new and increased value we can bring to our users.
