TL;DR
- Media is harder than text. Only about 7% of AI usage is multimedia, compared to over 50% for writing and practical guidance combined. The gap exists because visual output demands more specificity and more iteration.
- Think like a director. You don't create the media. You plan it, break it into steps, and guide AI through each layer.
- Three approaches work. Layer-by-layer image creation, programmatic video using code, and custom workflows that encode your brand and best practices.
- Plan for iteration. Even OpenAI showcases images labeled "Best of ~16." Multiple attempts are the process, not a failure.
- You don't need design skills. You need a repeatable workflow built around your brand, your standards, and the specific media your business produces.
Why AI Media Is Harder Than AI Text
Ask AI to write a blog post draft, and you'll get something usable in one shot. Ask it to create a branded product video, and you'll understand why media generation is a different category entirely.
The numbers tell the story. OpenAI's research on ChatGPT usage shows that multimedia accounts for just over 7% of all conversations, while writing and practical guidance together make up over 53%. People are generating text at scale. Media? Not so much.
The gap is visible at the organizational level too. McKinsey's State of AI survey found that 63% of organizations using gen AI create text outputs, but only about one-third generate images. Text dominates because it's more forgiving. A slightly off sentence can be edited in seconds. A slightly off image needs to be regenerated from scratch.
Why is media harder? Text is sequential and editable. Media is spatial, multi-dimensional, and far less tolerant of small errors. A misplaced element, wrong lighting, or inconsistent character breaks the entire output. Text AI needs to get the meaning right. Media AI needs to get the meaning, the composition, the color, the perspective, and the style right, all at once.
This is not a reason to avoid AI media generation. It's a reason to approach it differently.
Think Like a Director, Not a Designer
The biggest mistake business leaders make with AI media is treating it like text generation: write one prompt, expect a finished result.
Instead, think of yourself as a director. You're not painting the picture. You're planning the shot, setting the scene, and guiding the execution step by step.
This is exactly what working creators are doing. Wired profiled Josh Kerrigan, who produces an entirely AI-generated show called Neural Viz. His process looks nothing like "type a prompt and hit enter." He starts by writing: slug lines, action lines, dialogue, camera movements. Then he storyboards each shot. For each panel, he creates a still image using tools like Flux, Runway, or ChatGPT. He manually ensures lighting consistency. During dialogue scenes, he maintains sight lines. To get a handheld camera effect, he films his monitor with his iPhone and maps that motion onto AI footage.
"Everything I do within these tools is a skill set," Kerrigan told Wired.
Iteration is the process, not a sign of failure. When OpenAI introduced native image generation in GPT-4o, their own blog showcased outputs labeled "Best of ~16" for complex images and "Best of ~8" for simpler ones. Even the model's creators expect multiple generations before landing on the right result.
This matters beyond quality. The US Copyright Office ruled that purely prompt-generated AI art doesn't receive copyright protection. The distinction comes down to "the degree of human control, rather than the predictability of the outcome." Your direction, editing, and layering don't just improve the output. They're what makes it legally yours.
Three Practical Approaches That Work
1. Layer-by-Layer Image Creation
Instead of asking AI for a complete image in one prompt, break it into components.
The workflow:
- Start with the background. Generate or select your base scene, color, or environment.
- Add subjects one at a time. Generate characters, products, or objects separately. Use multi-turn conversation to refine each element.
- Upload references. AI models with in-context learning can analyze your brand assets, existing designs, or style references and match them.
- Compose and finish. Combine layers in a simple editor. Adjust spacing, alignment, and typography manually.
This approach works because you control each element independently. A bad background doesn't ruin your subject. A misplaced object can be swapped without starting over. You're working the way professional designers work: in layers.
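The layering idea can be sketched in code. The toy model below is purely illustrative (the `Layer` type, `composite`, and `replaceLayer` are not any real tool's API): an image is a stack of independent layers composited in z-order, so swapping one layer never disturbs the others.

```typescript
// Toy model of layer-by-layer composition. "Pixels" are strings on a
// small grid; null means transparent. Real tools work the same way at
// a much higher resolution: independent layers, stacked in z-order.

type Pixel = string | null;
type Layer = { name: string; pixels: Pixel[][] };

// Stack layers bottom-to-top; a transparent pixel lets lower layers show through.
function composite(layers: Layer[], width: number, height: number): Pixel[][] {
  const out: Pixel[][] = Array.from({ length: height }, () =>
    Array<Pixel>(width).fill(null)
  );
  for (const layer of layers) {
    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        const p = layer.pixels[y][x];
        if (p !== null) out[y][x] = p; // upper layers overwrite lower ones
      }
    }
  }
  return out;
}

// Regenerating one element means replacing one layer -- the rest are untouched.
function replaceLayer(layers: Layer[], name: string, next: Layer): Layer[] {
  return layers.map((l) => (l.name === name ? next : l));
}
```

A bad background is then a one-layer fix: regenerate it, call `replaceLayer`, and re-composite, exactly as the workflow above describes.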
2. Programmatic Video with Code
If your team needs repeatable, branded video content at scale, consider programmatic video creation with frameworks like Remotion, which lets you build videos in React code.
Why this works for non-video editors:
- Reproducible. Change a data point, swap a product image, adjust a headline, and render a new video. No timeline editing required.
- Templatable. Create one template and generate dozens of variations for different products, audiences, or campaigns.
- Version-controlled. Your video assets live in code, which means they can be reviewed, reverted, and iterated on like any other software.
You don't need to be a video editor. You need someone who can write basic code or work with an AI coding assistant, and the output is production-ready video.
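The data-driven core of this approach can be sketched without any video tooling. Remotion's real API is React-based (components rendered frame by frame via `useCurrentFrame()`, `interpolate()`, and `<Composition>`); the sketch below reimplements a minimal `interpolate` and adds a hypothetical `makeCampaign` helper to show how one template plus changing data yields many videos.

```typescript
// Sketch of programmatic video: content as data, animation as pure
// functions of the frame number. Remotion wraps this idea in React;
// here we keep just the dependency-free core.

type ProductSpot = { title: string; price: string; imageUrl: string };

// Linear interpolation over a frame range, clamped at both ends --
// the shape of Remotion's interpolate() helper, reimplemented here.
function interpolate(
  frame: number,
  [f0, f1]: [number, number],
  [v0, v1]: [number, number]
): number {
  const t = Math.min(Math.max((frame - f0) / (f1 - f0), 0), 1);
  return v0 + t * (v1 - v0);
}

// One template, many variations: change the data and re-render.
function makeCampaign(
  products: ProductSpot[]
): { props: ProductSpot; durationInFrames: number }[] {
  return products.map((p) => ({ props: p, durationInFrames: 90 })); // 3s at 30fps
}
```

In a real Remotion project, each entry would back a `<Composition>` whose component fades in with something like `interpolate(useCurrentFrame(), [0, 30], [0, 1])` as its opacity; swapping the product data and re-rendering is the whole "edit."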
3. Build a Custom Workflow That Knows Your Brand
The first two approaches give you techniques. This one gives you leverage.
The biggest barrier to AI media generation isn't the tools. It's that every time you start from scratch, you're re-explaining your brand, your dimensions, your style, your constraints. That's exhausting and inconsistent. The solution is building a repeatable workflow that encodes your business context, so the best practices of media generation don't depend on your memory.
This isn't theoretical. McKinsey's State of AI survey found that, out of 25 attributes tested, the redesign of workflows has the biggest effect on an organization's ability to see EBIT impact from AI. High performers are nearly three times as likely as others to have fundamentally redesigned their workflows. The pattern holds for media: the companies getting results aren't the ones with better prompts. They're the ones with better processes.
What a custom media workflow looks like in practice:
- Brand context baked in. Your colors, fonts, dimensions, tone, and visual references are pre-loaded into every generation step, not typed from memory each time.
- Step-by-step sequences, not single prompts. The workflow breaks "create a branded video" into a series of defined steps: brief, storyboard, generate stills, compose, review. Each step has clear inputs and outputs.
- Guardrails built in. McKinsey found that 27% of organizations already review all AI-generated content before use. A well-designed workflow makes review a built-in stage, not an afterthought.
- Repeatable by anyone on the team. The workflow captures the expertise so that a junior team member, or even an AI agent, can follow it and produce on-brand results consistently.
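A workflow like this can be encoded as data plus a small runner. Everything below is a hypothetical sketch, not any specific tool's API: brand context flows into every step, and any step flagged for review blocks the pipeline until it is approved.

```typescript
// Sketch of a media workflow encoded as data: brand context is
// pre-loaded, steps are explicit, and review is a mandatory gate
// rather than an afterthought. All names here are illustrative.

type Brand = { colors: string[]; font: string; dimensions: string };
type Step = { name: string; requiresReview: boolean };

const workflow: Step[] = [
  { name: "brief", requiresReview: false },
  { name: "storyboard", requiresReview: true },
  { name: "generate-stills", requiresReview: false },
  { name: "compose", requiresReview: false },
  { name: "final-review", requiresReview: true },
];

// Every step receives the same brand context; a step flagged for
// review must be approved before the pipeline continues.
function run(
  steps: Step[],
  brand: Brand,
  doStep: (s: Step, b: Brand) => string,
  approve: (output: string) => boolean
): string[] {
  const outputs: string[] = [];
  for (const step of steps) {
    const out = doStep(step, brand);
    if (step.requiresReview && !approve(out)) {
      throw new Error(`Step "${step.name}" rejected in review`);
    }
    outputs.push(out);
  }
  return outputs;
}
```

The point of the sketch: a junior team member (or an AI agent) supplies only `doStep`; the brand context, step order, and review gates are already decided by the workflow itself.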
This is where the real advantage is. The intimidating parts of AI media generation (the granularity, the iteration, the brand consistency) become manageable once you encode them into a process that your team or your AI tools can follow repeatedly.
Before You Start Generating
Four things to get right before opening any AI media tool:
- Define the specific asset. Not "we need content." What dimensions? What format? Where will it be used? What brand elements must appear?
- Start with still images. Master image generation before attempting video. The skills transfer, but video adds motion and timing on top of everything images require; trying to learn both at once compounds the difficulty.
- Build a reference library. Upload your brand's existing visuals, color palettes, and style examples. AI with in-context learning performs dramatically better with references than with descriptions alone.
- Budget for iteration. If a text task takes you one attempt, expect a media task to take five to ten. This is normal. Plan your timelines accordingly.
The Opportunity Is Process, Not Tools
87% of creators already use AI in their workflows, with more than 40% using it daily. The capability gap for image generation is among the smallest across all AI tools, meaning the barrier to entry is lower than you think.
The competitive advantage isn't access to AI tools. Everyone has that. The advantage is having a custom workflow built around your business: your brand guidelines encoded into every step, your quality standards enforced through review stages, and your best practices captured in a process that anyone on your team can follow.
You don't need to hire a designer or learn video editing software. You need to learn to direct AI the way a director leads a production, and then encode that direction into a repeatable workflow. The businesses that build this process will produce more media, faster, and more consistently than those who keep starting from a blank prompt every time.
