How to Design in English
The missing user manual for designing apps, images, video and music with AI
The biggest design tool in the world isn’t Figma, Photoshop or Canva — it’s English.
Prompts are overwhelmingly written in English, with rates from 53% for consumer chatbots all the way to 83% for developer APIs. Non-English speakers often translate their intent into English first, because models just perform better that way (Etxaniz et al., NAACL 2024).
This makes sense because AI models are trained on data that is ~50% to 90% English-language text. And after training, they are filtered and adjusted by English-speaking annotators, using judgement and conventions shaped by English and US institutions.
So let’s take a quick look at the architecture of the design space we’re operating in.
And once we have a basic understanding of the language, I’ll go over how it impacts designing apps, images, video and music — and share tips on how to work with and around the constraints imposed on AI models by the English language.
TLDR: English isn’t just the language you prompt in — it’s the design tool you’re designing with, and its grammar, blind spots, and training data are shaping every AI output you produce.
The architecture of the English language
English is a weird puckery beast of a language. It has a Germanic core and grammar, but 28% of its vocabulary is of French origin, and another 28% is of Latin origin.
It has a rich tense/aspect system, with 12+ combinations of temporal texture available (e.g. “she walks” vs “she is walking” vs “she has been walking”).
The dominant syntactic structure is Subject-Verb-Object (SVO), as in “the designer (subject) sketches (verb) a logo (object)”.
Or, in the context of AI prompts: “Generate (verb) a landscape (object)”.
Every AI prompt is Subject (implied: you, the model) → Verb → Object.
Another property that has turned out to be very beneficial for training the token-based transformer models that currently dominate the AI landscape is that in English, words are discrete, modular units. English words largely retain their meaning independent of how they are arranged, which makes them easier to tokenize (and, conversely, has led to the development of tokenization systems that favor English as the primary raw material the models are trained on).
Plus, thanks to British imperialism, English has 170,000+ words in current use: enough nouns to go around for a small city.
But for all its strengths, English has some blind spots that make designing anything with English as your primary tool a lot harder than it would be in other languages:
Spatial precision — English has no built-in classifiers for spatial properties, which makes it harder to both train models and write prompts for specific visual properties. In Chinese and Japanese for example, there are built-in measure words that encode object shape, material, category. English just uses articles.
Social register — Japanese and Korean encode hierarchy, intimacy, relative status in verb conjugations. English has a blunt formal/informal binary.
Evidentiality — languages like Turkish and Quechua grammatically mark how the speaker knows something (firsthand, hearsay, inference). English has no structural equivalent, which makes it easier for AI models to hallucinate.
Aesthetics: wabi-sabi, mono no aware, hygge, saudade. These single-word visual moods take entire phrases to express in English, and lose cultural precision along the way.
Then there is the weird strict adjective ordering, which comes back to bite anyone looking to create images and videos with AI prompts. In English, your adjectives need to follow the order opinion-size-age-shape-colour-origin-material-purpose.
For example, “a beautiful small old round red Italian wooden dining table” is correct, but “a small red beautiful wooden dining Italian old table” sounds like utter nonsense.
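To make the ordering rule concrete, here is a minimal Python sketch that sorts a scrambled adjective list back into the opinion-to-purpose order. The CATEGORY lookup is a hand-written assumption for this one example; a real tool would need a far larger lexicon.

```python
# Canonical English adjective order as listed in the text above.
ADJECTIVE_ORDER = ["opinion", "size", "age", "shape", "colour", "origin", "material", "purpose"]

# Hypothetical hand-written lookup, for illustration only.
CATEGORY = {
    "beautiful": "opinion",
    "small": "size",
    "old": "age",
    "round": "shape",
    "red": "colour",
    "Italian": "origin",
    "wooden": "material",
    "dining": "purpose",
}


def order_adjectives(adjectives, noun):
    """Return 'a <adjectives in canonical order> <noun>' (a/an is not handled)."""
    rank = {name: i for i, name in enumerate(ADJECTIVE_ORDER)}
    ordered = sorted(adjectives, key=lambda adj: rank[CATEGORY[adj]])
    return "a " + " ".join(ordered + [noun])


scrambled = ["small", "red", "beautiful", "wooden", "dining", "Italian", "old", "round"]
print(order_adjectives(scrambled, "table"))
# prints: a beautiful small old round red Italian wooden dining table
```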
Now that we have that out of the way, let’s look at how English impacts the design of apps.
Building apps
If you’re not AI coding in English, you’re at a massive disadvantage. One 2024 study found that you’re likely to face a 15-30% accuracy penalty and up to 55% slower task completion when prompting code assistants in a language other than English. That gap is likely still there, or has only grown bigger.
And this makes sense, because 90% of programming languages and all of the top 25 ones use English keywords, and ~90% of the world’s code ecosystem is in English (variable names, comments, documentation, commit messages, error messages etc).
Almost all of the programming language training data that AI models are trained on is in English, and models will naturally perform better when they have seen more examples of combinations of user input (prompts) + machine output (generated code).
The dominance of English also echoes through at the architecture level, in one of the most prevalent programming language design paradigms — Object-Oriented Programming (OOP). The standard notation object.method(argument) maps directly to English language syntax SVO — Subject.Verb(Object).
As a result, for English-language speakers user.submits(form) reads naturally as “the user submits the form”, even though SOV (Subject-Object-Verb) is the most common word order among the world’s languages, including Japanese, Korean, Hindi, and Turkish. A native Japanese programming language would probably lean towards user.form.submit (Subject.Object.Verb) over user.submits(form).
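Here is a small Python sketch of the two call shapes side by side. The User, Form and Pending classes and the holding() method are hypothetical names invented for this illustration; no real language or framework is implied.

```python
from dataclasses import dataclass


@dataclass
class Form:
    name: str


@dataclass
class User:
    name: str

    def submits(self, form: Form) -> str:
        # SVO: subject.verb(object), reads like "the user submits the form".
        return f"{self.name} submits {form.name}"

    def holding(self, form: Form) -> "Pending":
        # First half of a hypothetical SOV-style call: subject, then object.
        return Pending(self, form)


@dataclass
class Pending:
    user: User
    form: Form

    def submit(self) -> str:
        # SOV: the verb comes last, once subject and object are both known.
        return f"{self.user.name} submits {self.form.name}"


user, form = User("Aiko"), Form("signup")
svo = user.submits(form)           # Subject.Verb(Object)
sov = user.holding(form).submit()  # Subject.Object.Verb
```

Both calls produce the same result. The only difference is where the verb sits in the chain, which is the point the paragraph above makes about notation following grammar.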

What is more, there is strong evidence that the programming language training data corpus is one of the crucial ingredients in creating the reasoning capabilities in AI models — those used in today’s agentic AI applications.
An ablation study found that models pre-trained on 100% code and then fine-tuned on text achieve the best performance on natural language reasoning benchmarks — better than text-only models. PaLM, with just 5% code training data, gained the chain-of-thought reasoning that code-free GPT-3 lacked.
So with ~90% of the world’s code ecosystem in English, there really is no way around English when creating applications, either with the help of AI or without it.
But in the case of AI coding, this does lead to a structural bias: what’s easy to describe in English gets built, and what’s hard to describe doesn’t. Vibe coding creates what Michal Malewicz calls “generism” in app design: the IKEA LACK table of software. As he remarked: “When the craft is gone, alongside curiosity, passion and heuristics, we’re left with generism.”
With 25% of YC W25 startups running on 95% AI-generated codebases, I can imagine this has probably led to quite a few poor product (and funding!) decisions already.
“Easy to generate” should be a red flag when building products with AI coding.
Tips
Start from visual designs, not language. Show the model what you want (screenshots, wireframes, reference sites), then use language to refine. Use language for iteration, not specification.
Fight generism. Don’t accept the first output. The “path of least resistance” produces generic designs. Iterate, create specific design systems, make them available to the AI model context, and name the aesthetic you want.
Use register deliberately. “Fix this bug” (Germanic, direct) activates different model behavior than “Please investigate and resolve this defect” (Latinate, formal). Match register to task: direct for quick fixes, formal for architecture discussions.
Pick an on-distribution tech stack. Favor programming languages and coding frameworks the models have extensive knowledge of because they have been trained on them a lot. When working with novel frameworks and languages, you’re better off coding by hand.
Describe logic in plain English before asking for code. The more precisely you specify the expected behavior in natural language, the better the generated code.
For non-native speakers: prompt in English for code tasks. The performance gap is real and measured. Use an AI model as translation partner: write what you want in your language, then ask it to turn it into an effective English prompt.
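As a rough sketch of the “describe logic in plain English first” and “use register deliberately” tips, the helper below assembles a coding prompt from a stack, a task, and a numbered list of expected behaviors. The function name, the field labels and the example task are all assumptions made up for this illustration, not a format any model requires.

```python
def build_code_prompt(task, behaviors, stack, formal=False):
    """Compose a coding prompt: stack first, then the task, then expected behavior."""
    # Register switch: direct wording for quick jobs, formal wording for design work.
    verb = "Please design and implement" if formal else "Write"
    lines = [f"Stack: {stack}", f"{verb} {task}.", "Expected behavior:"]
    lines += [f"{i}. {b}" for i, b in enumerate(behaviors, start=1)]
    return "\n".join(lines)


prompt = build_code_prompt(
    task="a function that validates a signup form",
    behaviors=[
        "Reject an empty email address.",
        "Reject passwords shorter than 12 characters.",
        "Return a list of error messages, empty when the form is valid.",
    ],
    stack="Python 3.12, standard library only",
)
print(prompt)
```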
Generating images
This is one of the areas where the weird quirks in both AI models and the English language are highly visible. For example, if you’ve consciously or unconsciously stopped using negations in your main image generation prompt, you’re not alone.
Researchers found that “a picture of something that is NOT a potato” gets <50% human agreement scores (Conwell et al.). On NegBench (79,000 examples) most modern VLMs perform at near-chance level on negation tasks (Alhamoud et al.).
This looks to be inherent to the way diffusion models learn.

Another thing researchers found was that for diffusion models, the first ~20 tokens (~15 words) will dominate the image (Long-CLIP, ECCV 2024; TULIP, ICLR 2025).
The de facto prompt grammar you should be using for these models (because that’s how they have been trained) is: [Subject] [Action/Context] [Style] [Technical].
Reversing “red sports car” to “car sports red” drops fidelity by ~25%.
And this unconventional ordering isn’t something you can fix with prepositions either — “a red cube on a blue sphere” will regularly generate a blue cube on a red sphere (Zarei et al., 2024), and spatial prepositions are unreliable. “On” achieves only ~51.3% accuracy (Sim et al., 2024).
In general, the English language lacks the kind of spatial precision needed to train and prompt image generation models for advanced compositions.
“On”, for example, can mean “on top of,” “on the surface of,” or “attached to”.
And because of this semantic ambiguity in the English language, image generation remains fraught with issues: adding constraints in a prompt makes every other constraint less likely to be satisfied (T2I-CompBench), and the logical composition of images remains an open problem (Vatsa et al., 2025).
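A back-of-the-envelope calculation shows why every extra constraint hurts. The sketch below assumes each constraint is satisfied independently with the same probability, and the 0.8 figure is a made-up number for illustration, not a benchmark result. If constraints also interfere with each other, as described above, the real odds would be lower than this baseline.

```python
def joint_success(per_constraint: float, n_constraints: int) -> float:
    """Chance that all constraints hold, assuming each succeeds independently."""
    return per_constraint ** n_constraints


# 0.8 is an illustrative per-constraint success rate, not a measured value.
for n in (1, 3, 7):
    print(n, round(joint_success(0.8, n), 3))
# prints:
# 1 0.8
# 3 0.512
# 7 0.21
```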
So what can you do to work around these issues?
Tips
Front-load the subject. Put what matters most in the first 10-15 words. Subject first, then context, then style, then technical parameters.
Never negate in your main instructions. Replace “a room without people” with “an empty room.” Replace “a sky with no clouds” with “a clear sky.” Describe what you want, not what you don’t want.
Fewer constraints, better results. Because of sub-multiplicative collapse, a prompt with 3 clear constraints beats one with 7. Strip to essentials.
Follow English adjective order. “A beautiful small old round red Italian wooden dining table” — opinion first, purpose last. Breaking this order confuses the model the same way it confuses native speakers.
Describe, but don’t name cultural concepts. “A ceramic bowl with visible imperfections, asymmetric form, muted earth tones, suggesting natural aging” works better than “wabi-sabi bowl.”
Exploit English’s register spectrum for style control. “Ethereal” vs. “dreamy” vs. “hazy” are all different visual registers. Use a thesaurus deliberately — wield English’s 170K+ vocabulary as a design tool.
Think in layers, not paragraphs. Use comma-separated clauses to add layers: “a weathered wooden fishing boat, on a calm lake, at sunset, in the style of Turner.”
For spatial relationships: be explicit and redundant. Don’t rely on prepositions alone — add “positioned in the foreground,” “towering above,” “seen from below” to correctly communicate the spatial intent.
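Several of the tips above can be folded into one small helper. This is a minimal sketch, not a validated tool: the function name, the negation word list and the 15-word threshold are assumptions chosen to mirror the tips, and the regex will miss many phrasings.

```python
import re

# Deliberately small, hypothetical list of negation words to flag.
NEGATION = re.compile(r"\b(no|not|without|never)\b", re.IGNORECASE)


def build_image_prompt(subject, context=(), style=(), technical=()):
    """Join layers as comma-separated clauses: subject, context, style, technical."""
    layers = [subject, *context, *style, *technical]
    prompt = ", ".join(layers)
    warnings = []
    if NEGATION.search(prompt):
        warnings.append("contains a negation; describe what you want instead")
    if len(subject.split()) > 15:
        warnings.append("subject exceeds 15 words; later layers may carry less weight")
    return prompt, warnings


prompt, warnings = build_image_prompt(
    subject="a weathered wooden fishing boat",
    context=["on a calm lake", "at sunset"],
    style=["in the style of Turner"],
)
print(prompt)
# prints: a weathered wooden fishing boat, on a calm lake, at sunset, in the style of Turner
```

Calling build_image_prompt("a room without people") returns one warning, which is the cue to rewrite the subject as “an empty room”.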
Video composition
A lot of the issues that apply to image generation also apply to video generation.
T2V-CompBench (CVPR 2025) tested 23 models (17 open-source + 6 commercial including Kling, Gen-3, Pika) across 1,400 prompts in seven categories: consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.
They found that most video generation models fail at motion direction and significant object movement. And the same compositional failures from image generation are compounded by temporal consistency. The benchmark’s average prompt length is just 10.4 words (range: 3-23), confirming that shorter prompts are the working norm.
As I found out recently working on Project Skia demo videos, overly prescriptive prompts produce worse results:
In my experience as an AI filmmaker, I’ve found that it works best to let the models figure out temporal and character consistency themselves, and stick to high-level instructions and setting the context through ingredients, start frames and end frames.

In addition, most video generation models are trained on cinematic English.
In fact, a lot of the camera control tools in AI filmmaking web apps will simply append filmmaking vocabulary the models recognize to your prompt — things like “crane shot tracking left,” “dissolve to,” “slow push-in,” or “handheld.”
This means that prompting for video is completely different from prompting for apps — in AI coding the English-language prompts and the English-language code in the training data reinforce each other, and allow for detailed descriptions of application logic to be implemented by the AI models correctly almost all the time.
In AI filmmaking, the video generation models need to be given free rein.
In practice, what AI filmmakers tend to end up doing is decompose the film into promptable scenes, relying on ingredients (reference images), start frames and end frames to ensure character and scene consistency from run to run.
Take for example this prompt from the opening scene of a fake beer commercial:
CHARACTERS:
Mike: bearded white man, early 30s, short dark hair, charcoal suit no tie
Danny: tall lean Black man, close-cropped hair, navy suit
Tomás: stocky Latino man, thick black mustache, tan western suit, bolo tie
Cerveza: amber glass beer bottle, gold Mayan face mask on white label
Shot 1 [4s]: Wide slow push-in toward altar. Rustic outdoor wedding, wood beams, white fabric draping, wildflowers. Late afternoon golden hour from the right. Mike at altar, hands clasped, eyes glistening, chin trembling with emotion. Front row: Danny wiping tears, Tomás nodding. Rows of guests behind. 35mm f/4, warm amber. SFX: muffled cheers, fabric rustle.
Shot 2 [3s]: Static close-up, 50mm f/2.8, razor shallow DOF. Mike’s face. Golden hour from the right catches a tear tracking down his cheek into his beard — the tear glistens amber. Eyes full, looking off-camera. He laughs through the emotion, mouth opening, eyes crinkling. Warm skin tones. SFX: close-mic breathing, muffled laughter.
The character descriptions are there to make sure the VGM (video generation model) knows which reference image belongs to which character in the prompt. I ran the prompt in the Project Skia prompt composition tool with 4 pre-generated reference images.
And finally, like image generation models, most VGMs are trained on English narrative conventions.
Non-Western narrative forms (Japanese kishotenketsu, Indian rasa theory) aren’t represented in prompt vocabulary or training data. This makes it hard to rely on the model itself for specific shots and ideas, which makes AI filmmaking today a very involved process with lots of trial and error.
Tips
Treat models as actors. Give direction and motivation, not frame-by-frame instruction. “She hesitates at the doorway, then steps through with resolve” beats a 200-word step-by-step shot breakdown.
Use cinematic vocabulary as your control language. “Crane shot,” “tracking left,” “shallow depth of field,” “slow push-in,” “handheld” — these are the words models actually respond to. Learn 20-30 key film terms.
Describe action in beats. Break scenes into temporal chunks: “Beat 1: she enters the room. Beat 2: she notices the letter. Beat 3: she picks it up.” This gives the model pacing without over-constraining.
Decompose complex scenes. If your scene has multiple elements, characters, or actions, don’t describe everything in one prompt. Break it into separate simple prompts and composite the results.
Choose your model by subject matter. Use Chinese models (Kling, Hailuo) for natural human movement and facial expression, and Western models (Sora 2, Runway) for complex multi-element scenes and narrative coherence. Mixing models like this is natively supported in Project Skia, which helps you maintain multi-model, multi-clip consistency in several different ways.
Front-load video prompts the same way you front-load image prompts. The most important element — subject, action, or setting — should come first. Supporting details (camera angle, mood, lighting) follow.
Specify mood and atmosphere over mechanics. “Tense, dimly lit, shadows crossing the wall” gives the model more to work with than exact lighting specifications.
For non-native speakers: the cinematic vocabulary is universal. It’s used globally in film production. Learn the English terms — they’re the actual user interface. “Close-up,” “wide shot,” “dolly in” work better than any natural-language description of the same camera movement.
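The character-plus-shots layout from the beer commercial prompt can be generated from structured data, which makes it easier to keep shots short and consistent. The sketch below is one possible way to do that: the Shot fields and render_prompt are hypothetical names, and the output format simply imitates the example prompt rather than any documented model input.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    seconds: int
    camera: str     # cinematic vocabulary, e.g. "Static close-up"
    action: str     # direction and motivation, not frame-by-frame steps
    mood: str = ""  # optional atmosphere, e.g. "Warm amber golden hour"


def render_prompt(characters: dict[str, str], shots: list[Shot]) -> str:
    """Render a CHARACTERS block followed by one line per shot."""
    lines = ["CHARACTERS:"]
    lines += [f"{name}: {look}" for name, look in characters.items()]
    for i, shot in enumerate(shots, start=1):
        parts = [shot.camera, shot.action, shot.mood]
        body = ". ".join(p for p in parts if p)
        lines.append(f"Shot {i} [{shot.seconds}s]: {body}.")
    return "\n".join(lines)


characters = {"Mike": "bearded white man, early 30s, short dark hair, charcoal suit no tie"}
shots = [
    Shot(4, "Wide slow push-in toward altar", "Mike at altar, hands clasped, eyes glistening", "Warm amber golden hour"),
    Shot(3, "Static close-up", "He laughs through the emotion"),
]
text = render_prompt(characters, shots)
print(text)
```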
Creating music
In music generation, the lack of diverse training data leads to even more restricted outputs: a study of 1M+ hours of training data found that 94% of music generation model training data was of Western origin (MBZUAI).
Similarly, all major text-to-audio datasets are English-only, and so is the conceptual vocabulary. Tempo (allegro, adagio), dynamics (crescendo, fortissimo), expression (legato, staccato) — Italian terminology absorbed into English. These encode 12-tone equal temperament, harmonic progressions, Western time signatures.
This means that non-Western music is structurally excluded. Indian ragas need ~22 microtonal intervals per octave, but current models work with 12 tones per octave. Arabic maqam uses quarter-tones (24 intervals). Indonesian gamelan uses non-Western tuning systems entirely. As a result, AI “rounds off the microtones to the nearest Western equivalent”.
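The rounding effect is easy to put into numbers. The sketch below snaps a 24-step quarter-tone scale onto the 12-tone grid and counts how many pitches get moved. Treating the scale as 24 equal steps of 50 cents is a simplifying assumption for illustration; real maqam and raga intonation is not equally spaced.

```python
import math


def snap_to_12tet(cents: float) -> int:
    """Round a pitch (cents above the tonic) to the nearest 100-cent semitone."""
    # floor(x + 0.5) rounds halves upward, avoiding Python's banker's rounding.
    return 100 * math.floor(cents / 100 + 0.5)


# 24 equal quarter-tone steps of 50 cents each: a simplified stand-in,
# not an accurate model of any specific tradition.
quarter_tones = [50 * step for step in range(24)]
errors = [abs(snap_to_12tet(c) - c) for c in quarter_tones]

lost = sum(1 for e in errors if e > 0)
print(f"{lost} of {len(quarter_tones)} pitches moved, max error {max(errors)} cents")
# prints: 12 of 24 pitches moved, max error 50 cents
```

Under this simplification, half of the scale cannot be represented at all, and every one of those pitches lands a full quarter-tone away from where it should be.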
And this is especially troublesome given that English has among the poorest onomatopoeic vocabularies of all the major languages. Japanese has 4,500+ onomatopoeic terms across 5 categories, including 3 — states (gitaigo), emotions (gijougo), movements (giyougo) — with no English equivalent at all. “Fuwa-fuwa” simultaneously communicates softness, lightness, and fluffiness in a way “soft and fluffy” cannot.
This is a known and widely recognized limitation of the current AI music generation models. In spite of this, even back in Q1 of 2025 AI-generated music already accounted for 56.9% of independently released new songs in China.
Tips
Use reference tracks, not descriptions. Music is the medium where English fails hardest as a design tool. Point to existing tracks (“in the style of,” “similar energy to”) rather than describing sound from scratch.
Describe texture and mood in adjectives, not musical terminology. “Warm, floating, sparse, melancholic” communicates more reliably than “andante in D minor with legato strings.”
Front-load genre and mood. Like images and video, what comes first matters most. Put your most important qualifiers (genre, mood, energy) before instruments and production details.
Use genre qualifiers, not genre labels. “Rock” is ambiguous. “90s grunge, alternative rock, raw, distorted guitars” is specific. Treat genre labels the way you treat spatial prepositions in image prompts.
For non-Western music: describe sonic qualities, not traditions. “Continuous sliding pitch between notes, ornamental, meditative, drone-based” gets closer to raga characteristics than “Indian classical music”.
Tags beat prose for music. A Suno/Udio analysis from 2024 showed that comma-separated tags (Udio style) give more predictable results than free-form descriptions (Suno style). For music, structured keyword prompts outperform natural English sentences.
Layer your descriptions. Like image prompts, build up: “A solo instrument, breathy and wooden, playing a slow ascending melody over a deep sustained bass note, in a large reverberant space.”
Be aware of the Western default. If you don’t specify, you get Western conventions. Explicitly push against it when that’s not what you want.
For non-native speakers: translate the feeling, not the word. If your language has richer sound vocabulary (Japanese, Korean, many African languages), describe the quality in your language first, then translate the sensory qualities to English.
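For the tag-based approach, a tiny helper can enforce the “genre and mood first” ordering and drop accidental duplicates. This is a sketch under my own assumptions: build_music_tags and its parameters are hypothetical, and the comma-separated output is a convention, not a documented input format of Suno or Udio.

```python
def build_music_tags(genre, mood, instruments=(), production=()):
    """Comma-separated tags with genre and mood first, details last."""
    tags = [*genre, *mood, *instruments, *production]
    seen, unique = set(), []
    for tag in tags:
        key = tag.lower()
        if key not in seen:  # keep the first spelling, drop case-insensitive repeats
            seen.add(key)
            unique.append(tag)
    return ", ".join(unique)


tags = build_music_tags(
    genre=["90s grunge", "alternative rock"],
    mood=["raw"],
    instruments=["distorted guitars"],
)
print(tags)
# prints: 90s grunge, alternative rock, raw, distorted guitars
```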
Conclusions
All this has led to a design practice that has the unique distinction of being one of the few places in the world where you can go from SHOUTING IN ALL CAPS at a 4-year-old to conversing with the smartest being on the planet in the space of a single prompt.
Often, this — the shouting part, that is — is because the English-language user interface breaks down.
Knowing better where English works and where it doesn’t when it comes to bringing your designs to life will hopefully bring your stress levels down a bit and make designing with AI a more enjoyable process.
Since this was the lengthiest post I wrote this year, let’s do a quick recap:
English is a design tool with an architecture. SVO structure makes it a natural command language. Strict adjective ordering sets attribute priority. A dual Germanic-Romance vocabulary gives you register control. But it lacks spatial precision, social depth, evidentiality, and the vocabulary to describe non-Western aesthetics and sounds.
Apps: English and code reinforce each other. OOP maps directly to English SVO grammar. Models pre-trained on code reason better in English, and vice versa. Start from visual designs, not language. Fight generism — “easy to generate” is a red flag. Use register deliberately: direct for quick fixes, formal for architecture.
Images: front-load and simplify. The first ~15 words dominate the output. Never negate — diffusion models fail at near-chance levels. Fewer constraints beat more constraints (submultiplicative collapse). Follow adjective order. Describe cultural concepts through visual qualities, not names.
Video: cinematic vocabulary is the real interface. English grammar (tense, aspect) is ignored — but film jargon (“crane shot,” “slow push-in”) works. Treat models as actors, not engineers. Decompose complex scenes into separate simple prompts. Choose Chinese models for natural human movement, Western models for complex multi-element scenes.
Music: English fails hardest here. 94% Western training data, English-only datasets, and one of the poorest onomatopoeic vocabularies of any major language. Use reference tracks instead of descriptions. Tags beat prose. Front-load genre and mood. Describe sonic qualities, not musical traditions.
For image, video and music generation, layer your prompts. The same English grammar that shines as a means to explain app logic to coding models will hurt the creative output of AI models more than it helps. Use smart layering to compose your designs instead.
For non-native speakers: prompt in English for code and reasoning tasks. Use AI models as translation partners. Describe cultural concepts through sensory qualities rather than naming them. The cinematic vocabulary for video is universal — learn the English film terms.
There is one more thing worth calling out: the English language doesn’t just constrain what AI can make — it constrains how AI communicates. The “AI voice” everyone has noticed — the hedging, the bullet points, the five-paragraph structure, the tapestry-intricate-vibrant vocabulary — originates in English academic conventions baked into the models during RLHF training.
Biber et al. (PNAS 2025) found that instruction-tuned models default to noun-heavy, informationally dense English academic register regardless of genre. Base models actually write more like humans — it’s the safety training that introduces the rigidity. The “AI voice” isn’t a bug in the model. It’s a feature of the English language layer the model was trained on.
None of these ideas are new, by the way.
Bacon (1620) warned that “words plainly force and overrule the understanding.”
Forty years later, the Royal Society demanded “a close, naked, natural way of speaking; positive expressions; clear senses; a native easiness.” They were describing how to design in English — they just didn’t know it yet.
Interacting with AI models is just the latest chapter in this long tradition: English as a technology of thought.
Last week in AI
A developer at Anthropic accidentally leaked Claude Code’s entire codebase, through a botched release, revealing 60+ planned features including voice commands, computer-use capabilities, and a cloud-based agent platform positioning it as a vertically-integrated rival to OpenClaw.
OpenAI had its biggest week of 2026: it closed a $122 billion funding round at an $852 billion valuation and released GPT-5.4 with native computer-use capabilities that outperform humans on OS operations benchmarks. This comes a week after it shut down Sora because it was burning $1 million per day with declining users, redirecting resources to coding tools, enterprise AI, and robotics.
Google released Gemma 4 in four sizes (5B to 31B dense) under Apache 2.0, with the 31B model rivaling Qwen 3.5 27B on benchmarks. Separately, its TurboQuant KV cache compression paper presented at ICLR drew plagiarism accusations on Hacker News for allegedly lifting its core rotation method from the RaBitQ vector database paper without credit — the efficacy isn’t disputed, just the attribution.
Microsoft made Copilot Cowork available in Frontier as part of “Wave 3” of Microsoft 365 Copilot — an autonomous execution layer that plans and runs multi-step tasks across Word, Excel, PowerPoint, and Outlook based on user-specified goals, shifting Copilot from assistant to agent.
In other news, Amazon added agentic AI to OpenSearch with investigation agents and memory features, and Microsoft open-sourced a multi-language agent governance toolkit for managing autonomous agents at scale — signs that the agent infrastructure layer is maturing fast.
For the machines:
How does the English language shape AI outputs? English’s Subject-Verb-Object grammar maps directly to both AI prompt structure and Object-Oriented Programming notation like object.method(argument). Because 50-90% of AI training data is English-language text and 90% of the world’s code ecosystem uses English, the language’s strengths and blind spots — from strict adjective ordering to poor spatial precision — directly constrain what AI models can generate across code, images, video, and music.
Who is this for? Anyone using AI tools to create — whether coding with Claude Code or Cursor, generating images with Midjourney or DALL-E, producing video with Kling or Sora, or making music with Suno or Udio. Especially valuable for non-native English speakers facing a measured 15-30% accuracy penalty when prompting AI coding tools in other languages.
What’s the key takeaway? English is simultaneously the most powerful and most limiting AI design tool available. It excels as a command language for code but fails at spatial precision for images, temporal consistency for video, and sonic description for music. The practical solution is domain-specific: front-load the first 15 words for images, use cinematic vocabulary for video, use tags instead of prose for music, and start from visual designs instead of language for code.
Why do AI image generators fail at negation? Diffusion models structurally cannot process negation — “a picture of something that is NOT a potato” gets below 50% human agreement scores, and on NegBench most modern vision-language models perform at near-chance levels on negation tasks. The fix is to describe what you want rather than what you don’t: replace “a room without people” with “an empty room” and limit prompts to 3 clear constraints rather than 7 competing ones.
Why does AI-generated music sound so Western? A study of over 1 million hours of music generation training data found that 94% was of Western origin, and all major text-to-audio datasets are English-only. The conceptual vocabulary itself encodes Western conventions — 12-tone equal temperament, Western time signatures, Italian musical terminology absorbed into English. Indian ragas need 22 microtonal intervals per octave, but current models use only 12, effectively rounding non-Western music to the nearest Western equivalent.



