Elorian: Moving AI beyond language

We’re about to pull off our most ambitious event yet in SF this Wednesday.

If you’re coming to Figma Config, join us, Bolt.new, and Notion for our experiential festival turned afterparty: Welcome to the Stratosphere.

Food, drinks, and robots. Immediately followed by a live performance by Cheat Codes, the award-winning DJ trio behind top songs like No Promises.

Spots are limited. Vibes are immaculate. See you there.

Tues. June 23 - AI Champions Dinner (San Francisco) - If you’re spearheading AI usage at your company, this curated, three-course seated dinner is for you.
Wed. June 24 - Welcome to the Stratosphere (San Francisco)

Arek and Ethan 🦄

Elorian is an AI research and product lab building models that reason through visual information.

The company’s thesis is that current AI systems are still too dependent on language. Today’s vision-language models often convert visual inputs into text before reasoning over them. Elorian argues that this approach starts to break down when the task depends on spatial relationships, physical constraints, or details that are hard to compress into words.

Elorian is building what its founder and CEO Andrew Dai calls “visual thinking models.” These are AI systems designed to reason in the visual domain rather than convert images into text first. The goal is to bring the kind of reasoning progress seen in coding and math into visual tasks that require the ability to interpret real-world scenes.

If models can reason visually instead of only describing what they see, Elorian believes they could improve work across engineering, robotics, medicine, science, weather monitoring, disaster response, and precision agriculture.

Check it out: elorian.ai

Elorian is pre-product. It plans to release a model API by the end of the year, aimed at technical teams building products where visual understanding is core.

The first commercial layer is expected to center on access to its visual thinking model. The company is already speaking with potential customers about pilots across video understanding, robotics, engineering, and other vision-heavy workflows.

Long term, Elorian may build a platform around the model, with tools that make visual reasoning easier to use inside existing systems.

Elorian raised $55 million in seed funding from Striker Ventures, Menlo Ventures, and Altimeter, with participation from 49 Palms and prominent AI researchers including Jeff Dean.
It emerged from stealth in April 2026 as a multimodal reasoning research and product lab focused on visual intelligence.

In the summer of 2025, Andrew Dai started testing frontier models on something familiar: board games.

He often played with colleagues and friends, and after one game, he took photos of the board and asked the models simple questions like how many points a player had and how many resources were on the board.

The models struggled with questions a person could answer simply by looking.

Dai kept testing. He tried a New York Times crossword, then took a picture of a bar and asked what drinks were there, how many of each, and what needed to be refilled. The pattern held across tasks that required careful visual understanding rather than language.

That was the mismatch Dai kept coming back to. The industry was already talking about AGI, and frontier models could write, code, and generally reason through text. But when the task required basic visual understanding, they were still missing things that felt obvious to humans.

Dai had spent nearly 14 years at Google, including time at Google Brain and DeepMind, researching large-scale AI systems. At the time, Gemini and other frontier models were focusing heavily on coding and text-based reasoning, and Dai saw visual reasoning as an underdeveloped domain.

A coworker connected him with Yinfei Yang, who had worked as a research scientist at Google and Apple, specializing in multimodal foundation models. The fit made sense quickly. Dai left Google around Thanksgiving, Yang followed in December, and they incorporated Elorian that same month.

AI has made enormous progress in language, but it still struggles with the visual world.

That gap matters because much of human intelligence is not text-first. People understand space, motion, physical constraints, and visual relationships before they can explain them in words. A model that describes an image is not necessarily a model that reasons through what is happening inside it.

Current vision-language models often rely on a translation step. They convert visual information into language, then reason over the text. Elorian argues that this chain is fragile because many visual tasks cannot be compressed cleanly into words.

That blind spot shows up in surprisingly ordinary places. Dai tested frontier models on board games, photos, crosswords, bar shelves, maps, mazes, counting tasks, and other problems where the answer was already visible. These models could see everything they needed to. Where they struggled was in reasoning through visual structure.

Recent research points in the same direction. BabyVision, a benchmark designed to test visual reasoning beyond language, found that state-of-the-art multimodal models still fail on basic visual tasks that even 3-year-olds can solve easily. Its results showed leading multimodal models performing well below human baselines, reinforcing that visual reasoning remains underdeveloped.

Generating an image and reasoning visually are different problems. A model can create a convincing picture without understanding whether a design works, whether a robot can move through a room, whether a metro map connects correctly, or whether a count is off by one in a safety-critical setting.

Elorian is designed to reason in the visual domain, building the model and the missing visual-reasoning dataset together from scratch.

The timing matters. Text-based reasoning models like OpenAI’s o1 unlocked major progress in coding and math, but that same reasoning shift has not fully reached the visual domain.

The team makes that argument more credible. Dai and Yang are leading researchers who pushed language, data, and multimodal models to the frontier, giving them a clear view of both the progress and the limitations from the inside.

Frontier labs are already pushing multimodal systems, and many companies will add stronger visual capabilities to existing models. The question is whether visual reasoning becomes a feature inside language-first systems or a separate foundation that needs to be built differently.

Elorian is oriented around the second view: the next major step in AI will require models built to reason visually from the start. The visual thinking data it needs is not sitting online, waiting to be scraped. If it were, the major labs would already be using it. Elorian has to build the model and the data layer together, and that is what makes its wedge sharper.

The opportunity is large for a different reason. The AI market is over-indexing on coding and math because those are the places where reasoning models first showed obvious progress. However, much real-world work is done visually. Engineers design physical products visually. Traders read charts visually. Insurance teams evaluate damage visually. Robots, likewise, need to understand the world visually before they can act inside it.

No one designs the next iPhone entirely in code or builds rockets through text alone. If AI is going to move deeper into the physical world, visual reasoning cannot remain a secondary capability. The bullish case for Elorian is simple: the world is visual, and most AI still reasons as if it is not.

Elorian AI: $55 Million Raised For Multimodal Reasoning Research Lab Founded By Former DeepMind Leaders [Pulse 2.0]
Former DeepMind Researchers Bet on Visual AI With New Startup [Bloomberg Law]
Elorian AI [LinkedIn]
BabyVision: Visual Reasoning Beyond Language [arXiv]
ARC-AGI Benchmark [ARC Prize]
