
Lead Research Engineer

Kog

๐Ÿ“Paris, FR

Hybrid · Engineering

Posted 2mo ago · via ashby


Job Description

Kog is building the world's fastest AI execution layer, from the kernel to the model, on AMD MI300X hardware.

We are a team of 11 people, including 10 engineers and 5 PhDs, with a French Tech 2030 label and a benchmark published on AMD's official blog.

Every role here sits at the intersection of fundamental research and production-grade execution.

The problem

Most inference systems scale by adding GPUs and paying a communication tax at every layer. At each transformer block, all GPUs stop, synchronize via all_reduce, and resume. This happens once per transformer layer, 40 times per token on a standard 40-layer model. No matter how many GPUs you add, that bottleneck travels with you.
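
To make the tax concrete, here is a minimal back-of-the-envelope model (illustrative only, with hypothetical latency numbers, not Kog's code): with one blocking all_reduce per layer, the synchronization stall per generated token is simply layers times latency, and adding GPUs does not shrink it.

```python
# Illustrative sketch of the per-token communication tax of tensor
# parallelism with one blocking all_reduce per transformer layer.
# The 20 us latency below is a made-up number for illustration.

def per_token_sync_tax_us(num_layers: int, all_reduce_latency_us: float) -> float:
    """Each layer ends with a blocking all_reduce, so the per-token stall
    is layers x latency. More GPUs do not remove it (ring all_reduce
    latency actually grows with participant count)."""
    return num_layers * all_reduce_latency_us

# A standard 40-layer model with a hypothetical 20 us all_reduce
# stalls 800 us per token on synchronization alone.
print(per_token_sync_tax_us(num_layers=40, all_reduce_latency_us=20.0))  # 800.0
```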

At Kog, we refuse to pay it.

We co-design the assembly kernels and the model architecture around a single objective: overlap communication with computation entirely. The result is Laneformer, our proprietary architecture, designed so that all_reduce operations are deferred by one layer and hidden during layer computations. This is what makes the monokernel viable, and this is why our architecture team's work is structurally different from anything happening at standard labs.
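
A simple timing model shows why deferring the all_reduce by one layer matters (a sketch under idealized assumptions, with uniform per-layer costs and perfect overlap; not Kog's implementation):

```python
# Hypothetical timing model: defer each layer's all_reduce by one layer
# so it runs concurrently with the next layer's compute.

def blocking_time(n_layers: int, compute: float, comm: float) -> float:
    # Baseline: every layer computes, then blocks on its all_reduce.
    return n_layers * (compute + comm)

def deferred_time(n_layers: int, compute: float, comm: float) -> float:
    # Deferred: layer i's all_reduce overlaps layer i+1's compute, so only
    # the first compute and the last all_reduce are fully exposed.
    return compute + (n_layers - 1) * max(compute, comm) + comm

# 40 layers, 100 us compute and 20 us comm per layer (made-up numbers):
# when comm <= compute, communication is almost entirely hidden.
print(blocking_time(40, 100.0, 20.0))  # 4800.0
print(deferred_time(40, 100.0, 20.0))  # 4020.0
```

When per-layer communication fits under per-layer compute, the overlapped schedule pays the communication cost only once, at the final layer, instead of forty times.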

Where we stand today:

  • Our first dense model beats Llama 3.2-3B on CORE-centered accuracy benchmarks at 2,500 tokens per second per request on AMD MI300X

  • Our second dense model targets 5,000 tokens per second

  • Our third model, a MoE, already shows emergent capabilities on structured reasoning tasks where all dense models of comparable size score zero

  • Speculative decoding multiplier of 1.86x validated, opening a path toward 9,000+ tokens per second
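
The arithmetic behind the last bullet (the inputs are the posting's own figures; the calculation itself is just an illustration):

```python
# Applying the validated speculative-decoding multiplier to the
# second model's throughput target.
target_tps = 5_000       # second dense model's tokens-per-second target
spec_multiplier = 1.86   # validated speculative decoding multiplier

projected = target_tps * spec_multiplier
print(round(projected))  # 9300 -- hence "a path toward 9,000+"
```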

Three models in production or training. A validated architectural direction. A GPU team that extracts performance at the nanosecond level. The foundation is built.

We are now entering the design phase for Model 4, our largest flagship, and we are looking for the person who takes everything we have built and multiplies it.

The role

We are looking for a Lead Research Engineer to own the model architecture roadmap at Kog. You report directly to the CEO, manage a team of researchers and engineers, and are the person who decides how our next model is structured, trained, and evaluated.

This is a hybrid role. You design architectures, write training code, review experiments, and ensure the team ships models that run at speed on our engine.

Day to day, you own:

  • Roadmap definition and team delivery: you take scientific direction from the CEO, translate it into a concrete research plan, and are fully accountable for shipping models that hit performance targets

  • Architecture design for each new model generation: layer structure, MoE routing strategies, attention mechanisms, and the mathematical properties that enable communication-computation overlap on our hardware

  • The hardware-software co-design loop with the GPU team: you give kernel engineers the model constraints that determine what is achievable at speed, they give you the hardware properties that unlock new architectural strategies

  • Training pipeline ownership: convergence stability, distributed training efficiency, data pipeline design, and post-training optimization, including fine-tuning and speculative decoding

  • Experiment velocity: you structure the research process so the team runs rigorous, fast cycles rather than open-ended exploration

Who you are

You have trained large models from scratch, not just fine-tuned existing ones. You understand training dynamics at the level where you can diagnose a convergence issue from a loss curve before running ablations. You have made architectural decisions with hardware constraints as a first-class input, and you can point to specific choices that produced measurable inference gains.

You are either a research manager who has led a high-performance team at a top lab or tech company, or a senior/staff researcher ready to step into full technical ownership for the first time. Both paths are open. What is constant across both: you ship models and contribute actively to papers.

Must-have:

  • Experience training LLMs or complex architectures from scratch, with a deep understanding of training dynamics, convergence stability, and distributed systems

  • Architecture-level fluency in Transformers, MoE, and at least one alternative architecture family (SSMs, linear attention, or equivalent)

  • Production-grade engineering in PyTorch or JAX: robust, scalable code that bridges research and infrastructure

  • Experience leading or mentoring a research team with accountability for delivery alongside individual contribution

Strong signal:

  • Prior work on architecture-hardware co-design or inference-aware training

  • Experience with post-training optimization: RLHF, DPO, speculative decoding, or quantization-aware training

  • Published research with implementation impact, open-source contributions to training frameworks, or a public benchmark result you own

  • Familiarity with AMD hardware or experience porting training pipelines across GPU vendors

Top 0.1%

You have designed an architecture that ran faster because of a specific structural decision you made, and you can explain exactly why. You think about model design and hardware constraints as a single problem. When you read that Laneformer defers all-reduce by one layer to enable compute-communication overlap inside a monokernel, you immediately understand the constraints that decision imposes on every layer dimension, every routing strategy, and every training dynamic downstream. You have a public trace of your work: a repository, a paper with real implementation adoption, or a benchmark result that others reference. At Kog, the measure of success is tokens per second on production hardware, and you have already started thinking in those terms.

What we offer

  • Competitive compensation at the top of the Paris AI market, with a meaningful BSPCE package reflecting your level of contribution.

  • Real compute access. AMD MI300X clusters at the scale you need to validate your work, available from day one.

  • High autonomy, peer-to-peer learning, and zero bureaucracy. Technical decisions are made by the people closest to the problem.

  • Remote-first with one week per month in Paris (WeWork, 13th arrondissement, near Station F) for strategic alignment, in-person engineering depth, and team time.

Details

Department
Engineering
Work Type
hybrid
Locations
Paris, FR
Posted
February 2, 2026
Source
ashby