Lead GPU Engineer
Kog
Paris, FR
Posted 3mo ago · via Ashby
Job Description
Kog is building the world's fastest AI execution layer, from the kernel to the model, on AMD MI300X hardware.
We are a team of 11, including 10 engineers and 5 PhDs. We hold the French Tech 2030 label and have a published benchmark on AMD's official blog.
Every role here sits at the intersection of fundamental research and production-grade execution.
The problem
Every LLM inference system in production today shares the same architectural flaw: the CPU sits in the token generation loop. At every step, the GPU completes its work, hands control back to the CPU, waits, and resumes. This handoff happens tens of thousands of times per second. Eliminating it requires co-designing the execution engine and the model architecture together, from scratch.
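To make the handoff concrete, here is a minimal sketch of the conventional host-driven decode loop in HIP C++. It is illustrative only: decode_step stands in for a full transformer forward pass, and sampling is reduced to a host-side argmax.

```cpp
// Conventional host-driven decode loop (illustrative; decode_step stands in
// for a full transformer step). Every token pays a launch, a blocking copy,
// host-side sampling, and a relaunch: the CPU never leaves the loop.
#include <hip/hip_runtime.h>
#include <vector>

__global__ void decode_step(float* logits, int vocab, int step) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < vocab) logits[i] = (float)((i + step) % 7);  // placeholder compute
}

int main() {
  const int vocab = 32000, max_tokens = 128;
  float* d_logits;
  hipMalloc(&d_logits, vocab * sizeof(float));
  std::vector<float> h_logits(vocab);

  for (int t = 0; t < max_tokens; ++t) {
    // GPU produces next-token logits for this step.
    hipLaunchKernelGGL(decode_step, dim3((vocab + 255) / 256), dim3(256),
                       0, 0, d_logits, vocab, t);
    // Control returns to the CPU: blocking copy, then host-side sampling.
    hipMemcpy(h_logits.data(), d_logits, vocab * sizeof(float),
              hipMemcpyDeviceToHost);
    int next = 0;  // argmax sampling, simplified
    for (int i = 1; i < vocab; ++i)
      if (h_logits[i] > h_logits[next]) next = i;
    (void)next;  // would seed the next decode_step in a real engine
  }
  hipFree(d_logits);
}
```

Every iteration pays a launch plus a blocking device-to-host copy; that round trip is the overhead a monokernel design removes.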
That is what we did.
Kog built a monokernel: a single GPU kernel that runs from the first token to the last, with the CPU entirely out of the loop. Our models are architecturally designed so that GPU-to-GPU communication overlaps with computation, removing the all_reduce bottleneck that limits every other system at scale.
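A persistent-kernel sketch shows the shape of the idea. This is a generic pattern, not Kog's implementation: it uses one workgroup for simplicity, where a production engine coordinates the entire grid and overlaps GPU-to-GPU communication with compute.

```cpp
// Persistent-kernel sketch (generic pattern, not Kog's code): one launch,
// and the entire token loop lives on the GPU. One workgroup for simplicity;
// a production engine coordinates the full grid.
#include <hip/hip_runtime.h>

__global__ void monokernel(float* logits, int* out, int vocab, int max_tokens) {
  for (int t = 0; t < max_tokens; ++t) {
    // Stand-in for a full transformer step: threads fill logits cooperatively.
    for (int i = threadIdx.x; i < vocab; i += blockDim.x)
      logits[i] = (float)((i + t) % 7);
    __syncthreads();
    if (threadIdx.x == 0) {
      int next = 0;  // on-device argmax sampling: no host round trip
      for (int i = 1; i < vocab; ++i)
        if (logits[i] > logits[next]) next = i;
      out[t] = next;
    }
    __syncthreads();  // keep logits stable until sampling completes
  }
}

int main() {
  const int vocab = 32000, max_tokens = 128;
  float* d_logits; int* d_out;
  hipMalloc(&d_logits, vocab * sizeof(float));
  hipMalloc(&d_out, max_tokens * sizeof(int));
  // A single launch covers the sequence from the first token to the last.
  hipLaunchKernelGGL(monokernel, dim3(1), dim3(256), 0, 0,
                     d_logits, d_out, vocab, max_tokens);
  hipDeviceSynchronize();
  hipFree(d_logits); hipFree(d_out);
}
```

One launch covers the whole sequence and sampling happens on-device, so the host never re-enters the loop.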
Where we stand today:
2,500 tokens per second per request on AMD MI300X
Model v2 targeting 5,000 tokens per second
Speculative decoding multiplier of 1.86x validated, opening a path toward 9,000+ (1.86 × 5,000 ≈ 9,300 tokens per second)
This is a different stack, built from different first principles.
The role
We are looking for a Lead GPU Engineer to own the technical execution of the Kog Inference Engine. You report directly to the CEO, manage a team of engineers writing assembly-level kernels on AMD Instinct hardware, and set the technical standard the team operates to.
This is a player-coach role. You write code, review it at the nanosecond level, define the roadmap, make architecture calls, and ensure the team ships.
Day to day, you own:
Kernel architecture for the monokernel pipeline: memory hierarchy design, wavefront scheduling, and weight streaming strategies that hide HBM load latency behind computation (see the double-buffering sketch after this list)
Hardware reverse-engineering of AMD GPU behaviors specific to the MI300X topology: XCD/IOD interactions, address hashing and its mapping to physical location, register file exploitation, and wavefront scheduling edge cases that exist nowhere in the documentation. Profiling at the microsecond level is the primary source of truth (a minimal timing harness follows the sketch below).
The hardware-software co-design loop with the Architecture team: you give researchers the constraints that determine what model structures run at speed, and they give you the model properties that unlock new kernel strategies
Roadmap definition and team delivery: you take direction from the CEO, break it into concrete engineering milestones, and are fully accountable for shipping them
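The weight-streaming strategy referenced above is, at its core, double buffering: while wavefronts compute on the tile resident in LDS, the next tile is already in flight from HBM. A generic illustration follows; the name stream_dot and the per-thread dot product are ours, and a real engine stages GEMM weight tiles shared across a wavefront, but the pipelining pattern is the same.

```cpp
// Double-buffered weight streaming (generic technique, illustrative only):
// compute on tile t in LDS while tile t+1 streams from HBM, hiding load
// latency behind computation. Launch with blockDim.x == TILE, n % TILE == 0.
#include <hip/hip_runtime.h>

#define TILE 256

__global__ void stream_dot(const float* w, const float* x, float* y, int n) {
  __shared__ float buf[2][TILE];  // two LDS buffers: one computing, one loading
  int tid = threadIdx.x;
  float acc = 0.0f;

  buf[0][tid] = w[tid];  // prefetch the first tile
  __syncthreads();

  for (int t = 0; t < n / TILE; ++t) {
    int cur = t & 1, nxt = cur ^ 1;
    if ((t + 1) * TILE < n)
      buf[nxt][tid] = w[(t + 1) * TILE + tid];  // issue the next load early
    acc += buf[cur][tid] * x[t * TILE + tid];   // compute on the current tile
    __syncthreads();  // tile t fully consumed; tile t+1 now resident
  }
  atomicAdd(y, acc);
}

int main() {
  const int n = 1 << 16;
  float *w, *x, *y;
  hipMalloc(&w, n * sizeof(float));
  hipMalloc(&x, n * sizeof(float));
  hipMalloc(&y, sizeof(float));
  hipMemset(y, 0, sizeof(float));
  hipLaunchKernelGGL(stream_dot, dim3(1), dim3(TILE), 0, 0, w, x, y, n);
  hipDeviceSynchronize();
  hipFree(w); hipFree(x); hipFree(y);
}
```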
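And the timing harness mentioned in the profiling bullet, in its simplest form: HIP events around a batched launch. This is generic tooling, not Kog's, and candidate_kernel is a placeholder workload. Event deltas resolve to roughly a microsecond, enough to confirm or reject a kernel change before reaching for rocprof and hardware counters.

```cpp
// Minimal kernel timing harness using HIP events (illustrative).
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void candidate_kernel(float* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;  // stand-in workload
}

int main() {
  const int n = 1 << 22, iters = 100;
  const dim3 grid((n + 255) / 256), block(256);
  float* d;
  hipMalloc(&d, n * sizeof(float));

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  // One warm-up launch so first-launch overhead does not skew the mean.
  hipLaunchKernelGGL(candidate_kernel, grid, block, 0, 0, d, n);

  hipEventRecord(start, 0);
  for (int i = 0; i < iters; ++i)
    hipLaunchKernelGGL(candidate_kernel, grid, block, 0, 0, d, n);
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  printf("mean kernel time: %.1f us\n", 1000.0f * ms / iters);

  hipEventDestroy(start);
  hipEventDestroy(stop);
  hipFree(d);
}
```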
Who you are
You have written assembly or PTX/CDNA kernels for production inference workloads as the primary implementation path. You understand the AMD CDNA memory hierarchy at the level where you can predict cache behavior before profiling confirms it. You have shipped optimizations that produced measurable throughput improvements, and you can point to the specific decisions that created them.
You are either an experienced engineering manager who has led high-performance systems teams or a senior/staff engineer at a tier-1 compute company ready to step into full technical ownership. Both paths are open. What is constant across both: you manage people and write code in the same week, treating each with equal seriousness.
Must-have:
Production experience with GPU kernel optimization in C++ and/or Assembly on NVIDIA or AMD hardware (CUDA, PTX, ROCm, HIP, or CDNA)
Demonstrated ability to extract throughput gains from hardware through profiling, with specific decisions you can trace to specific results
Experience leading or mentoring a technical team with accountability for delivery, alongside individual contribution
Strong signal:
Direct experience with AMD MI300X or Instinct series hardware
Prior work on inference engine components: KV cache management, attention kernel optimization, quantization-aware implementations
Contributions to open-source inference systems (vLLM, TensorRT-LLM, or equivalent)
Top 0.1%
The best person for this role has a public trace of their work at the hardware level: a repository, a published benchmark, a conference talk, or a profiling methodology others have adopted. They understand that the binding constraint at Kog is the speed of the engineering judgment loop, and they shorten it.
What we offer
Competitive compensation at the top of the Paris AI market, with a meaningful BSPCE package reflecting your level of contribution.
Real compute access. AMD MI300X clusters at the scale you need to validate your work, available from day one.
High autonomy, peer-to-peer learning, and zero bureaucracy. Technical decisions are made by the people closest to the problem.
Remote-first with one week per month in Paris (WeWork, 13th arrondissement, near Station F) for strategic alignment, in-person engineering depth, and team time.
Details
- Department: Engineering
- Work Type: Hybrid
- Locations: Paris, FR
- Posted: January 6, 2026
- Source: Ashby