Commit Graph

13294 Commits

Author SHA1 Message Date
George Hotz
770dac0e0d broadcast 2026-05-14 17:04:37 -07:00
George Hotz
b827858479 broadcast shape 2026-05-14 17:01:20 -07:00
chenyu
09096ea565 test_gradient_through_clone (#16203)
backward through clone crashes now
2026-05-14 19:26:47 -04:00
George Hotz
d4dcd8487b aggressive shape check to prepare for broadcasting (#16202)
* add implicit broadcasting to shape

* NOOP/ALLREDUCE fixes
2026-05-14 16:15:44 -07:00
George Hotz
83ec66da34 fix a fastdiv edge case (#16199) 2026-05-14 13:12:18 -07:00
nimlgen
62ea73719d hcq2: share more with graph (#16196)
* share more with graph

* comment
2026-05-14 22:28:11 +03:00
George Hotz
3b8cc31759 disable fast idiv by default, it's broken (#16197)
* disable fast idiv by default, it's broken

* fix fast idiv tests
2026-05-14 11:48:27 -07:00
Christopher Milan
8f811649ff better compiler_cpu invalid arch errors (#16194) 2026-05-14 14:36:14 -04:00
qazal
f03a7fd6d1 viz/cli: readable uop json (#16195)
* viz/cli: readable uop json repr

* work

* better
2026-05-14 21:33:10 +09:00
C T
1b779a9058 add gelu approximate="none" (match pytorch) (#16162)
* add gelu approximate="none" (match pytorch)

* lint

* pass through onnx Gelu approximate

* type annotate

* explicit math.sqrt

* keep tinygrad's gelu approximate="tanh" default
2026-05-13 18:53:24 -07:00
chenyu
dd9187d9ee minor hash cleanups (#16190)
same kernels
2026-05-13 20:59:24 -04:00
wozeparrot
88ac2ac1fd llama: cleanups (#16189) 2026-05-13 17:08:06 -07:00
Christopher Milan
9a365d9978 ci: fix null image tests (#16188) 2026-05-13 18:00:05 -04:00
nimlgen
ad1fb7c981 hcq2: graph (#16186)
* keep this for now

* early graph
2026-05-13 22:49:43 +03:00
chenyu
3f9f6a51b2 minor image_conv2d cleanup (#16187)
remove some no-op slices
2026-05-13 15:47:40 -04:00
b1tg
59c34b9fe0 llm: precise device (#16159)
* llm: precise device

* llm: pass device to precompute_freqs_cis
2026-05-12 21:16:42 -07:00
b1tg
3c806ff406 clean up gguf (#16160) 2026-05-12 21:16:10 -07:00
wozeparrot
e97f2c1114 llama: only gemm + fa custom kernel (#16180)
* llama: tie store to grad directly

* llama: set mp flags

* llama: non fused grad fp8 quantize path
2026-05-12 21:03:49 -07:00
chenyu
38d407fd58 simplify svd more (#16181)
all the slowness is scheduling
2026-05-12 23:48:22 -04:00
Christopher Milan
f1fdd2ccec ci: add IMAGE=1 compile-only tests (#16182)
* ci: add IMAGE=1 compile-only tests

* fix
2026-05-12 23:40:32 -04:00
George Hotz
faf7fb7513 update nir renderer for new image style (#16179)
* update nir renderer for new image style

* don't cast image indexes
2026-05-12 20:25:01 -07:00
Christopher Milan
7d0c5ab689 ci: ocelot needs nvcc on linux (#16178)
* ci: ocelot needs nvcc on linux

* cudart
2026-05-12 23:13:48 -04:00
chenyu
32138c2418 svd to mixin (#16175) 2026-05-12 22:29:01 -04:00
George Hotz
69e1f3b551 remove vec2 from image in gater (#16165)
* remove vec2 from image in gater

* only simple idx

* fix python with new image style

* fix vconst

* just vconst and stack

* cast to int there

* fix for const

* fix process replay
2026-05-12 19:25:52 -07:00
chenyu
2172363be5 don't use Tensor indexing in svd (#16174)
prepare mixin, also about 4X faster for 8x8 input
2026-05-12 21:56:19 -04:00
chenyu
420a08c6d1 qr to mixin (#16173) 2026-05-12 21:23:25 -04:00
chenyu
c6a82fe927 functional qr and svd (#16172)
no clone and setitem, will move to mixin next. slightly faster but still quite slow
2026-05-12 19:12:08 -04:00
Christopher Milan
3844a31f87 ci: untangle cuda/ocelot, less apt (#16171)
* ci: untangle cuda/ocelot, less apt

* ldconfig
2026-05-12 18:14:03 -04:00
Christopher Milan
316607f004 dsp: don't use docker in ci (#16167)
* dsp: don't use docker in ci

* add setup script for macos docker
2026-05-12 17:11:03 -04:00
chenyu
bdcdf1f1a1 jittable masked_select and nonzero (#16170)
* jittable masked_select and nonzero

make jittable with `size=`, matches jax

* COMPILE_ONLY
2026-05-12 16:39:36 -04:00
wozeparrot
a613bcfc6d allow after on contiguous in spec (#16169)
* feat: allow after on contiguous

* feat: add test
2026-05-12 13:11:44 -07:00
chenyu
7c3e3fa154 fix empty input for masked_select and nonzero (#16168) 2026-05-12 15:36:51 -04:00
chenyu
da3b7e89a4 atol in test_custom_kernel_multi_output_backward_interacting (#16166) 2026-05-12 14:42:12 -04:00
chenyu
25583f6dc1 fix cumsum dtype for 0d input (#16164) 2026-05-12 14:18:08 -04:00
George Hotz
64c81dfd24 add all codegen stages to spec_tensor (#16163) 2026-05-12 10:35:38 -07:00
chenyu
f3e3c3851f explicit args to Tensor.rand (#16161)
added requires_grad, other kwargs were silently dropped
2026-05-12 12:53:39 -04:00
nimlgen
e93fb5f9b9 hcq2: remove hcqprogram (#16157)
* hcq2 rm program

* nonbeauty

* no prog

* tiny

* f

* x
2026-05-12 18:49:13 +03:00
nimlgen
a708542308 fix ci spec (#16156) 2026-05-12 17:57:11 +03:00
nimlgen
e5729935c6 time_call (#16152)
* time_call

* x

* fix caches
2026-05-12 16:58:28 +03:00
qazal
fe39cf148a add Ops.SOURCE test (#16155)
* simple failing test

* raises

* change
2026-05-12 22:49:32 +09:00
qazal
5cd0494b14 viz: canonicalize ast for schedule to codegen linking (#16154)
* simple failing test

* always null device

* viz: canonicalize ast for schedule to codegen linking

* SCACHE
2026-05-12 22:40:21 +09:00
qazal
c1d125ff3b llm: add markers to --benchmark (#16153)
* markers in llm

* ui fix
2026-05-12 20:14:11 +09:00
wozeparrot
e9359d9e7d more llama mp fixes (#16151)
* llama: SPLIT_W13

* llama: fix with no fused kernels

* llama: cast to bf16 on non asm_gemm path

* llama: new mp flags
2026-05-11 21:29:23 -07:00
chenyu
09fd80fba6 fix randperm and _multi_like drop requires_grad (#16150) 2026-05-11 23:23:34 -04:00
George Hotz
8294d105a7 Update the spec in spec.py to match the current state (#16132)
* start work on specv2

* more spec

* more spec

* fix amd emulator

* more spec

* more

* fix test_uop_graph

* move those

* spec=2

* skip those questionable tests

* ptx fix

* more spec=2

* store

* allow custom function in tensor

* spec 2

* fix beam search for tensor cores

* delete the old specs

* fix import
2026-05-11 20:07:47 -07:00
chenyu
3942a80f66 fix wrong kwargs passed into rands (#16149)
working towards explicit args for these
2026-05-11 22:22:06 -04:00
Christopher Milan
039d84ff02 Revert "onnx: deduplicate simple proto parsers" (#16148)
This reverts commit 83eaefcd0f.
2026-05-11 21:45:17 -04:00
Christopher Milan
20f587d5d5 nv: rm _download (#16147) 2026-05-11 19:56:37 -04:00
chenyu
371ab2023f clean up image_dot and image_conv2d (#16145) 2026-05-11 19:37:58 -04:00
Vikram Rangarajan
effa263865 Torch backend aten::cat.out fix (#16121)
* Handle empty 1D tensors in cat_out

* Undid other changes

* Fixed torch cat

* Improved cat.out, added more tests

* Cleaned code

* Type hinted dim

* Removed whitespace
2026-05-11 16:28:16 -07:00