13373 Commits

Author SHA1 Message Date
George Hotz
cf3f67e0e5 more crap 2026-05-21 13:48:22 -07:00
George Hotz
ea3dd000d3 more movement from count 2026-05-21 13:28:05 -07:00
George Hotz
ded5cdf2ea no dtypes.count in renderer, use shape 2026-05-21 12:52:23 -07:00
George Hotz
6815f28849 dtype.vec shapes (#16287)
* dtype.vec shapes

* something

* Closer

* more passes

* shape is in spec

* fix reduce

* image dtype shape correct

* lil

* use reshape on image

* need BUFFER there

* remove that test

* fix ptx + x86

* fix nir

* x86 fix maybe

* x86 fixups

* x86 fix

* don't check that for NOOP
2026-05-21 11:56:49 -07:00
wozeparrot
afc5bfa183 llama: remove fused grad accum (#16301) 2026-05-21 09:38:40 -07:00
nimlgen
a321700baa hcq2: multi prereqs (#16304) 2026-05-21 17:00:52 +03:00
qazal
e33e058d34 set SPLIT_W13=0 for 8b DP by default (#16302) 2026-05-21 22:09:10 +09:00
Christopher Milan
dd279ee25e print dtype decomp warning in DEBUG=2 (#16300) 2026-05-20 22:08:48 -04:00
George Hotz
ec547250ef don't use dtype vec for image idx (#16298)
* don't use dtype vec for image idx

* double gate

* y/x confused

* upd

* fix nir

* simplify_valid_image_load
2026-05-20 18:45:13 -07:00
Christopher Milan
172f9493e1 move is_dtype_supported to renderer (#16226) 2026-05-20 21:19:37 -04:00
chenyu
d548f8d0f3 use clone instead of unique_const in allreduce [pr] (#16297) 2026-05-20 18:58:47 -04:00
qazal
9e88b08f93 x86: don't use id (#16296)
* x86: don't use id

* diff

* more minimal change

* unique
2026-05-21 07:36:40 +09:00
Christopher Milan
da07b28998 am: override smu 13_0_7 to 13_0_0 (#16292) 2026-05-20 18:14:30 -04:00
chenyu
beea4633fc UOp.clone [pr] (#16295)
generates the store after structure
2026-05-20 17:47:49 -04:00
qazal
a19fa2908f fix x86 nondeterminism (#16293) 2026-05-21 05:48:05 +09:00
George Hotz
58d58c1659 remove DEVECTORIZE (#16290)
* remove DEVECTORIZE

* fully remove DEVECTORIZE
2026-05-20 13:25:49 -07:00
wozeparrot
825f30bf18 llama: apply_grad saves memory (#16275) 2026-05-20 13:14:06 -07:00
nimlgen
a88feef40f hcq2: cleanups (#16278)
* s

* simpler

* simler
2026-05-20 21:48:50 +03:00
Philipp Braun
a01d5918af fix: qlinearconv quant params (#16234)
* fix: qlinearconv quant params

* fix: simplify reshape

---------

Co-authored-by: Philipp Braun <braunphilipp@users.noreply.github.com>
2026-05-20 11:31:41 -07:00
George Hotz
19535df53c enable broadcasting in _shape (#16285) 2026-05-20 11:21:51 -07:00
chenyu
4dbe6a2ee7 remove _force_unique from Tensor init (#16277) 2026-05-20 14:13:05 -04:00
Christopher Bradford
fe2d8d1ecf filter by base_class in pci_scan_bus on macOS (#16282)
The Linux path of pci_scan_bus reads /sys/bus/pci/devices/.../class and
skips devices whose base class doesn't match. The macOS (IOKit) path
appended every IOPCIDevice unconditionally, so callers that supplied
base_class to narrow down to e.g. display devices would also get the
audio companion function of a multifunction GPU.

Concretely, an NVIDIA RTX Pro 6000 Blackwell exposes:
  10de:2bb1  class 0x030000 (display)
  10de:22e8  class 0x040300 (multimedia audio)

A PROBE for base_class=3 returned both. With the sorted() at the end of
pci_scan_bus, 22e8 (audio) came first, so the NV runtime picked the
audio function as device 0 and stalled on RESIZE_BAR.

This mirrors the Linux filter on line 70 using the existing read_prop
helper.

Co-authored-by: Christopher Bradford <christopher.bradford@joby.aero>
2026-05-20 20:09:35 +03:00
qazal
1e0fffe256 fused ce llama kernel in UOps (#16263)
* work

* using uops

* delete things

* work

* work

* higher level uops

* cleanups
2026-05-20 19:45:28 +09:00
chenyu
e1715b3b92 extend jit const error to deviceless inputs (#16276) 2026-05-20 02:02:45 -04:00
chenyu
170b857da9 clean up deviceless const _buffer (#16274)
process on CPU similar to multi
2026-05-19 22:47:45 -04:00
chenyu
7af7b6703a relax policy ASSERT_MIN_STEP_TIME to 3.2 (#16273) 2026-05-19 22:29:09 -04:00
chenyu
188d7ec15e clone can take device (#16271)
useful to materialize const on a specific device
2026-05-19 21:29:27 -04:00
wozeparrot
361553c0a8 llama: match flat_llama with model_train (#16269) 2026-05-19 17:25:56 -07:00
George Hotz
da7414d6dc fix RUN_PICKLE and test it (#16272)
* add test for openpilot RUN_PICKLE

* fix RUN_PICKLE and test it
2026-05-19 17:00:25 -07:00
George Hotz
55515747b7 Remove Ops.VCONST (#16267)
* start removing vconst

* remove a lot of vconst

* const folding + strict ordering

* update tests

* spec from minigen

* move that
2026-05-19 16:35:24 -07:00
Christopher Milan
7cdd9cbdeb PYTHONREMU: V_CVT_PK_BF8_F32 saturation (#16268) 2026-05-19 19:29:59 -04:00
Christopher Milan
bb2a51f1ea fix mypy mockgpu and add tinygrad.renderer.isa to packages (#16265) 2026-05-19 16:45:03 -04:00
chenyu
890b731b1e more prerequisite test changed for deviceless const (#16264) 2026-05-19 15:43:45 -04:00
ttomsa
aa1e59ab97 X86 with Ops.INS (#14873)
* draft

* cleanup test_encodings

* cleanup test_isel

* model flag state and support rematerialization

* woops

* add vbroadcastss instruction

* don't fuse load if used multiple times in src

* add movabs instruction and fix idiv

* fixes

* add x86 backend to tests

* float16 fix

* rm TwoAddress2nd

* add BARRIER

* test windows ci

* yup isel fixes the mask stuff too and its beautiful

* add cmoves to the spec

* support storing imms

* no TUPLE_ORDER, breaks tests

* fix remaining seg faults

* add float max

* always fuse index

* minor

* fix DEFINE_VAR/SPECIAL and enable multithreading

* linter

* more linter

* more

* more

* more

* let's try this

* perhaps

* start new scheduler

* more scheduling info

* cleaner shuffle functions

* fixup isel tests

* skip bounds check when NOOPs exist

* skip inf rewrite tests

* fix const tag hack and add x86ops to _shape

* fix

* skip a few tests

* func arg order independent from op value

* x86 goes in own linearize

* switch to PARAM

* more

* add min x86op and neg in decomps

* do mulacc in isel

* use def_reg in test_encodings

* enable emulated int64 tests

* how much does this fix

* Ops becomes OpType

* fix

* rm noqa

* rm machine scheduler stuff

* and this

* allow for extending enums and move X86Ops out of uop

* fix imports

* rm X86GroupOp from ops.py

* spacing

* tell mypy to shut up

* more linter

* add x86op test

* allow set[X86Ops] in upat

* move NOOPs to pre_isel_matcher and rm NOOP from spec

* more asserts

* also this

* cleanup encode

* simplify live range

* fix idiv

* add Ops.INS to x86

* more changes

* more changes

* more changes

* fix

* fix

* fix

* fix

* print formatted assembly

* fix 8bit idiv?

* oops

* enable float16  and unaligned vector load/store

* actually no

* move x86 tests

* no more bool cast

* fix

* linter

* linter

* move X86Ops to x86.py

* fix vpbroadcast

* cleanups

* linter

* print correct reg names

* canonical max

* move max/min and add test

* support float16 vector load/store

* rm bad rewrite

* vpsrldq can't access memory

* regalloc takes renderer

* enable vector load/store on all dtypes

* more isel tests

* rm this for now

* a lot better

* fix

* fix

* fix

* deal with flags correctly

* fix

* enable gep noop rule

* fix

* fix

* fix

* add callee saved registers

* use Ops.CONST instead of X86Ops.IMM

* fix

* enable TUPLE_ORDER

* fix

* rm x86 code in linearizer

* fix

* fix

* fix

* move isa rewrites to codegen

* fix

* fix

* skip test_linearizer.py

* skip more tests

* fix

* fix for idiv/mod changes

* fix

* don't use fmadd if it duplicates fused op

* hacky

* fix

* cleanups

* cleanups

* fix

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-05-19 12:42:54 -07:00
George Hotz
b2e8102209 25000 lines for x86 backend 2026-05-19 11:27:41 -07:00
Sachith Shetty
74567c1958 fix: pass input device to ONNX helper internal tensors (#16242)
* fix: pass input device to onnx methods internal tensors

* test: onnx helper internal tensors use input device
2026-05-19 11:16:33 -07:00
Christopher Milan
a178301dbe PYTHONREMU: fix CDNA VOP3 conditional writes (#16258) 2026-05-19 13:31:31 -04:00
nimlgen
b3dcf8f452 hcq2: split into schedule/realize (#16216)
* hcq2: split into schedule/realize

* missing

* x

* f

* clean

* cleaner

* x

* x

* x

* x

* x
2026-05-19 16:40:17 +03:00
qazal
e4350e7de9 set hipcc mac docker to 7.1 (#16261)
* set hipcc mac docker to 7.1

* pull from amd
2026-05-19 21:30:39 +09:00
George Hotz
a120709671 tighten shape spec for broadcasting (#16206)
* tighten shape spec for broadcasting

* use IndexError, not ValueError

* needs size
2026-05-18 22:12:04 -07:00
George Hotz
3f2d401464 all tests pass with NOOPT=1 (#16257)
* all tests pass with NOOPT=1

* fix a few more

* noopt 100% pass

* noopt 100% pass
2026-05-18 20:39:51 -07:00
chenyu
e694d7f222 more deviceless const prerequisites [pr] (#16256)
* more deviceless const prerequisites [pr]

* remove that

* arange.contiguous -> arange.clone in tests

arange will become deviceless const soon, update tests where it needs to be a buffer
2026-05-18 23:14:12 -04:00
chenyu
c1076ed56c Tensor.device and UOp.device can be None (#16255) 2026-05-18 22:08:10 -04:00
wozeparrot
a3d59faef6 llama: don't save weight (#16252) 2026-05-18 17:05:45 -07:00
qazal
18b102f355 llama: also use 7.1 comgr, update startup_walltime.sh (#16253) 2026-05-19 08:59:02 +09:00
chenyu
d532b4f533 multi alu with deviceless const (#16251) 2026-05-18 19:31:53 -04:00
qazal
98b8a2b407 llama: use hipcc 7.1 version (#16250) 2026-05-19 08:09:57 +09:00
Christopher Milan
7515824a6d ci: actually use clang-20, enable bfloat16 (#16249) 2026-05-18 19:06:43 -04:00
chenyu
754344087a assign for deviceless const source (#16248) 2026-05-18 17:39:53 -04:00
chenyu
73e6b4963b to and shard is noop for deviceless uop (#16247) 2026-05-18 16:11:10 -04:00