Commit Graph

236 Commits

Author SHA1 Message Date
nimlgen
77965a22e5 local optimize as rewrite (#15953)
* local optimize as rewrite

* better

* x

* slightly rename

* fix

* ugh

* remove

* x

* remove

* not weak
2026-04-28 22:51:04 +03:00
nimlgen
4164666c72 programinfo (#15942)
* programinfo

* fix

* m

* x

* x

* changes

* x

* fix

* rm
2026-04-27 23:12:03 +03:00
nimlgen
bb652352c7 remove execitem (#15932)
* remove execitem

* f

* x
2026-04-25 19:33:04 +03:00
nimlgen
768106a542 remove schedule from extra/docs/examples (#15929)
* remove schedule from extra/docs/examples

* f
2026-04-25 14:09:12 +03:00
nimlgen
f2751955cb remove linear_to_schedule from tests (#15912)
* remove linear_to_schedule from tests

* x
2026-04-24 20:02:10 +03:00
chenyu
9192c93b7e Tensor.invalid -> Tensor.invalids (#15849)
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
nimlgen
bfe28ee2ad rm run_schedule (#15847) 2026-04-21 18:14:30 +03:00
wozeparrot
9e60e4a7e7 llama: native fp8 (#15733) 2026-04-16 22:16:05 -07:00
qazal
12c653a743 remove opts arg in get_program, everything uses opts_to_apply [pr] (#15767)
* check Ops.BEAM in process replay

* remove opts from the get_program api

* lint

* simplify

* cleanup
2026-04-16 22:42:43 +03:00
chenyu
3394d18066 size*itemsize -> nbytes (#15729)
and some UOp.size removal to prep for size to mixin change
2026-04-14 16:27:54 -04:00
wozeparrot
55bcd7cc9e llama amax outside (#15670) 2026-04-09 23:08:03 -07:00
George Hotz
48a7627b04 add RDNA4 support to copy WMMA (#15663)
* add RDNA4 support to copy WMMA

* simpler

* simpler

* comment

* assert
2026-04-09 22:48:20 +08:00
George Hotz
1ebeb52e59 RDNA4 asm gemm (#15427)
* sqtt: rdna4 decoder work

* diff cleanup

* more diff

* test

* 125

* r4

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-04-08 21:26:44 +08:00
wozeparrot
70dbd35023 llama: move custom_kernel into flat_llama (#15643) 2026-04-08 00:19:14 -07:00
wozeparrot
7e54992bf6 fp8 llama (#15588)
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
Christopher Milan
0ed8d9271d Renderers accept Target or nothing (#15590) 2026-04-03 01:09:41 -04:00
qazal
fefb0ebc2a gemm/asm: fp8 cleanups (#15580)
* normal gemm here

* s/dtypes.fp8e4m3/FP8_DTYPE

* gemm_bw

* device UOp stays NULL
2026-04-02 19:02:38 +09:00
chenyu
1aa04eab08 simple CreationMixin (#15567)
start with full_like, zeros_like, ones_like
2026-04-01 23:00:56 -04:00
qazal
8feb8edc68 gemm/asm: add fp8 support to cdna asm_gemm (#15542)
* work

* hmm, mixins

* rhs_transposed

* also fix the dtype

* check for hipcc

* Exception

* select dev

* default
2026-03-31 19:32:54 +09:00
George Hotz
85dee83f5d amd flash attention cleanups + emulator fixes (#15431)
* amd flash attention cleanups

* simpler

* params

* fix emulator bugs

* fix idiv bug

* remove that test

* more emu fixes
2026-03-24 10:10:46 +08:00
George Hotz
c62dea6881 ai slop flash attention (it works) (#15401)
* ai slop flash attention (it works)

* speed up, 2 TFLOPS + 7 GB/s

* simpler

* simpler

* optimize

* faster

* warp shuffle

* sqtt: link dispatch to exec (#15396)

* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed

* viz: no context enters in cli, update llama profile (#15404)

* removed unused named arg in rules [pr] (#15414)

* viz: sqtt printer in viz/cli.py (#15411)

* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long

* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)

Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>

* works

* cnt=20

* revert that

* uop slice tests

* simpler

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
George Hotz
c13d9d29ff add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683 minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
George Hotz
1a2a203f48 add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
chenyu
da1700e16b dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e flat llama step work (#15355)
* flat llama step work

* fp8 support

* blacklisted matmul

* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
6e196195d8 add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
qazal
5cd1daa3bc cdna asm_gemm in one file, remove old rdna3 asm (#15281) 2026-03-16 04:32:30 +09:00
George Hotz
06d7cddb33 amd_copy_matmul is cleaner (#15248)
* amd_copy_matmul is cleaner

* it runs

* replicated stuff

* add tid there

* it runs

* cleanup

* x.src[1]

* flatten

* move that

* keep that assert
2026-03-14 12:56:09 +08:00
George Hotz
a7d2429c21 amd_uop_matmul more cleanups (#15240) 2026-03-13 10:24:43 +08:00
George Hotz
e560a46f59 update amd_uop_matmul (#15236)
* update amd_uop_matmul

* use custom kernel

* simpler

* ignore
2026-03-12 17:33:12 +08:00
wozeparrot
c35de9bd68 asm_gemm: support more sharding (#15002) 2026-03-02 23:16:37 -08:00
qazal
62ee976c1b gemm/asm: cleanup repeated patterns to helper functions (#15094) 2026-03-03 08:14:47 +09:00
qazal
448e997be4 gemm/asm: cleanup custom function args (#15007) 2026-02-25 22:05:56 +09:00
qazal
f590564bf7 gemm multiple is only for cdna4 asm (#14814)
* gemm multiple is only for cdna4 asm

* move to backend

* and arch

* path
2026-02-17 14:00:02 +09:00
George Hotz
5bd2862d1a late compile the cdna gemm (#14783)
* late compile the cdna gemm

* remove old things

* finalize inplace

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-17 13:04:22 +09:00
George Hotz
f081f154ae parameterize the CDNA asm gemm (#14813)
* parameterize the CDNA asm gemm

* fix llama test

* fix

* add more gemm tests

* confirm all match

* test these asm gemms
2026-02-17 11:35:18 +08:00
qazal
c7a4dbf918 viz: get program binary from the UOp (#14787)
* viz: get program binary from the UOp

* remove that

* less

* rename View Program to View Source

* two words

* fix
2026-02-16 15:46:58 +09:00
George Hotz
dff9cf35c2 amd asm emulator fixes + run it in CI (#14786)
* amd asm fix, try 2

* fix tests
2026-02-16 13:24:21 +08:00
qazal
55a4dfa2e0 cdna4 asm_gemm tests in CI on the null backend (#14785)
* cdna4 asm_gemm tests in CI on the null backend

* no .numpy() in null

* better

* gemm/asm: device comes from renderer
2026-02-16 14:06:23 +09:00
George Hotz
4088d686b2 remove llvm requirement from amd (#14717)
* remove llvm requirement from amd

* tests pass

* test

* sink kernarg_size

* move stuff

* amd_asm_matmul to new style

* default type

* fix tests, simpler

* cu mode is faster and simpler

* darken
2026-02-13 10:50:12 +08:00
George Hotz
4680247e35 renderer/amd: move in tree (#14702)
* renderer/amd: move in tree

* fix paths in tests

* 24000 lines

* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
befc1e800c assembly/amd: disasm is test only (#14694)
* assembly/amd: disasm is test only

* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
3fab43c57c add cache to asm gemm (#14675) 2026-02-11 08:26:30 +08:00
qazal
80b0119cef llama: add new asm gemm shape (#14611)
* llama: add new asm gemm shape

* work

* cleanup

* half dtype

* more comment
2026-02-10 00:34:29 +09:00
George Hotz
183d38b128 remove CUSTOM_KERNEL / directly construct it (#14604)
* remove CUSTOM_KERNEL / directly construct it

* clean that up

* simpler multi

* custom kernel spec

* remove Kernel

* fix multi

* use sharded shape

* explicit regression test
2026-02-08 18:43:33 +08:00
qazal
cf73d7e2a7 hotfix: disable slower asm gemm shape from llama seqlen 8192 (#14582) 2026-02-06 15:05:19 +09:00
George Hotz
43e7eda4e7 grad_b uses custom gemm (#14550)
* grad_b uses custom gemm

* fix multi backward, acc is in float32

* test_gemm_batched

* square gemm

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9 test asm_gemm in CI (#14551)
* test asm_gemm in CI

* default float16

* use a smaller shape for multi

* smaller size

* smaller for CI

* smaller for ci

* need half
2026-02-05 13:32:22 +09:00
chenyu
d57d24c7d4 Buffer.as_buffer -> Buffer.as_memoryview [pr] (#14535)
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00