1714 Commits

Author SHA1 Message Date
Christopher Milan
d5320a9ddf QCOM cleanups (#15435) 2026-03-23 22:18:38 -04:00
George Hotz
85dee83f5d amd flash attention cleanups + emulator fixes (#15431)
* amd flash attention cleanups

* simpler

* params

* fix emulator bugs

* fix idiv bug

* remove that test

* more emu fixes
2026-03-24 10:10:46 +08:00
qazal
109472c37e sqtt: new s_barrier pickles, handle rdna4 barriers in emulator (#15437) 2026-03-24 03:25:28 +09:00
nimlgen
fa4cdb422e memplan on linears (#15422)
* memplan

* test

* x

* arenas

* correct

* set any size

* ugh

* make hevc happy

* x

* x

* held

* rm old

* del

* x

* fu

* f

* cl

* cl

* ok
2026-03-23 19:50:16 +08:00
George Hotz
c62dea6881 ai slop flash attention (it works) (#15401)
* ai slop flash attention (it works)

* speed up, 2 TFLOPS + 7 GB/s

* simpler

* simpler

* optimize

* faster

* warp shuffle

* sqtt: link dispatch to exec (#15396)

* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed

* viz: no context enters in cli, update llama profile (#15404)

* removed unused named arg in rules [pr] (#15414)

* viz: sqtt printer in viz/cli.py (#15411)

* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long

* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)

Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>

* works

* cnt=20

* revert that

* uop slice tests

* simpler

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
qazal
fd3559103b viz/cli: better error message for empty itrace (#15425) 2026-03-23 15:50:20 +09:00
qazal
c7b18e6108 viz: sqtt printer in viz/cli.py (#15411)
* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long
2026-03-23 00:17:05 +09:00
qazal
2363bceb47 viz: no context enters in cli, update llama profile (#15404) 2026-03-22 05:47:02 +09:00
George Hotz
c13d9d29ff add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683 minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
qazal
71ccc69c52 FP8=1 llama works again, hipcc can run on macos (#15394)
* hipcc macos shim

* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
George Hotz
1a2a203f48 add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
chenyu
c491345766 pass device into Tensor._frompy (#15385)
* pass device into Tensor._frompy

with this, canonicalize_device is the only usage of Device in tensor.py

* export_model.py
2026-03-20 05:09:01 -04:00
chenyu
da1700e16b dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e flat llama step work (#15355)
* flat llama step work

* fp8 support

* blacklisted matmul

* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
70dad9d642 add PING to RemoteCmd (#15371)
* add PING to RemoteCmd

* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
ff004d2114 remote: fix mmio (#15347) 2026-03-18 18:20:39 +08:00
George Hotz
6e196195d8 add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
nimlgen
0315faf938 remote bench (#15331) 2026-03-18 14:03:51 +08:00
wozeparrot
b45edeb965 fix: rand supports large tensors (#15329) 2026-03-17 15:45:41 -07:00
qazal
00817cf65e viz: all tests can run on the NULL device (#15328)
* remove that

* move to test_viz

* get_cfg

* do not use os.environ

* hm

* it's always on NULL

* import renderer

* no import *
2026-03-18 04:14:20 +09:00
nimlgen
0a641ce17d system: remote (#15318)
* system: remote

* listen

* print

* fix

* minor
2026-03-17 19:25:37 +08:00
nimlgen
a50fdb0528 nvcc macos (#15308)
* fix nvcc install macos

* um

* arm

* per

* tm
2026-03-17 17:25:33 +08:00
nimlgen
e1c2d09720 system: rebar to remote devs (#15316) 2026-03-17 16:09:12 +08:00
nimlgen
27e29127b5 system: remote prereqs (#15290)
* x

* new format for apl

* this

* typing

* rpc

* tuple

* linter+new tinygpu
2026-03-16 18:45:41 +08:00
nimlgen
e7705fe311 system: pcidev doesn't care about bars (#15284) 2026-03-16 14:45:43 +08:00
nimlgen
ff0bcc8de0 system: iface p1 changes (#15278) 2026-03-16 10:48:25 +08:00
qazal
4445f50356 viz: variable duration rdna barriers (#15277)
* viz: variable length rdna barriers

* work

* tiny changes

* simple wave simd test

* small wave sync test

* good multi barrier bug find

* simple fix

* wave_sync asserts

* rdna4 work

* more rdna4

* find more bugs in my model

* it's so much simpler

* wave_sync tests duration

* r4

* should just call this rdna4
2026-03-16 06:06:19 +09:00
qazal
5cd1daa3bc cdna asm_gemm in one file, remove old rdna3 asm (#15281) 2026-03-16 04:32:30 +09:00
qazal
7b6211fdd7 sqtt: remove discover_ops script (#15279) 2026-03-15 22:17:06 +09:00
qazal
3858bfc83d sqtt: CDNA inst decodes (#15274)
* sqtt: CDNA inst decodes

* JUMP packets other way

* cdna insts

* r3

* r4

* lds from simd1 and simd2
2026-03-14 21:03:46 +09:00
George Hotz
06d7cddb33 amd_copy_matmul is cleaner (#15248)
* amd_copy_matmul is cleaner

* it runs

* replicated stuff

* add tid there

* it runs

* cleanup

* x.src[1]

* flatten

* move that

* keep that assert
2026-03-14 12:56:09 +08:00
nimlgen
bc16f80b50 am: remove dma_regions param (#15251)
* am: remove dma_regions param

* linter
2026-03-13 18:12:48 +08:00
George Hotz
a7d2429c21 amd_uop_matmul more cleanups (#15240) 2026-03-13 10:24:43 +08:00
George Hotz
e560a46f59 update amd_uop_matmul (#15236)
* update amd_uop_matmul

* use custom kernel

* simpler

* ignore
2026-03-12 17:33:12 +08:00
chenyu
842c978df3 remove staticmethod dtypes.max/min (#15227)
always use x.dtype.max/min
2026-03-11 23:11:24 -04:00
qazal
d3eef70162 viz: render shader clock frequency graph (#15197) 2026-03-12 01:32:49 +09:00
nimlgen
086081e35b tbgpu: add stapler to the script (#15180) 2026-03-07 00:07:27 +03:00
qazal
83f1faa142 sqtt: update CDNA wave packet field, start unskipping tests (#15168)
* correct field names

* packet types

* packet 5 is regc

* test skips
2026-03-06 21:37:44 +09:00
Roelof van Dijk
d65923bda5 tensor.py: add normalize function (#15159)
* tensor.py: add normalize function

* p==0 should match torch
2026-03-05 18:55:53 +08:00
wozeparrot
be23772d43 llama3 fixes part2 (#15150) 2026-03-04 23:43:50 -08:00
qazal
33a1970045 sqtt: simplify inst mapping, validate JUMP processing in CI (#15139)
* jump cleanup

* assert there's a JUMP

* new example for JUMP

* regenerate examples

* rdna4 work

* new packets

* work

* less for branch handling

* less verbose

* fix err message
2026-03-05 09:53:12 +09:00
nimlgen
cdc48da9cd hevc: assert and speed (#15122)
* hevc: assert and speed

* simpler
2026-03-04 19:01:02 +03:00
wozeparrot
4e9b85ecfd fa: pull inputs out of call (#15127) 2026-03-04 03:15:49 -08:00
George Hotz
8ebd24637b fix fa forward building with clang 22 (#15124)
* fix fa forward building with clang 22

* fix: override rocm path

---------

Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
wozeparrot
df23057984 fa: change bwd grid dim + unshuffle using mops (#15068) 2026-03-04 01:23:40 -08:00
qazal
8dd691761d sqtt: remove old files (#15108) 2026-03-03 22:43:24 +09:00
Christopher Milan
de043226ba benchmark comma usbgpu driving_vision step and load time (#15103)
Co-authored-by: Comma Device <device@comma.ai>
2026-03-03 06:08:03 -05:00
wozeparrot
c35de9bd68 asm_gemm: support more sharding (#15002) 2026-03-02 23:16:37 -08:00
qazal
62ee976c1b gemm/asm: cleanup repeated patterns to helper functions (#15094) 2026-03-03 08:14:47 +09:00