* ai slop flash attention (it works)
* speed up, 2 TFLOPS + 7 GB/s
* simpler
* simpler
* optimize
* faster
* warp shuffle
* sqtt: link dispatch to exec (#15396)
* sqtt packet linking infra
* python
* javascript
* ~doubly linked list
* ui works
* work
* exec can also highlight the pc, coloring work
* more work
* rm sqtt/model.py, doesn't need to be upstreamed
* viz: no context enters in cli, update llama profile (#15404)
* removed unused named arg in rules [pr] (#15414)
* viz: sqtt printer in viz/cli.py (#15411)
* work
* sqtt timeline in CLI
* format all printers nicely
* s/Showed/Printed
* ansistrip
* sys.exit
* keep colors in list
* work from amd_copy_matmul
* has_more always gets returned
* linter
* don't print colors
* more colors
* wow this is so deep
* work
* minor details
* selected
* improve progress bar
* remove it
* 22, global_load_vaddr is so long
* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb
Co-authored-by: Amp <amp@ampcode.com>
* works
* cnt=20
* revert that
* uop slice tests
* simpler
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
* add wmma support to amd_copy_matmul
* 15 TFLOPS and merged
* unify
* simpler
* simpler
* simpler
* cleanups
* TM/TN is the full regs
* comments
* WAVES_PER_SH + SQTT_EVENT
* Add WAVERDY support
* no split warp
* 3 range
* viz: variable length rdna barriers
* work
* tiny changes
* simple wave simd test
* small wave sync test
* good multi barrier bug find
* simple fix
* wave_sync asserts
* rdna4 work
* more rdna4
* find more bugs in my model
* it's so much simpler
* wave_sync tests duration
* r4
* should just call this rdna4
* jump cleanup
* assert there's a JUMP
* new example for JUMP
* regenerate examples
* rdna4 work
* new packets
* work
* less for branch handling
* less verbose
* fix err message