qazal
2363bceb47
viz: no context enters in cli, update llama profile ( #15404 )
2026-03-22 05:47:02 +09:00
George Hotz
c13d9d29ff
add SHAPED_WMMA ( #15400 )
* add SHAPED_WMMA
* shaped wmma
* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683
minimal vec in amd_copy_matmul ( #15398 )
* minimal vec in amd_copy_matmul
* unified
* unify
* reshape/permute
* cleanups
* simpler
* move index
* cleanups
* more shared
2026-03-21 14:57:21 +08:00
qazal
71ccc69c52
FP8=1 llama works again, hipcc can run on macos ( #15394 )
* hipcc macos shim
* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
George Hotz
1a2a203f48
add wmma support to amd_copy_matmul ( #15384 )
* add wmma support to amd_copy_matmul
* 15 TFLOPS and merged
* unify
* simpler
* simpler
* simpler
* cleanups
* TM/TN is the full regs
* comments
* WAVES_PER_SH + SQTT_EVENT
* Add WAVERDY support
* no split warp
* 3 range
2026-03-20 19:02:19 +08:00
chenyu
c491345766
pass device into Tensor._frompy ( #15385 )
* pass device into Tensor._frompy
with this, canonicalize_device is the only usage of Device in tensor.py
* export_model.py
2026-03-20 05:09:01 -04:00
chenyu
da1700e16b
dtypes.index -> dtypes.weakint ( #15377 )
2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e
flat llama step work ( #15355 )
* flat llama step work
* fp8 support
* blacklisted matmul
* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
70dad9d642
add PING to RemoteCmd ( #15371 )
* add PING to RemoteCmd
* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
ff004d2114
remote: fix mmio ( #15347 )
2026-03-18 18:20:39 +08:00
George Hotz
6e196195d8
add test for flat llama ( #15327 )
* add test for flat llama
* simpler
* back to split w1/w3
* env
* still too much ram
* invalid
2026-03-18 15:16:33 +08:00
nimlgen
0315faf938
remote bench ( #15331 )
2026-03-18 14:03:51 +08:00
wozeparrot
b45edeb965
fix: rand supports large tensors ( #15329 )
2026-03-17 15:45:41 -07:00
qazal
00817cf65e
viz: all tests can run on the NULL device ( #15328 )
* remove that
* move to test_viz
* get_cfg
* do not use os.environ
* hm
* it's always on NULL
* import renderer
* no import *
2026-03-18 04:14:20 +09:00
nimlgen
0a641ce17d
system: remote ( #15318 )
* system: remote
* listen
* print
* fix
* minor
2026-03-17 19:25:37 +08:00
nimlgen
a50fdb0528
nvcc macos ( #15308 )
* fix nvcc install macos
* um
* arm
* per
* tm
2026-03-17 17:25:33 +08:00
nimlgen
e1c2d09720
system: rebar to remote devs ( #15316 )
2026-03-17 16:09:12 +08:00
nimlgen
27e29127b5
system: remote prereqs ( #15290 )
* x
* new format for apl
* this
* typing
* rpc
* tuple
* linter+new tinygpu
2026-03-16 18:45:41 +08:00
nimlgen
e7705fe311
system: pcidev doesn't care about bars ( #15284 )
2026-03-16 14:45:43 +08:00
nimlgen
ff0bcc8de0
system: iface p1 changes ( #15278 )
2026-03-16 10:48:25 +08:00
qazal
4445f50356
viz: variable duration rdna barriers ( #15277 )
* viz: variable length rdna barriers
* work
* tiny changes
* simple wave simd test
* small wave sync test
* good multi barrier bug find
* simple fix
* wave_sync asserts
* rdna4 work
* more rdna4
* find more bugs in my model
* it's so much simpler
* wave_sync tests duration
* r4
* should just call this rdna4
2026-03-16 06:06:19 +09:00
qazal
5cd1daa3bc
cdna asm_gemm in one file, remove old rdna3 asm ( #15281 )
2026-03-16 04:32:30 +09:00
qazal
7b6211fdd7
sqtt: remove discover_ops script ( #15279 )
2026-03-15 22:17:06 +09:00
qazal
3858bfc83d
sqtt: CDNA inst decodes ( #15274 )
* sqtt: CDNA inst decodes
* JUMP packets other way
* cdna insts
* r3
* r4
* lds from simd1 and simd2
2026-03-14 21:03:46 +09:00
George Hotz
06d7cddb33
amd_copy_matmul is cleaner ( #15248 )
* amd_copy_matmul is cleaner
* it runs
* replicated stuff
* add tid there
* it runs
* cleanup
* x.src[1]
* flatten
* move that
* keep that assert
2026-03-14 12:56:09 +08:00
nimlgen
bc16f80b50
am: remove dma_regions param ( #15251 )
* am: remove dma_regions param
* linter
2026-03-13 18:12:48 +08:00
George Hotz
a7d2429c21
amd_uop_matmul more cleanups ( #15240 )
2026-03-13 10:24:43 +08:00
George Hotz
e560a46f59
update amd_uop_matmul ( #15236 )
* update amd_uop_matmul
* use custom kernel
* simpler
* ignore
2026-03-12 17:33:12 +08:00
chenyu
842c978df3
remove staticmethod dtypes.max/min ( #15227 )
always use x.dtype.max/min
2026-03-11 23:11:24 -04:00
qazal
d3eef70162
viz: render shader clock frequency graph ( #15197 )
2026-03-12 01:32:49 +09:00
nimlgen
086081e35b
tbgpu: add stapler to the script ( #15180 )
2026-03-07 00:07:27 +03:00
qazal
83f1faa142
sqtt: update CDNA wave packet field, start unskipping tests ( #15168 )
* correct field names
* packet types
* packet 5 is regc
* test skips
2026-03-06 21:37:44 +09:00
Roelof van Dijk
d65923bda5
tensor.py: add normalize function ( #15159 )
* tensor.py: add normalize function
* p==0 should match torch
2026-03-05 18:55:53 +08:00
wozeparrot
be23772d43
llama3 fixes part2 ( #15150 )
2026-03-04 23:43:50 -08:00
qazal
33a1970045
sqtt: simplify inst mapping, validate JUMP processing in CI ( #15139 )
* jump cleanup
* assert there's a JUMP
* new example for JUMP
* regenerate examples
* rdna4 work
* new packets
* work
* less for branch handling
* less verbose
* fix err message
2026-03-05 09:53:12 +09:00
nimlgen
cdc48da9cd
hevc: assert and speed ( #15122 )
* hevc: assert and speed
* simpler
2026-03-04 19:01:02 +03:00
wozeparrot
4e9b85ecfd
fa: pull inputs out of call ( #15127 )
2026-03-04 03:15:49 -08:00
George Hotz
8ebd24637b
fix fa forward building with clang 22 ( #15124 )
* fix fa forward building with clang 22
* fix: override rocm path
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
wozeparrot
df23057984
fa: change bwd grid dim + unshuffle using mops ( #15068 )
2026-03-04 01:23:40 -08:00
qazal
8dd691761d
sqtt: remove old files ( #15108 )
2026-03-03 22:43:24 +09:00
Christopher Milan
de043226ba
benchmark comma usbgpu driving_vision step and load time ( #15103 )
Co-authored-by: Comma Device <device@comma.ai>
2026-03-03 06:08:03 -05:00
wozeparrot
c35de9bd68
asm_gemm: support more sharding ( #15002 )
2026-03-02 23:16:37 -08:00
qazal
62ee976c1b
gemm/asm: cleanup repeated patterns to helper functions ( #15094 )
2026-03-03 08:14:47 +09:00
nimlgen
dfa180413d
tbgpu: sign nv ( #15087 )
2026-03-02 22:58:30 +03:00
chenyu
71f228f80f
test exact kernel count in torch_backend/test_kernel_fusion ( #15091 )
2026-03-02 14:26:32 -05:00
qazal
f7aeff6061
viz: cli.py cleanups, do not require PYTHONPATH ( #15085 )
* cleanup the print
* sys.exit
* equal check
* cleanup unpacker
* cli doesn't need PYTHONPATH
* no semicolons
* %s/PYTHONPATH=. //g
2026-03-02 19:24:38 +09:00
qazal
b8a55d5f68
sqtt: new packet types, add discovery script ( #14960 )
2026-02-28 04:27:27 +09:00
qazal
448e997be4
gemm/asm: cleanup custom function args ( #15007 )
2026-02-25 22:05:56 +09:00
wozeparrot
8d9545e09e
llama3: correctly shard wqkv ( #14978 )
2026-02-23 23:57:10 -08:00
wozeparrot
25565b2410
fa: test for mp ( #14907 )
2026-02-22 21:47:36 -08:00