wozeparrot
2bdc360606
gemm: mxfp8 hipkittens gemm ( #16541 )
* gemm: mxfp8 hipkittens gemm
* feat: update hipkittens
* feat: kernel signature
* clean: just kernel
* feat: from tinygrad
* feat: test
* fix: add back utils
* clean: no diff
* clean: no diff
2026-06-09 15:20:05 -07:00
wozeparrot
5ef30005fa
update hipkittens ( #16544 )
2026-06-08 18:53:25 -07:00
qazal
3b1a5f9770
llama: a_bT and aT_b bf16 gemms ( #16487 )
* hk_bf16_gemm
* enable in 8b
* cleanups
* rename to USE_HK_BF16_GEMM
* work
* work
* work
* work
* change the gemms
* work
* work
* set as default
* work
* change
2026-06-04 23:30:21 +09:00
qazal
bfb2d1f89a
Revert "fp8 gemm speedup ( #16236 )" ( #16245 )
This reverts commit d95bf394e1.
2026-05-19 02:01:44 +09:00
chenyu
dcee90aa3f
remove requires_grad use in extra/examples ( #16238 )
except the ones fed into optimizer
2026-05-16 18:40:26 -04:00
qazal
d95bf394e1
fp8 gemm speedup ( #16236 )
* add asm_gemm option
* milestone
* work
* edit
* only the fast kernel
* diff
2026-05-17 04:58:28 +09:00
wozeparrot
528d35e306
llama speed 4 ( #15993 )
2026-04-30 17:14:41 -07:00
chenyu
9192c93b7e
Tensor.invalid -> Tensor.invalids ( #15849 )
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
wozeparrot
9e60e4a7e7
llama: native fp8 ( #15733 )
2026-04-16 22:16:05 -07:00
wozeparrot
457508d5a0
llama: save more 2 ( #15681 )
2026-04-11 01:03:36 -07:00
wozeparrot
7e54992bf6
fp8 llama ( #15588 )
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
Christopher Milan
645d45d968
DEV has arch ( #15577 )
Co-authored-by: Comma Device <device@comma.ai>
2026-04-03 19:17:19 -04:00
qazal
8feb8edc68
gemm/asm: add fp8 support to cdna asm_gemm ( #15542 )
* work
* hmm, mixins
* rhs_transposed
* also fix the dtype
* check for hipcc
* Exception
* select dev
* default
2026-03-31 19:32:54 +09:00
George Hotz
6e196195d8
add test for flat llama ( #15327 )
* add test for flat llama
* simpler
* back to split w1/w3
* env
* still too much ram
* invalid
2026-03-18 15:16:33 +08:00
wozeparrot
be23772d43
llama3 fixes part2 ( #15150 )
2026-03-04 23:43:50 -08:00
wozeparrot
4e9b85ecfd
fa: pull inputs out of call ( #15127 )
2026-03-04 03:15:49 -08:00
George Hotz
8ebd24637b
fix fa forward building with clang 22 ( #15124 )
* fix fa forward building with clang 22
* fix: override rocm path
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
wozeparrot
df23057984
fa: change bwd grid dim + unshuffle using mops ( #15068 )
2026-03-04 01:23:40 -08:00
wozeparrot
25565b2410
fa: test for mp ( #14907 )
2026-02-22 21:47:36 -08:00
wozeparrot
9317e96881
fa: explicitly pass shapes ( #14857 )
2026-02-19 05:26:16 -08:00
wozeparrot
45aebe1572
hipkittens fa backward ( #14723 )
2026-02-16 00:38:44 -08:00
George Hotz
ac079e43d7
ElementwiseMixin ( #14777 )
2026-02-16 08:50:47 +08:00
qazal
33b31d9cd6
tinykittens flash attention dtype fix, add CI ( #14770 )
* don't hardcode amd device
* add failing tests, ci too
* fix: fix for dtype mixin
* bump to rocm 7.1
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-02-16 01:15:11 +09:00
wozeparrot
0613c0ac0c
hipkittens fa forward ( #14692 )
2026-02-12 20:16:43 -08:00
wozeparrot
f73468d516
fa: block skipping for fa kv bwd ( #14569 )
2026-02-05 16:13:53 -08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
wozeparrot
c2fb8b208f
fa: 32 block size ( #14416 )
2026-01-29 13:59:13 -08:00
wozeparrot
d74587f16d
fa multi fix 2 ( #14314 )
2026-01-23 23:35:02 -08:00
wozeparrot
76a9242a66
fa: merge kv bwd into one kernel ( #14277 )
2026-01-21 15:24:41 -08:00
wozeparrot
1f89eaf790
tk: fa bert mask fix + some numerical stability improvements ( #14214 )
2026-01-19 19:18:07 -08:00
wozeparrot
a879b54234
tk: fa jit fix ( #14170 )
2026-01-16 16:38:45 -08:00
wozeparrot
7e5687f6a3
more fa multi fix ( #14152 )
2026-01-14 13:57:11 -08:00
wozeparrot
a92778aa0c
tk: fa multi fix ( #14134 )
2026-01-13 17:22:15 -08:00
wozeparrot
2b3e01e79c
tk: support sliced local -> reg load ( #14034 )
2026-01-06 05:33:24 -05:00
wozeparrot
21d0f6bb76
tk: flat global -> local load ( #14033 )
2026-01-05 23:35:53 -08:00
wozeparrot
6242a9d151
tk: no global copy and clear ranges ( #13988 )
2026-01-02 23:45:15 -08:00
wozeparrot
9f082e8e25
fa: split kv bwd into 2 kernels ( #13981 )
2026-01-02 18:45:51 -08:00
wozeparrot
b27527f05a
fix: missed inner tracked range ( #13964 )
2026-01-01 18:09:57 -08:00
wozeparrot
ecbac8a338
tk: fa cleanups + causal test ( #13963 )
2026-01-01 18:05:00 -08:00
chenyu
80b84f5267
ruff lint tinykitten ( #13762 )
deleted unused import and double spaces. a few ignore to not change the real code
2025-12-19 14:31:00 -05:00
wozeparrot
99e667bdcd
tk fa bwd ( #13480 )
2025-12-17 23:56:37 -08:00
wozeparrot
5151a341b3
tk: small changes from fa bwd ( #13732 )
2025-12-16 22:44:36 -08:00
wozeparrot
5d509499b2
tk: kernel finish groups stores ( #13704 )
2025-12-15 09:16:17 -08:00
wozeparrot
7ef7ce2856
tk reg local store ( #13689 )
2025-12-14 23:07:30 -08:00
wozeparrot
8f60b8dd1e
fix: cast on transpose ( #13653 )
2025-12-11 21:03:49 -08:00
wozeparrot
89c4206e22
fix: typing ( #13614 )
2025-12-07 20:10:30 -08:00
wozeparrot
93f1baca77
feat: tk fa in tensor ( #13580 )
2025-12-05 14:36:29 -08:00
wozeparrot
62e2fc5108
tk: global load/store rv ( #13577 )
2025-12-04 17:23:48 -08:00
wozeparrot
1b7dbfb37f
tk: named kernels + per kernel range id ( #13522 )
2025-12-01 22:51:04 -08:00