72 Commits

Author SHA1 Message Date
wozeparrot
2bdc360606 gemm: mxfp8 hipkittens gemm (#16541)
* gemm: mxfp8 hipkittens gemm

* feat: update hipkittens

* feat: kernel signature

* clean: just kernel

* feat: from tinygrad

* feat: test

* fix: add back utils

* clean: no diff

* clean: no diff
2026-06-09 15:20:05 -07:00
wozeparrot
5ef30005fa update hipkittens (#16544) 2026-06-08 18:53:25 -07:00
qazal
3b1a5f9770 llama: a_bT and aT_b bf16 gemms (#16487)
* hk_bf16_gemm

* enable in 8b

* cleanups

* rename to USE_HK_BF16_GEMM

* work

* work

* work

* work

* change the gemms

* work

* work

* set as default

* work

* change
2026-06-04 23:30:21 +09:00
qazal
bfb2d1f89a Revert "fp8 gemm speedup (#16236)" (#16245)
This reverts commit d95bf394e1.
2026-05-19 02:01:44 +09:00
chenyu
dcee90aa3f remove requires_grad use in extra/examples (#16238)
except the ones fed into optimizer
2026-05-16 18:40:26 -04:00
qazal
d95bf394e1 fp8 gemm speedup (#16236)
* add asm_gemm option

* milestone

* work

* edit

* only the fast kernel

* diff
2026-05-17 04:58:28 +09:00
wozeparrot
528d35e306 llama speed 4 (#15993) 2026-04-30 17:14:41 -07:00
chenyu
9192c93b7e Tensor.invalid -> Tensor.invalids (#15849)
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
wozeparrot
9e60e4a7e7 llama: native fp8 (#15733) 2026-04-16 22:16:05 -07:00
wozeparrot
457508d5a0 llama: save more 2 (#15681) 2026-04-11 01:03:36 -07:00
wozeparrot
7e54992bf6 fp8 llama (#15588)
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
Christopher Milan
645d45d968 DEV has arch (#15577)
Co-authored-by: Comma Device <device@comma.ai>
2026-04-03 19:17:19 -04:00
qazal
8feb8edc68 gemm/asm: add fp8 support to cdna asm_gemm (#15542)
* work

* hmm, mixins

* rhs_transposed

* also fix the dtype

* check for hipcc

* Exception

* select dev

* default
2026-03-31 19:32:54 +09:00
George Hotz
6e196195d8 add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
wozeparrot
be23772d43 llama3 fixes part2 (#15150) 2026-03-04 23:43:50 -08:00
wozeparrot
4e9b85ecfd fa: pull inputs out of call (#15127) 2026-03-04 03:15:49 -08:00
George Hotz
8ebd24637b fix fa forward building with clang 22 (#15124)
* fix fa forward building with clang 22

* fix: override rocm path

---------

Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
wozeparrot
df23057984 fa: change bwd grid dim + unshuffle using mops (#15068) 2026-03-04 01:23:40 -08:00
wozeparrot
25565b2410 fa: test for mp (#14907) 2026-02-22 21:47:36 -08:00
wozeparrot
9317e96881 fa: explicitly pass shapes (#14857) 2026-02-19 05:26:16 -08:00
wozeparrot
45aebe1572 hipkittens fa backward (#14723) 2026-02-16 00:38:44 -08:00
George Hotz
ac079e43d7 ElementwiseMixin (#14777) 2026-02-16 08:50:47 +08:00
qazal
33b31d9cd6 tinykittens flash attention dtype fix, add CI (#14770)
* don't hardcode amd device

* add failing tests, ci too

* fix: fix for dtype mixin

* bump to rocm 7.1

---------

Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-02-16 01:15:11 +09:00
wozeparrot
0613c0ac0c hipkittens fa forward (#14692) 2026-02-12 20:16:43 -08:00
wozeparrot
f73468d516 fa: block skipping for fa kv bwd (#14569) 2026-02-05 16:13:53 -08:00
wozeparrot
c1ea6687e5 fa: simpler is faster (#14548) 2026-02-05 01:13:17 -08:00
wozeparrot
bbcd3d67a3 fa: faster (#14453) 2026-02-02 21:34:17 -08:00
wozeparrot
c2fb8b208f fa: 32 block size (#14416) 2026-01-29 13:59:13 -08:00
wozeparrot
d74587f16d fa multi fix 2 (#14314) 2026-01-23 23:35:02 -08:00
wozeparrot
76a9242a66 fa: merge kv bwd into one kernel (#14277) 2026-01-21 15:24:41 -08:00
wozeparrot
1f89eaf790 tk: fa bert mask fix + some numerical stability improvements (#14214) 2026-01-19 19:18:07 -08:00
wozeparrot
a879b54234 tk: fa jit fix (#14170) 2026-01-16 16:38:45 -08:00
wozeparrot
7e5687f6a3 more fa multi fix (#14152) 2026-01-14 13:57:11 -08:00
wozeparrot
a92778aa0c tk: fa multi fix (#14134) 2026-01-13 17:22:15 -08:00
wozeparrot
2b3e01e79c tk: support sliced local -> reg load (#14034) 2026-01-06 05:33:24 -05:00
wozeparrot
21d0f6bb76 tk: flat global -> local load (#14033) 2026-01-05 23:35:53 -08:00
wozeparrot
6242a9d151 tk: no global copy and clear ranges (#13988) 2026-01-02 23:45:15 -08:00
wozeparrot
9f082e8e25 fa: split kv bwd into 2 kernels (#13981) 2026-01-02 18:45:51 -08:00
wozeparrot
b27527f05a fix: missed inner tracked range (#13964) 2026-01-01 18:09:57 -08:00
wozeparrot
ecbac8a338 tk: fa cleanups + causal test (#13963) 2026-01-01 18:05:00 -08:00
chenyu
80b84f5267 ruff lint tinykitten (#13762)
deleted unused import and double spaces. a few ignores to not change the real code
2025-12-19 14:31:00 -05:00
wozeparrot
99e667bdcd tk fa bwd (#13480) 2025-12-17 23:56:37 -08:00
wozeparrot
5151a341b3 tk: small changes from fa bwd (#13732) 2025-12-16 22:44:36 -08:00
wozeparrot
5d509499b2 tk: kernel finish groups stores (#13704) 2025-12-15 09:16:17 -08:00
wozeparrot
7ef7ce2856 tk reg local store (#13689) 2025-12-14 23:07:30 -08:00
wozeparrot
8f60b8dd1e fix: cast on transpose (#13653) 2025-12-11 21:03:49 -08:00
wozeparrot
89c4206e22 fix: typing (#13614) 2025-12-07 20:10:30 -08:00
wozeparrot
93f1baca77 feat: tk fa in tensor (#13580) 2025-12-05 14:36:29 -08:00
wozeparrot
62e2fc5108 tk: global load/store rv (#13577) 2025-12-04 17:23:48 -08:00
wozeparrot
1b7dbfb37f tk: named kernels + per kernel range id (#13522) 2025-12-01 22:51:04 -08:00