wozeparrot
|
528d35e306
|
llama speed 4 (#15993)
|
2026-04-30 17:14:41 -07:00 |
|
chenyu
|
9192c93b7e
|
Tensor.invalid -> Tensor.invalids (#15849)
matches ones and zeros, and to not share name with UOp.invalid
|
2026-04-21 11:19:51 -04:00 |
|
wozeparrot
|
9e60e4a7e7
|
llama: native fp8 (#15733)
|
2026-04-16 22:16:05 -07:00 |
|
wozeparrot
|
457508d5a0
|
llama: save more 2 (#15681)
|
2026-04-11 01:03:36 -07:00 |
|
wozeparrot
|
7e54992bf6
|
fp8 llama (#15588)
Co-authored-by: qazal <qazal.software@gmail.com>
|
2026-04-04 18:24:57 -07:00 |
|
Christopher Milan
|
645d45d968
|
DEV has arch (#15577)
Co-authored-by: Comma Device <device@comma.ai>
|
2026-04-03 19:17:19 -04:00 |
|
qazal
|
8feb8edc68
|
gemm/asm: add fp8 support to cdna asm_gemm (#15542)
* work
* hmm, mixins
* rhs_transposed
* also fix the dtype
* check for hipcc
* Exception
* select dev
* default
|
2026-03-31 19:32:54 +09:00 |
|
George Hotz
|
6e196195d8
|
add test for flat llama (#15327)
* add test for flat llama
* simpler
* back to split w1/w3
* env
* still too much ram
* invalid
|
2026-03-18 15:16:33 +08:00 |
|
wozeparrot
|
be23772d43
|
llama3 fixes part2 (#15150)
|
2026-03-04 23:43:50 -08:00 |
|
wozeparrot
|
4e9b85ecfd
|
fa: pull inputs out of call (#15127)
|
2026-03-04 03:15:49 -08:00 |
|
George Hotz
|
8ebd24637b
|
fix fa forward building with clang 22 (#15124)
* fix fa forward building with clang 22
* fix: override rocm path
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
|
2026-03-04 02:32:25 -08:00 |
|
wozeparrot
|
df23057984
|
fa: change bwd grid dim + unshuffle using mops (#15068)
|
2026-03-04 01:23:40 -08:00 |
|
wozeparrot
|
25565b2410
|
fa: test for mp (#14907)
|
2026-02-22 21:47:36 -08:00 |
|
wozeparrot
|
9317e96881
|
fa: explicitly pass shapes (#14857)
|
2026-02-19 05:26:16 -08:00 |
|
wozeparrot
|
45aebe1572
|
hipkittens fa backward (#14723)
|
2026-02-16 00:38:44 -08:00 |
|
George Hotz
|
ac079e43d7
|
ElementwiseMixin (#14777)
|
2026-02-16 08:50:47 +08:00 |
|
qazal
|
33b31d9cd6
|
tinykittens flash attention dtype fix, add CI (#14770)
* don't hardcode amd device
* add failing tests, ci too
* fix: fix for dtype mixin
* bump to rocm 7.1
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
|
2026-02-16 01:15:11 +09:00 |
|
wozeparrot
|
0613c0ac0c
|
hipkittens fa forward (#14692)
|
2026-02-12 20:16:43 -08:00 |
|
wozeparrot
|
f73468d516
|
fa: block skipping for fa kv bwd (#14569)
|
2026-02-05 16:13:53 -08:00 |
|
wozeparrot
|
c1ea6687e5
|
fa: simpler is faster (#14548)
|
2026-02-05 01:13:17 -08:00 |
|
wozeparrot
|
bbcd3d67a3
|
fa: faster (#14453)
|
2026-02-02 21:34:17 -08:00 |
|
wozeparrot
|
c2fb8b208f
|
fa: 32 block size (#14416)
|
2026-01-29 13:59:13 -08:00 |
|
wozeparrot
|
d74587f16d
|
fa multi fix 2 (#14314)
|
2026-01-23 23:35:02 -08:00 |
|
wozeparrot
|
76a9242a66
|
fa: merge kv bwd into one kernel (#14277)
|
2026-01-21 15:24:41 -08:00 |
|
wozeparrot
|
1f89eaf790
|
tk: fa bert mask fix + some numerical stability improvements (#14214)
|
2026-01-19 19:18:07 -08:00 |
|
wozeparrot
|
a879b54234
|
tk: fa jit fix (#14170)
|
2026-01-16 16:38:45 -08:00 |
|
wozeparrot
|
7e5687f6a3
|
more fa multi fix (#14152)
|
2026-01-14 13:57:11 -08:00 |
|
wozeparrot
|
a92778aa0c
|
tk: fa multi fix (#14134)
|
2026-01-13 17:22:15 -08:00 |
|
wozeparrot
|
2b3e01e79c
|
tk: support sliced local -> reg load (#14034)
|
2026-01-06 05:33:24 -05:00 |
|
wozeparrot
|
21d0f6bb76
|
tk: flat global -> local load (#14033)
|
2026-01-05 23:35:53 -08:00 |
|
wozeparrot
|
6242a9d151
|
tk: no global copy and clear ranges (#13988)
|
2026-01-02 23:45:15 -08:00 |
|
wozeparrot
|
9f082e8e25
|
fa: split kv bwd into 2 kernels (#13981)
|
2026-01-02 18:45:51 -08:00 |
|
wozeparrot
|
b27527f05a
|
fix: missed inner tracked range (#13964)
|
2026-01-01 18:09:57 -08:00 |
|
wozeparrot
|
ecbac8a338
|
tk: fa cleanups + causal test (#13963)
|
2026-01-01 18:05:00 -08:00 |
|
chenyu
|
80b84f5267
|
ruff lint tinykitten (#13762)
deleted unused import and double spaces. a few ignores to not change the real code
|
2025-12-19 14:31:00 -05:00 |
|
wozeparrot
|
99e667bdcd
|
tk fa bwd (#13480)
|
2025-12-17 23:56:37 -08:00 |
|
wozeparrot
|
5151a341b3
|
tk: small changes from fa bwd (#13732)
|
2025-12-16 22:44:36 -08:00 |
|
wozeparrot
|
5d509499b2
|
tk: kernel finish groups stores (#13704)
|
2025-12-15 09:16:17 -08:00 |
|
wozeparrot
|
7ef7ce2856
|
tk reg local store (#13689)
|
2025-12-14 23:07:30 -08:00 |
|
wozeparrot
|
8f60b8dd1e
|
fix: cast on transpose (#13653)
|
2025-12-11 21:03:49 -08:00 |
|
wozeparrot
|
89c4206e22
|
fix: typing (#13614)
|
2025-12-07 20:10:30 -08:00 |
|
wozeparrot
|
93f1baca77
|
feat: tk fa in tensor (#13580)
|
2025-12-05 14:36:29 -08:00 |
|
wozeparrot
|
62e2fc5108
|
tk: global load/store rv (#13577)
|
2025-12-04 17:23:48 -08:00 |
|
wozeparrot
|
1b7dbfb37f
|
tk: named kernels + per kernel range id (#13522)
|
2025-12-01 22:51:04 -08:00 |
|
wozeparrot
|
ffc31a23f4
|
tk mi350 (#13288)
|
2025-11-25 15:49:44 -08:00 |
|
wozeparrot
|
f46bc31156
|
tk: start and step in range (#13442)
|
2025-11-24 15:43:24 -08:00 |
|
wozeparrot
|
56b2540349
|
tk: keep extra tile data by replacing uop (#13370)
|
2025-11-19 15:11:43 -08:00 |
|
wozeparrot
|
be72b78dcb
|
tk: small fixes (#13345)
* fix: handle case where final uop isn't a tk wrapped one
* clean: remove after from mma
|
2025-11-19 00:58:50 -08:00 |
|
wozeparrot
|
33773fda87
|
tk initial mi350 (#13289)
|
2025-11-17 11:46:32 -08:00 |
|
wozeparrot
|
ef42334239
|
tk: load store cleanup (#13290)
|
2025-11-15 17:08:23 -08:00 |
|