tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-06-15 09:33:03 +08:00

Author	SHA1	Message	Date
chenyu	3394d18066	size*itemsize -> nbytes (#15729 ) and some UOp.size removal to prep for size to mixin change	2026-04-14 16:27:54 -04:00
wozeparrot	55bcd7cc9e	llama amax outside (#15670 )	2026-04-09 23:08:03 -07:00
George Hotz	48a7627b04	add RDNA4 support to copy WMMA (#15663 ) * add RDNA4 supportt to copy WMMA * simpler * simpler * comment * assert	2026-04-09 22:48:20 +08:00
George Hotz	1ebeb52e59	RDNA4 asm gemm (#15427 ) * sqtt: rdna4 decoder work * diff cleanup * more diff * test * 125 * r4 --------- Co-authored-by: qazal <qazal.software@gmail.com> Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2026-04-08 21:26:44 +08:00
wozeparrot	70dbd35023	llama: move custom_kernel into flat_llama (#15643 )	2026-04-08 00:19:14 -07:00
wozeparrot	7e54992bf6	fp8 llama (#15588 ) Co-authored-by: qazal <qazal.software@gmail.com>	2026-04-04 18:24:57 -07:00
Christopher Milan	0ed8d9271d	Renderers accept Target or nothing (#15590 )	2026-04-03 01:09:41 -04:00
qazal	fefb0ebc2a	gemm/asm: fp8 cleanups (#15580 ) * normal gemm here * s/dtypes.fp8e4m3/FP8_DTYPE * gemm_bw * device UOp stays NULL	2026-04-02 19:02:38 +09:00
chenyu	1aa04eab08	simple CreationMixin (#15567 ) start with full_like, zeros_like, ones_like	2026-04-01 23:00:56 -04:00
qazal	8feb8edc68	gemm/asm: add fp8 support to cdna asm_gemm (#15542 ) * work * hmm, mixins * rhs_transposed * also fix the dtype * check for hipcc * Exception * select dev * default	2026-03-31 19:32:54 +09:00
George Hotz	85dee83f5d	amd flash attention cleanups + emulator fixes (#15431 ) * amd flash attention cleanups * simpler * params * fix emulator bugs * fix idiv bug * remove that test * more emu fixes	2026-03-24 10:10:46 +08:00
George Hotz	c62dea6881	ai slop flash attention (it works) (#15401 ) * ai slop flash attention (it works) * speed up, 2 TFLOPS + 7 GB/s * simpler * simpler * optimize * faster * warp shuffle * sqtt: link dispatch to exec (#15396) * sqtt packet linking infra python * javascript * ~doubly linked list * ui works * work * exec can also highlight the pc, coloring work * more work * rm sqtt/model.py, doesn't need to be upstreamed * viz: no context enters in cli, update llama profile (#15404) * removed unused named arg in rules [pr] (#15414) * viz: sqtt printer in viz/cli.py (#15411) * work * sqtt timeline in CLI * format all printers nicely * s/Showed/Printed * ansistrip * sys.exit * keep colors in list * work from amd_copy_matmul * has_more always gets returned * linter * don't print colors * more colors * wow this is so deep * work * minor details * selected * improve progress bar * remove it * 22, global_load_vaddr is so long * remove 0 hack in sign, gradient materializes zeros for unconnected nodes (#15416) Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb Co-authored-by: Amp <amp@ampcode.com> works * cnt=20 * revert that * uop slice tests * simpler --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: gg <ggordbegli@gmail.com> Co-authored-by: Amp <amp@ampcode.com>	2026-03-23 16:15:10 +08:00
George Hotz	c13d9d29ff	add SHAPED_WMMA (#15400 ) * add SHAPED_WMMA * shaped wmma * less bad	2026-03-21 16:16:03 +08:00
George Hotz	41a9b09683	minimal vec in amd_copy_matmul (#15398 ) * minimal vec in amd_copy_matmul * unified * unify * reshape/permute * cleanups * simpler * move index * cleanups * more shared	2026-03-21 14:57:21 +08:00
George Hotz	1a2a203f48	add wmma support to amd_copy_matmul (#15384 ) * add wmma support to amd_copy_matmul * 15 TFLOPS and merged * unify * simpler * simpler * simpler * cleanups * TM/TN is the full regs * comments * WAVES_PER_SH + SQTT_EVENT * Add WAVERDY support * no split warp * 3 range	2026-03-20 19:02:19 +08:00
chenyu	da1700e16b	dtypes.index -> dtypes.weakint (#15377 )	2026-03-20 01:08:46 -04:00
George Hotz	4091d37e8e	flat llama step work (#15355 ) * flat llama step work * fp8 support * blacklisted matmul * chestertons fence	2026-03-20 09:06:12 +08:00
George Hotz	6e196195d8	add test for flat llama (#15327 ) * add test for flat llama * simpler * back to split w1/w3 * env * still too much ram * invalid	2026-03-18 15:16:33 +08:00
qazal	5cd1daa3bc	cdna asm_gemm in one file, remove old rdna3 asm (#15281 )	2026-03-16 04:32:30 +09:00
George Hotz	06d7cddb33	amd_copy_matmul is cleaner (#15248 ) * amd_copy_matmul is cleaner * it runs * replicated stuff * add tid there * it runs * cleanup * x.src[1] * flatten * move that * keep that assert	2026-03-14 12:56:09 +08:00
George Hotz	a7d2429c21	amd_uop_matmul more cleanups (#15240 )	2026-03-13 10:24:43 +08:00
George Hotz	e560a46f59	update amd_uop_matmul (#15236 ) * update amd_uop_matmul * use custom kernel * simpler * ignore	2026-03-12 17:33:12 +08:00
wozeparrot	c35de9bd68	asm_gemm: support more sharding (#15002 )	2026-03-02 23:16:37 -08:00
qazal	62ee976c1b	gemm/asm: cleanup repeated patterns to helper functions (#15094 )	2026-03-03 08:14:47 +09:00
qazal	448e997be4	gemm/asm: cleanup custom function args (#15007 )	2026-02-25 22:05:56 +09:00
qazal	f590564bf7	gemm multiple is only for cdna4 asm (#14814 ) * gemm multiple is only for cdna4 asm * move to backend * and arch * path	2026-02-17 14:00:02 +09:00
George Hotz	5bd2862d1a	late compile the cdna gemm (#14783 ) * late compile the cdna gemm * remove old things * finalize inplace --------- Co-authored-by: qazal <qazal.software@gmail.com>	2026-02-17 13:04:22 +09:00
George Hotz	f081f154ae	parameterize the CDNA asm gemm (#14813 ) * parameterize the CDNA asm gemm * fix llama test * fix * add more gemmt ests * confirm all match * test these asm gemms	2026-02-17 11:35:18 +08:00
qazal	c7a4dbf918	viz: get program binary from the UOp (#14787 ) * viz: get program binary from the UOp * remove that * less * rename View Program to View Source * two words * fix	2026-02-16 15:46:58 +09:00
George Hotz	dff9cf35c2	amd asm emulator fixes + run it in CI (#14786 ) * amd asm fix, try 2 * fix tests	2026-02-16 13:24:21 +08:00
qazal	55a4dfa2e0	cdna4 asm_gemm tests in CI on the null backend (#14785 ) * cdna4 asm_gemm tests in CI on the null backend * no .numpy() in null * better * gemm/asm: device comes from renderer	2026-02-16 14:06:23 +09:00
George Hotz	4088d686b2	remove llvm requirement from amd (#14717 ) * remove llvm requirement from amd * tests pass * test * sink kernarg_size * move stuff * amd_asm_matmul to new style * default type * fix tests, simpler * cu mode is faster and simpler * darken	2026-02-13 10:50:12 +08:00
George Hotz	4680247e35	renderer/amd: move in tree (#14702 ) * renderer/amd: move in tree * fix paths in tests * 24000 lines * no delete for amd files	2026-02-12 18:09:16 +08:00
George Hotz	befc1e800c	assembly/amd: disasm is test only (#14694 ) * assembly/amd: disasm is test only * viz uses str	2026-02-12 12:33:46 +08:00
George Hotz	3fab43c57c	add cache to asm gemm (#14675 )	2026-02-11 08:26:30 +08:00
qazal	80b0119cef	llama: add new asm gemm shape (#14611 ) * llama: add new asm gemm shape * work * cleanup * half dtype * more comment	2026-02-10 00:34:29 +09:00
George Hotz	183d38b128	remove CUSTOM_KERNEL / directly construct it (#14604 ) * remove CUSTOM_KERNEL / directly construct it * clean that up * simpler multi * custom kernel spec * remove Kernel * fix multi * use sharded shape * explicit regression test	2026-02-08 18:43:33 +08:00
qazal	cf73d7e2a7	hotfix: disable slower asm gemm shape from llama seqlen 8192 (#14582 )	2026-02-06 15:05:19 +09:00
George Hotz	43e7eda4e7	grad_b uses custom gemm (#14550 ) * grad_b uses custom gemm * fix multi backward, acc is in float32 * test_gemm_batched * square gemm --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com> Co-authored-by: qazal <qazal.software@gmail.com>	2026-02-05 15:22:27 +09:00
qazal	f9cfb64cd9	test asm_gemm in CI (#14551 ) * test asm_gemm in CI * default float16 * use a smaller shape for multi * smaller size * smaller for CI * smaller for ci * need half	2026-02-05 13:32:22 +09:00
chenyu	d57d24c7d4	Buffer.as_buffer -> Buffer.as_memoryview [pr] (#14535 ) it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data	2026-02-04 11:31:11 -05:00
qazal	d1bfbe9ce3	isolate slow llama gemm (#14525 )	2026-02-04 12:20:10 +09:00
qazal	a98c53769a	ASM_GEMM=1 runs the UOp gemm on non cdna (#14516 ) * ASM_GEMM=1 runs the UOp gemm on non cdna tests run on mac in 3 seconds * min diff	2026-02-03 20:42:02 +09:00
qazal	616e9c1483	CDNA assembly gemm in tensor.py with flag (#14310 ) * work * work * the assembly * remove the old one * remove ws bufs, assert splitk * notes cleanup * work * gemm args * gemm in mixins would be nice * add gemm gradient * print counters * the realize is for DEBUG=2 aesthetics * dedup * rewrite to python dsl, no list copies * leave that * add B, M, N, K to gemm name * it's M0 not NULL * fp16 support * test cleanup + more gemms * work from viz * more work * gemm batch_size * xccg path work * tiny comments on the label naming * s_waitcnt	2026-01-31 22:34:14 +09:00
qazal	d69bc5aa1a	make DEV=NULL EMULATE=AMD amd_asm_matmul run (#14460 )	2026-01-31 20:45:24 +09:00
George Hotz	e5df7e640b	fix branches in amd_asm_matmul (#14369 )	2026-01-27 20:48:42 +08:00
qazal	dfefeddeed	add tflops to cdna gemm custom kernel (#14281 )	2026-01-22 12:48:28 +09:00
George Hotz	79c1559f69	amd asm can still be simpler (#14199 ) * amd asm can still be simpler * simpler * V_LANE_ID * simpler * simpler * compact vgpr	2026-01-17 18:40:10 +09:00
George Hotz	50554115ee	fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul (#14196 ) * fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul * immed * wave override * restore ALT * advance sgprs correctly * no helpers * decrease to 192 VGPRs	2026-01-17 11:58:34 +09:00
George Hotz	8a2549d42b	improve amd_asm_matmul + minor VIZ PKTS improvements (#14186 ) * improve amd_asm_matmul + minor VIZ PKTS improvements * fix waitcnt issue * cleanups	2026-01-17 06:56:59 +09:00


227 Commits