1857 Commits

Author SHA1 Message Date
qazal
2363bceb47 viz: no context enters in cli, update llama profile (#15404) 2026-03-22 05:47:02 +09:00
George Hotz
c13d9d29ff add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683 minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
qazal
71ccc69c52 FP8=1 llama works again, hipcc can run on macos (#15394)
* hipcc macos shim

* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
George Hotz
1a2a203f48 add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
chenyu
c491345766 pass device into Tensor._frompy (#15385)
* pass device into Tensor._frompy

with this, canonicalize_device is the only usage of Device in tensor.py

* export_model.py
2026-03-20 05:09:01 -04:00
chenyu
da1700e16b dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e flat llama step work (#15355)
* flat llama step work

* fp8 support

* blacklisted matmul

* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
70dad9d642 add PING to RemoteCmd (#15371)
* add PING to RemoteCmd

* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
ff004d2114 remote: fix mmio (#15347) 2026-03-18 18:20:39 +08:00
George Hotz
6e196195d8 add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
nimlgen
0315faf938 remote bench (#15331) 2026-03-18 14:03:51 +08:00
wozeparrot
b45edeb965 fix: rand supports large tensors (#15329) 2026-03-17 15:45:41 -07:00
qazal
00817cf65e viz: all tests can run on the NULL device (#15328)
* remove that

* move to test_viz

* get_cfg

* do not use os.environ

* hm

* it's always on NULL

* import renderer

* no import *
2026-03-18 04:14:20 +09:00
nimlgen
0a641ce17d system: remote (#15318)
* system: remote

* listen

* print

* fix

* minor
2026-03-17 19:25:37 +08:00
nimlgen
a50fdb0528 nvcc macos (#15308)
* fix nvcc install macos

* um

* arm

* per

* tm
2026-03-17 17:25:33 +08:00
nimlgen
e1c2d09720 system: rebar to remote devs (#15316) 2026-03-17 16:09:12 +08:00
nimlgen
27e29127b5 system: remote prereqs (#15290)
* x

* new format for apl

* this

* typing

* rpc

* tuple

* linter+new tinygpu
2026-03-16 18:45:41 +08:00
nimlgen
e7705fe311 system: pcidev doesn't care about bars (#15284) 2026-03-16 14:45:43 +08:00
nimlgen
ff0bcc8de0 system: iface p1 changes (#15278) 2026-03-16 10:48:25 +08:00
qazal
4445f50356 viz: variable duration rdna barriers (#15277)
* viz: variable length rdna barriers

* work

* tiny changes

* simple wave simd test

* small wave sync test

* good multi barrier bug find

* simple fix

* wave_sync asserts

* rdna4 work

* more rdna4

* find more bugs in my model

* it's so much simpler

* wave_sync tests duration

* r4

* should just call this rdna4
2026-03-16 06:06:19 +09:00
qazal
5cd1daa3bc cdna asm_gemm in one file, remove old rdna3 asm (#15281) 2026-03-16 04:32:30 +09:00
qazal
7b6211fdd7 sqtt: remove discover_ops script (#15279) 2026-03-15 22:17:06 +09:00
qazal
3858bfc83d sqtt: CDNA inst decodes (#15274)
* sqtt: CDNA inst decodes

* JUMP packets other way

* cdna insts

* r3

* r4

* lds from simd1 and simd2
2026-03-14 21:03:46 +09:00
George Hotz
06d7cddb33 amd_copy_matmul is cleaner (#15248)
* amd_copy_matmul is cleaner

* it runs

* replicated stuff

* add tid there

* it runs

* cleanup

* x.src[1]

* flatten

* move that

* keep that assert
2026-03-14 12:56:09 +08:00
nimlgen
bc16f80b50 am: remove dma_regions param (#15251)
* am: remove dma_regions param

* linter
2026-03-13 18:12:48 +08:00
George Hotz
a7d2429c21 amd_uop_matmul more cleanups (#15240) 2026-03-13 10:24:43 +08:00
George Hotz
e560a46f59 update amd_uop_matmul (#15236)
* update amd_uop_matmul

* use custom kernel

* simpler

* ignore
2026-03-12 17:33:12 +08:00
chenyu
842c978df3 remove staticmethod dtypes.max/min (#15227)
always use x.dtype.max/min
2026-03-11 23:11:24 -04:00
qazal
d3eef70162 viz: render shader clock frequency graph (#15197) 2026-03-12 01:32:49 +09:00
nimlgen
086081e35b tbgpu: add stapler to the script (#15180) 2026-03-07 00:07:27 +03:00
qazal
83f1faa142 sqtt: update CDNA wave packet field, start unskipping tests (#15168)
* correct field names

* packet types

* packet 5 is regc

* test skips
2026-03-06 21:37:44 +09:00
Roelof van Dijk
d65923bda5 tensor.py: add normalize function (#15159)
* tensor.py: add normalize function

* p==0 should match torch
2026-03-05 18:55:53 +08:00
wozeparrot
be23772d43 llama3 fixes part2 (#15150) 2026-03-04 23:43:50 -08:00
qazal
33a1970045 sqtt: simplify inst mapping, validate JUMP processing in CI (#15139)
* jump cleanup

* assert there's a JUMP

* new example for JUMP

* regenerate examples

* rdna4 work

* new packets

* work

* less for branch handling

* less verbose

* fix err message
2026-03-05 09:53:12 +09:00
nimlgen
cdc48da9cd hevc: assert and speed (#15122)
* hevc: assert and speed

* simpler
2026-03-04 19:01:02 +03:00
wozeparrot
4e9b85ecfd fa: pull inputs out of call (#15127) 2026-03-04 03:15:49 -08:00
George Hotz
8ebd24637b fix fa forward building with clang 22 (#15124)
* fix fa forward building with clang 22

* fix: override rocm path

---------

Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
wozeparrot
df23057984 fa: change bwd grid dim + unshuffle using mops (#15068) 2026-03-04 01:23:40 -08:00
qazal
8dd691761d sqtt: remove old files (#15108) 2026-03-03 22:43:24 +09:00
Christopher Milan
de043226ba benchmark comma usbgpu driving_vision step and load time (#15103)
Co-authored-by: Comma Device <device@comma.ai>
2026-03-03 06:08:03 -05:00
wozeparrot
c35de9bd68 asm_gemm: support more sharding (#15002) 2026-03-02 23:16:37 -08:00
qazal
62ee976c1b gemm/asm: cleanup repeated patterns to helper functions (#15094) 2026-03-03 08:14:47 +09:00
nimlgen
dfa180413d tbgpu: sign nv (#15087) 2026-03-02 22:58:30 +03:00
chenyu
71f228f80f test exact kernel count in torch_backend/test_kernel_fusion (#15091) 2026-03-02 14:26:32 -05:00
qazal
f7aeff6061 viz: cli.py cleanups, do not require PYTHONPATH (#15085)
* cleanup the print

* sys.exit

* equal check

* cleanup unpacker

* cli doesn't need PYTHONPATH

* no semicolons

* %s/PYTHONPATH=. //g
2026-03-02 19:24:38 +09:00
qazal
b8a55d5f68 sqtt: new packet types, add discovery script (#14960) 2026-02-28 04:27:27 +09:00
qazal
448e997be4 gemm/asm: cleanup custom function args (#15007) 2026-02-25 22:05:56 +09:00
wozeparrot
8d9545e09e llama3: correctly shard wqkv (#14978) 2026-02-23 23:57:10 -08:00
wozeparrot
25565b2410 fa: test for mp (#14907) 2026-02-22 21:47:36 -08:00