114 Commits

Author SHA1 Message Date
George Hotz
2f970a4fc2 all realize 2 (#4527)
* all realize 2

* tests fixup

* fix more tests

* fix openpilot

* fix tests

* unneeded
2024-05-10 22:43:09 -07:00
George Hotz
347a3acb37 add renderer class (#4524)
* add renderer class

* tests pass

* fix pylint

* fix tensor cores
2024-05-10 21:40:02 -07:00
qazal
00c309dfe2 trigger tc in remu (#4479) 2024-05-08 23:23:46 +03:00
Francis Lam
47750e65fd kernel: un-reverse the order of the local indices (#4454)
no change to performance or behavior.  new LOCALS are added to the
left side of the LOCALS block (to the left of the first_reduce).
2024-05-06 15:21:27 -04:00
Timmy
3f3c973022 Multiple Reduce Kernels - kernel properly orders reduceops (#4418)
* enable kernel with multiple reduceops

* copy self.reduceops

* assert only one reduceop per kernel

* kernel.py dfs order

* linters

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-05-06 13:54:44 +03:00
chenyu
afe020710d disable PADTO on upcasted axis (#4444)
fixed test_failure_31. PADTO upcasted is at best a no-op, and might fail at edge cases.
2024-05-05 21:52:03 -04:00
chenyu
d4062cb6fc NV tensor_cores in kernel.py (#4399) 2024-05-02 22:33:08 -04:00
Francis Lam
5c5b40880f search: fix edge cases on screening potential ops (#4394)
* search: fix edge cases on screening potential ops

won't change correctness, but will save a little python time by
properly deduplicating potential actions

* check for de-duplication instead of exact valid actions

* refactor long line
2024-05-02 14:53:05 -04:00
Francis Lam
0d33c54d99 kernel: change PADTO check to allow up to 4x padding (#4354)
* kernel: change PADTO check to allow up to 4x padding

also optionally remove PADTO from the search action space with
BEAM_PADTO=0.

* fix test_linearizer test_tensor_cores_padded tests

* update resnet runs to use SPLIT_REDUCEOP=1

* fix up search TC axis and amt checking

* fix up the dimensions of the TC tests
2024-04-30 15:29:34 -04:00
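Note on #4354: the commit body says the PADTO check now allows up to 4x padding and that `BEAM_PADTO=0` removes PADTO from the search action space. The exact condition in kernel.py is not shown in this log, so the following is only a minimal sketch of one way such a bound could be written. `round_up`, `padto_allowed`, `padto_actions`, and the `max_factor` parameter are hypothetical names chosen for illustration.

```python
def round_up(size: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= size."""
    return (size + multiple - 1) // multiple * multiple

def padto_allowed(size: int, multiple: int, max_factor: int = 4) -> bool:
    # Hypothetical rule: accept the pad only if the padded axis is at most
    # max_factor times the original axis size.
    return round_up(size, multiple) <= size * max_factor

def padto_actions(size: int, multiples: list, beam_padto: bool = True) -> list:
    # beam_padto=False mirrors the idea of BEAM_PADTO=0: no PADTO actions
    # are offered to the search at all.
    if not beam_padto:
        return []
    return [m for m in multiples if padto_allowed(size, m)]
```

Under this assumed rule, an axis of size 7 may be padded to 8 or 16 but not to 32, since 32 exceeds 4 * 7.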
Francis Lam
a9a1fa6bbf wmma: add reduce axis choice to TC action space (#4328)
* wmma: add reduce axis choice to TC action space

* add test for TC multi-reduce axis choice
2024-04-29 19:15:39 -04:00
Francis Lam
1f2642c73b kernel: fix calculation of smem size to ignore UNROLL (#4308)
* kernel: fix calculation of smem size to ignore UNROLL

* simplify prod array
2024-04-26 14:34:56 -04:00
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
chenyu
a1133beb80 KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
George Hotz
599eb266b1 optionally use a copy kernel instead of SDMA (#4116)
* optionally use a copy kernel

* lazyops in copied kernels

* add sync

* no sdma at all

* work

* copy_ast
2024-04-12 23:10:41 -07:00
chenyu
06bcae13b4 PADTO SUM if parents of sum are all zero-preserving (#4140)
* PADTO SUM if parents of sum are all zero-preserving

* test case unsafe ops after sum is fine

* reuse UNSAFE_PAD_OPS

* update db version
2024-04-10 22:16:12 -04:00
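Note on #4140: the idea in the title is that padding an axis with zeros does not change a SUM as long as every op applied before the sum maps 0 to 0. A small plain-Python sketch of that reasoning follows. It does not use tinygrad, and `zero_preserving` and `padded_sum` are hypothetical helpers written for illustration only.

```python
import math

def zero_preserving(fn) -> bool:
    # An elementwise op is safe to apply to zero padding if f(0) == 0.
    return fn(0.0) == 0.0

def padded_sum(xs, fn, pad_to: int) -> float:
    # Pad the input with zeros up to pad_to elements, apply fn elementwise,
    # then reduce with a sum.
    padded = list(xs) + [0.0] * (pad_to - len(xs))
    return sum(fn(x) for x in padded)

def exp2(x: float) -> float:
    return 2.0 ** x

data = [1.0, 4.0, 9.0]
safe = padded_sum(data, math.sqrt, 8)   # sqrt(0) == 0, padding adds nothing
unsafe = padded_sum(data, exp2, 8)      # exp2(0) == 1, each pad element adds 1
```

Here `safe` equals the unpadded sum, while `unsafe` is larger by the number of padded elements, which is why an op like exp2 before the sum makes the pad unsafe.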
chenyu
c1cffed1df add LazyOp.dtype (#4073)
an inferred cached_property.
removed all cases that use get_lazyop_info just to get the dtype of an op.
prereq to remove InterpretedFlopCounter
2024-04-04 17:38:19 -04:00
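Note on #4073: the body describes the dtype as "an inferred cached_property". A minimal sketch of that pattern with `functools.cached_property` is shown here. The `Node` class, its fields, and the inference rule (comparisons yield "bool", everything else inherits from the first source) are assumptions for illustration and are not tinygrad's LazyOp.

```python
from functools import cached_property

class Node:
    """Hypothetical op node: leaves carry an explicit dtype, others infer it."""
    def __init__(self, op, src=(), dtype=None):
        self.op = op
        self.src = tuple(src)
        self._leaf_dtype = dtype
        self.infer_calls = 0  # counts how often inference actually runs

    @cached_property
    def dtype(self):
        # Runs once per instance; the result is then stored on the instance.
        self.infer_calls += 1
        if not self.src:
            return self._leaf_dtype
        if self.op == "CMPLT":
            return "bool"
        return self.src[0].dtype

a = Node("LOAD", dtype="float")
b = Node("ADD", (a, a))
c = Node("CMPLT", (b, a))
```

Reading `b.dtype` repeatedly triggers the inference only once, which is the point of replacing a separate info pass with a cached property.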
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
chenyu
d9ff636cf5 use is to compare with enum (#3993)
* use is to compare with enum

currently it's mixed between `==` and `is`, moved all to `is`

* more
2024-03-29 13:02:56 -04:00
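Note on #3993: Python `Enum` members are singletons, so identity comparison with `is` is well defined. A short sketch of the convention the commit settles on is given here. `DemoOp` and `is_upcast` are hypothetical names, not taken from the codebase.

```python
from enum import Enum, auto

class DemoOp(Enum):  # hypothetical enum, for illustration only
    UPCAST = auto()
    LOCAL = auto()

def is_upcast(op: DemoOp) -> bool:
    # Each member exists exactly once, so `is` compares identity and
    # cannot be affected by a custom __eq__.
    return op is DemoOp.UPCAST
```

Lookups by name or value return the same member object, so `DemoOp["UPCAST"] is DemoOp.UPCAST` holds as well.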
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
Francis Lam
7c5729a3bd wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
chenyu
f6ff76be21 check only upcast int amount in upcasted_axis (#3938)
fixed typing and fixed #3932
2024-03-26 12:54:57 -04:00
chenyu
10673d1447 tiny search cleanup (#3910)
* tiny search cleanup

removed some `assert isinstance(dev, Compiled)` and lines

* remove import
2024-03-24 14:20:55 -04:00
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
5dd048a378 remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
qazal
e3e89c244b multioutput uoping infra (#3706)
* linearize multioutput

* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
90e55a9fd1 fix buf_index not found case in _apply_tc_opt (#3739)
ValueError if src.src[0] is not a LOAD. Replaced with returning None in _apply_tc_opt and test to make sure the net output is KernelOptError.
2024-03-14 14:27:05 -04:00
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
qazal
aec4c4f01b linearizer ast as a tuple of lazyops (#3689)
* multi store op linearizer

* currently we do only one output per kernel

* named opts
2024-03-11 15:39:04 -07:00
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
be more specific about invalid kernel opt, used that in test_linearizer_failures.

make BEAM kernel search work even with assertion disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
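Note on #3661: `python3 -O` strips `assert` statements, so a search that relies on `AssertionError` to reject invalid options stops rejecting them. Raising a dedicated exception avoids that. The sketch below assumes a helper named `check` and toy functions `try_upcast` and `valid_amounts`, all hypothetical. Only the exception name `KernelOptError` comes from the commit title.

```python
class KernelOptError(Exception):
    """Raised when an optimization cannot be applied to a kernel."""

def check(cond: bool, msg: str = "") -> None:
    # Unlike `assert cond, msg`, this still raises when Python runs with -O.
    if not cond:
        raise KernelOptError(msg)

def try_upcast(dim_size: int, amount: int) -> int:
    # Toy validity rule: the amount must be positive and divide the dimension.
    check(amount > 0 and dim_size % amount == 0,
          f"{dim_size} is not divisible by {amount}")
    return dim_size // amount

def valid_amounts(dim_size: int, candidates: list) -> list:
    # A search loop keeps only the candidates that apply cleanly.
    ok = []
    for amt in candidates:
        try:
            try_upcast(dim_size, amt)
            ok.append(amt)
        except KernelOptError:
            continue
    return ok
```

Catching the specific exception also keeps unrelated bugs, which would surface as other exception types, from being silently treated as invalid options.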
George Hotz
6e50582e62 working to improve ptx (#3647)
* working to improve ptx

* fix compile fail
2024-03-07 12:39:31 -08:00
nimlgen
3db826e195 hsa in lin opts (#3602) 2024-03-04 06:17:32 -08:00
Francis Lam
7c90005c65 search: hotfix to make sure TC behavior is all in applied_opts (#3598)
* search: hotfix to make sure TC behavior is all in applied_opts

* fix linter error

* fix mypy
2024-03-03 21:44:38 -05:00
Francis Lam
9642a8f547 search: add BEAM UPCAST/LOCAL params and loosen TC criteria during BEAM (#3563) 2024-03-02 03:11:25 -08:00
Francis Lam
e17f1821a7 wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
Francis Lam
5d434801fa search: add tensor core to beam search space (#3275)
* search: add tensor core to beam search space

* kernel: refactor apply_tensor_core into apply_opt and hand_coded

* kernel: revert removal of apply_tensor_cores

also revert BEAM search parameter changes
2024-02-29 13:05:10 -08:00
George Hotz
7698781389 Revert "wmma: add CUDA tensor core (#3464)" (#3474)
This reverts commit e9cef13f0b.
2024-02-22 11:58:16 +01:00
Francis Lam
e9cef13f0b wmma: add CUDA tensor core (#3464) 2024-02-22 11:57:08 +01:00
chenyu
230fc33d5b limit sint to be Union[int, Variable, MulNode, SumNode] (#3430)
* limit sint to be Union[int, Variable, MulNode, SumNode]

these are the only allowed nodes in a Tensor shape

* stride can be sint
2024-02-16 10:05:46 -05:00
George Hotz
a40df14fef ops_ext to replace cpu import (#3409)
* ops_ext to replace cpu import

* don't allow zero copy with as buffer

* memoryview(bytearray

* reenable test

* fix jit issue
2024-02-15 13:03:42 +01:00
George Hotz
6356474d6d Revert "ops_ext to replace cpu import (#3406)" (#3408)
This reverts commit 91eb93f85a.
2024-02-15 12:16:10 +01:00
George Hotz
91eb93f85a ops_ext to replace cpu import (#3406)
* ops_ext to replace cpu import

* don't allow zero copy with as buffer

* memoryview(bytearray

* reenable test
2024-02-15 12:14:58 +01:00
Francis Lam
668324d92b wmma: protect TC locals from modification and use only LOCAL (#3379)
also remove unnecessary upcast_dim from tensor_core and calculate
it from the dimensions and thread sizes
2024-02-13 10:19:35 +01:00
Francis Lam
ddb22a60c8 linearizer: fix up edge case bugs in UNROLL opt (#3362)
Fully UNROLLing the first_reduce should not change the number of
local_dims.

Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.

Also changed group_for_reduces to be a count as the axis number
isn't used anywhere (they are always the first reduce dims).
2024-02-10 11:49:25 +01:00
George Hotz
c32ea95d7d Python uop emulator (#3327)
* start uop emu

* tiny_add passes

* more ops

* emulate the whole warp

* test_gemm passes

* metal gemm test pass

* works on big gemm

* works on big gemm

* more tests pass

* touch ups

* fix mypy

* cleanups

* exp2 mypy

* arch is where it belongs

* actually emulate tensor cores

* fix test

* new style
2024-02-08 19:24:55 +01:00
Francis Lam
927f2dd24d wmma: add HIP FP16 to FP16 tensor core (#3287)
* wmma: add HIP FP16 to FP16 tensor core

* test: fix test_tensor_core to use separate tolerances for half
2024-01-31 23:00:51 -05:00
George Hotz
9e17378b60 Fix metal tests (#3266)
* small fixes for tests on mac

* remove device from TensorCore
2024-01-27 18:09:42 -08:00
George Hotz
3c728d1082 compiler support (#3260)
* compiler support

* revert that

* fix tests
2024-01-26 23:36:40 -08:00
George Hotz
83d614295e reduce lines (#3230) 2024-01-24 10:35:59 -08:00
chenyu
e139ae550d smaller limit_dims_to_max (#3167)
same questionable logic, but fewer lines now
2024-01-18 13:02:20 -05:00