* search: fix edge cases on screening potential ops
won't change correctness, but will save a little python time by
properly deduplicating potential actions
* check for de-duplication instead of exact valid actions
* refactor long line
* kernel: change PADTO check to allow up to 4x padding
also optionally remove PADTO from the search action space with
BEAM_PADTO=0.
* fix test_linearizer test_tensor_cores_padded tests
* update resnet runs to use SPLIT_REDUCEOP=1
* fix up search TC axis and amt checking
* fix up the dimensions of the TC tests
* wmma: widen TC usage in search by using PADTO on TC axes when possible
* test: start tests for the new padding TC behavior
* search: upgrade padded TC search to TC_OPT >= 2
* test: add behavior and correctness test for padded TC
added optional argument to apply_tensor_core to set TC_OPT level
* linearizer: add tests for the PADTO behvaior and docs
* tensor cores
* Merge from master
* faster program start in llvm (#3897)
* Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* touchup einsum (#3900)
don't need rhs_letters
* hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
* replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
* add --minimal flag to nvrtc (#3899)
* wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias. now it splits it properly into 8 and the
remaining 2 into the correct local stride
* training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA
memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.
* simpler bf16 functions
* bf16 cifar works for HSA too just very slow
* simpler bf16 functions, we love cuda
* include negative float in test_dtype (#3884)
* include negative float in test_dtype
* that is ub
* too annoying
* pack can overflow
* add to benchmark
* change var name to satisfy mypy
* spacing
* Update to new TensorCore format
* Spacing
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias. now it splits it properly into 8 and the
remaining 2 into the correct local stride
* remove HIP in core tinygrad
ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag
* EMULATE_CUDA
be more specific about invalid kernel opt, used that in test_linearizer_failures.
make BEAM kernel search work even with assertion disabled.
`BEAM=2 python3 -O examples/llama.py --temperature=0 --count=10 --prompt="Hello." --timing`
* search: add tensor core to beam search space
* kernel: refactor apply_tensor_core into apply_opt and hand_coded
* kernel: revert removal of apply_tensor_cores
also revert BEAM search parameter changes
Fully UNROLLing the first_reduce should not change the number of
local_dims.
Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.
Also changed group_for_reduces to be a count as the axis number
isn't used anywhere (they are always the first reduce dims).
* start uop emu
* tiny_add passes
* more ops
* emulate the whole warp
* test_gemm passes
* metal gemm test pass
* works on big gemm
* works on big gemm
* more tests pass
* touch ups
* fix mypy
* cleanups
* exp2 mypy
* arch is where it belongs
* actually emulate tensor cores
* fix test
* new style