Commit Graph

911 Commits

Author SHA1 Message Date
chenyu
7c942418a1 other side of simple out of bound valid case (#6552)
462 -> 455
2024-09-16 23:57:15 -04:00
chenyu
aeaf7894a7 more generic version of #6548 (#6549)
x*(-1)<0 can be generalized to x*(-1)<c, 473 -> 462 valids
2024-09-16 23:17:16 -04:00
chenyu
596f41eb46 simple drop image valid case (#6548)
* simple drop image valid case

started unit test, 530 -> 473 valids

* cleanup
2024-09-16 22:54:07 -04:00
George Hotz
c1b2472dea reorder alu/vectorize (#6538) 2024-09-16 14:28:14 +08:00
George Hotz
42ba887daa remove logic to vectorize reduces (#6536)
* remove logic to vectorize reduces

* fix tests
2024-09-16 14:04:48 +08:00
George Hotz
07bd6e070d add more uops tests for vmin/vmax/const_factor/divides (#6533) 2024-09-16 13:06:31 +08:00
ignaciosica
c447ec2190 Fix amx shape [run_process_replay] (#6524)
* fix amx shape (sz,sz,sz) -> (sz,sz,1)

* revert check
2024-09-16 09:49:55 +08:00
George Hotz
e1b21879a7 minor changes from new expand [run_process_replay] (#6528)
* minor changes from new expand [run_process_replay]

* explain that
2024-09-16 09:48:37 +08:00
chenyu
6be0cc387c _get_add_chain(x) -> _get_chain(x, BinaryOps.ADD) (#6523)
need MUL for valid [run_process_replay]
2024-09-15 10:54:13 -04:00
qazal
89b950c6b3 viz more work (#6517)
* infra

* actually replace the UOp

* extra per rewrite

* dont allow pyint
2024-09-15 16:42:17 +08:00
qazal
f69251c6b4 assert pyint in linearize_uop [run_process_replay] (#6518) 2024-09-15 16:29:05 +08:00
Tim Becker
7c078191ce Misc rewrite perf improvements (#6500)
* Make UOp a normal class and use __slots__

* Use __slots__ in UPat

* Cache dtypes.{min,max}

* Use faster iterables in ops.py

* extend is a lot faster than nested listcomp

Co-authored-by: Roelof van Dijk <3604013+roelofvandijk@users.noreply.github.com>

---------

Co-authored-by: Roelof van Dijk <3604013+roelofvandijk@users.noreply.github.com>
2024-09-13 11:31:50 +08:00
Tim Becker
8c4cab8d6e Even faster enums (#6483)
* Even faster enums

* simpler _generate_next_value impl

* FastEnum in ops only

* Better uniqueness for FastEnum
2024-09-12 20:08:02 +08:00
George Hotz
9543e4c92e more expand prereqs [run_process_replay] (#6499) 2024-09-12 17:46:12 +08:00
George Hotz
327eb12600 folding for vectorized consts [run_process_replay] (#6498)
* folding for vectorized consts [run_process_replay]

* remove that if statement

* inf loop
2024-09-12 17:29:37 +08:00
George Hotz
a532d59bbd gep tuple [run_process_replay] (#6495)
* gep tuple [run_process_replay]

* no inf loop, that goes in expander

* fix ops python

* unbreak gep 0

* fix tests

* fix tests

* VECTORIZE/GEP

* oops, broken
2024-09-12 16:37:31 +08:00
George Hotz
6dfa63cb21 more vconst stuff + gep tuple [run_process_replay] (#6494)
* more vconst stuff [run_process_replay]

* revert that

* fix inf loop
2024-09-12 14:58:14 +08:00
qazal
4507ab8016 more upat styling changes [run_process_replay] (#6492)
* more upat styling

* single to doulbe quotes

* wrap line

* comments
2024-09-12 14:40:16 +08:00
George Hotz
119b0ea4af add UOps.VCONST [run_process_replay] (#6487)
* add UOps.VCONST [run_process_replay]

* VCONST folding

* simpler devectorize

* alu

* revert that type
2024-09-12 14:03:39 +08:00
qazal
4dc9436d63 use more UPat.var and UPat.cvar [run_process_replay] (#6491) 2024-09-12 13:52:41 +08:00
George Hotz
76487a3533 remove nop, use upat [run_process_replay] (#6489)
* remove nop, use upat [run_process_replay]

* mypy passes

* no wonder nothing worked

* fixes
2024-09-12 12:16:19 +08:00
qazal
dda5c63f4a things we can delete after dtypes.void [run_process_replay] (#6480) 2024-09-11 19:21:41 +08:00
George Hotz
bdd0c06f29 add void type to uop (#6471)
* unwrap_dtype maybe

* uopgraph stuff that hardcoded None

* test_ops passes

* dtypes.py fixups

* update test_linearizer and friends

* more ast updates

* test_beam and test_schedule too

* add void type to uop [run_process_replay]

* remove dumb casts

* start making it green

* more cast cleanups

* more cls methods to fix

* regenerate dataset

* split UOp and NOp const

* maybe that too

* fix docs

* update test_uop_symbolic

* test_verify_ast

* new sops with no diff

* meh, type_ignore is alright

* remove that assert

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-11 18:16:28 +08:00
George Hotz
1b4d1823b7 add pyint to DTYPES_DICT [run_process_replay] (#6477)
* add pyint to DTYPES_DICT [run_process_replay]

* also fix uop alu bug

* exclude pyint there too

* ne ne

* force explicit dtype
2024-09-11 17:31:59 +08:00
George Hotz
1cadddee26 Revert "fold lt (#6472)" (#6473)
This reverts commit 81bda4d304.
2024-09-11 15:59:25 +08:00
George Hotz
81bda4d304 fold lt (#6472) 2024-09-11 15:56:57 +08:00
chenyu
d9d1ae7248 more lt folding using gcd (#6469) 2024-09-11 02:09:35 -04:00
chenyu
15c4d4f406 fold unrolled arange div pattern (#6465) 2024-09-10 22:35:52 -04:00
George Hotz
6d195fb653 small changes from new style expand [run_process_replay] (#6462) 2024-09-11 09:10:56 +08:00
qazal
1347e49e82 second iteration on UOps.SWIZZLE (#6451)
* new swizzle

* fix the failing tests

* test a double swizzle

* ci
2024-09-10 14:43:21 +08:00
chenyu
fcc69adfc5 simplify c0*x<c1 for negative int c0,c1 (#6431)
* simplify c0*x<c1 for negative int c0,c1

* fine if rhs is zero
2024-09-09 21:05:53 -04:00
George Hotz
dbd4536167 Revert "add UOps.VALID (#6387)" (#6441)
This reverts commit 8186e4e7d6.
2024-09-09 21:33:00 +08:00
George Hotz
42e5c8335e remove args from min/max [run_process_replay] (#6430)
* remove args from min/max [run_process_replay]

* it's a ConstType

* sconst_like unused

* any const is fine
2024-09-09 18:18:20 +08:00
George Hotz
8186e4e7d6 add UOps.VALID (#6387)
* uops valid

* broke full_shape

* fixup that st (hardcoded asts still red)

* fixup DEFINE_VAR

debug

more debug

* start moving stuff to ast_const

* move test_linearizer

* move test_linearizer_failures to ast_const

* fixup test_schedule

* small diff change

* regenerate dataset

* fixup test_multitensor

* regen dataset try 2

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-09 16:58:43 +08:00
chenyu
ac98f5056e move lt-folding to a function [run_process_replay] (#6422)
and added more tests (some failed to match symbolic)
2024-09-09 02:04:52 -04:00
George Hotz
90fb17304f put rewrite back in ops [run_process_replay] (#6421) 2024-09-09 13:53:51 +08:00
qazal
88941bcf16 fold bitwise noops (#6412)
from `8269a721cd6f5c6030ce120e1139095d7ba117eb`

Co-authored-by: timmy <timmy0x@proton.me>
2024-09-09 10:18:38 +08:00
George Hotz
282af21b95 hotfix: DEBUG_EXPAND -1 and NOOPT in benchmark schedule 2024-09-06 17:22:30 +08:00
qazal
f1bd2a5519 fix BUFFER_UOPS sts in verify_ast [run_process_replay] (#6389) 2024-09-06 16:59:22 +08:00
George Hotz
86d34daac9 UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383) 2024-09-06 12:38:35 +08:00
George Hotz
c88329244b create rewrite.py [run_process_replay] (#6379)
* create rewrite.py [run_process_replay]

* fix tests

* not in rewrite or ops

* skip flaky test
2024-09-06 10:51:01 +08:00
George Hotz
66e7e51c79 Revert beam failure (#6376)
* Revert "late gate creation for STORE [run_process_replay] (#6373)"

This reverts commit c26744de9f.

* Revert "gated store rewrite to UOps.IF (#5976)"

This reverts commit 48061e8400.
2024-09-06 09:36:44 +08:00
ignaciosica
c15506fc35 [WIP] amx support as TC (#5693)
* almost working with relu, even hackable... but acc size is wrong, fix needed

* upcast based on threads, change thread size to 4x4

* revert wrongfully commented assert

* fix tc load indexing

* modify for size 8

* fix bug for size 8

* Revert "fix bug for size 8"

This reverts commit cdb3f5df85.

* Revert "modify for size 8"

This reverts commit 3ef0904bd9.

* good kernel with changes in lowerer

* revert "good kernel with changes in lowerer"

This reverts commit 975e2b5a4e.

* good kernel for relu!

* refactor lowerer changes

* add amx context var to helper

* clean up amx flag

* improve lowerer changes readability

* improve check for amx

* revert lowerer if

* add float4 type rendering for clang

* add amx definitions

* enable indexing for clang if amx

* working amx example, wrong because of dims

* almost works for float 16, need to spot using double load in amx

* cleaner render_kernel

* revert chages in simple_matmul and delete env

* add new var upcast_offset to get_optimized_ast

* change axis for axes

* invert if in rendering phi

* fix some bugs

* fix linearizer tests

* fix vec/get pat for amx

* remove clang tc if amx is disabled

* add ops_python support

* refactor into one complementary function in ops_python

* add job for EMUALTE_AMX

* improve checking for AMX in UPCAST and TC extra ops

* fix lint issue

* commit before refactor into autocontained AMX

* start refactor by removing special rendering for AMX

* all ready for amx handcoded kernel

* working poc, most straightforward amx support

* avoid local opts for tc if amx

* fix merge bugs

* skip test for clang

* skip tc hand-coded opts if amx

* remove hardcoded ops_python values

* remove hardcoded sizes for amx kernel

* fix ops_python bug where dim was hard-coded

* change contract for vectorize

* working without changes in lowerer

* revert changes in gep rendering

* fix ops_python

* modify comment

* skip test if clang for different type accumulation

* move rename and bug for seperate pr

* fix wrong path for test

* addmm not implemented in torch for cpu

* change struct for vector; equally slow but cleaner

* revert modified test

* simply wmma rendering

* minor change

* noqa:501

* add length 16 for AMX

* fix vectorized half issue

* fix error

* remove comment

* change set for dedup

* split test of tensor_core_extra_ops so that cases that dont require locals run for AMX

* add amx reference

* load acc into amx registers

* fix dtype rendering and remove noqa

* moved tests change into another pr

* add real AMX job for CI and fix bug

* fix ops_python bug

* fix test class

* remove real AMX tests and fix uops_stats test

* remove wrong test

* acc folding

* hotfix: bug

* fix float4 tests for amx

* hack for fixing flops counting

* hotfix: mypy

* add flop counts test for amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_amx

* improve test_float4_multidim_unaligned_load_amx

* nits tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-09-06 09:01:10 +08:00
qazal
c26744de9f late gate creation for STORE [run_process_replay] (#6373) 2024-09-06 03:32:19 +08:00
Ian Paul
48061e8400 gated store rewrite to UOps.IF (#5976)
* Core change to gate stores in IFs

* Updates to cstyle renderer to handle IFs around STOREs

* Make uops asserts happy

* Add tests and fix newly broken tests

* make ruff happy

* make mypy happy

* Simplify renderer to have all gated stores use IF

* Revert some changes

* Make test_where_fold happy

* Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out

* Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE

* Re-change broken test

* Make ifs be grouped together

* get non-merged IFs working. ALl tests pass except grouping related ifs together

* Fix tests by making the IF UOp dependent on the correct node of the STORE UOp

* Changes to uopgraph

* Simplify graph rewrite logic

* Changes to get test_padto_where_multireduce working

* Simplify uops.store renderer

* Make test_padto_where_multireduce pass but now other tests fail

* Clean up uopgraph from scrach work

* Ignore sudo IF srcs when rendering

* Attempt to fix llvm tests

* rm comment

* reduce lines

* Add line to make mypy happy :(

* llvmir fix pt 1

* Mods after rebasing to master

* Fix llvmir

* Fix ptx tests

* Fix other ptx tests

* Move changes from uops.py to ops.py

* rm uops.py

* Fix TestGateStoreRewrite tests

* Get multireduce tests working

* reset to remote branch

* Fix linearizer tests

* uop_graph test patch

* Add comment to create_gate

* hotfix: uncomment those tests

* Attempt to fix ptx tests by including whitespace inside if block

* Patch from remote tinybox. Tests passing here

* Min changes to get some ptx tests passsing

* Changes after rebase

* Exclude ifs and endifs from ptx

* IF conditional branching within ptx

* Save lines on delete_redundant_gates

* Simplify merge_gates

* rm noqa

* Remove unnecessary checks when merging gates

* Fix ops error msg

* Smarter check for if/endif in llvmir

* simplify delete redundant gates to only have 2 returns

* spacing

* Smarter check at beginning of merge_gates

* patches from comments

* Remove need for merge_gates

* include proper srcs in IF from the get-go

* test expand ifs dumb will result in 4 ifs, not 1 now

* Make tests happy

* Fix uops stats

* rm merge_gates method. Will add back in separate PR

* Spacing

* cleaner error msg

* Fix uops rendering when expanding. test_failure_43

* patch tests

* undo changes in delete_redundant_gates

* process replay attempt

* re-intro deletion of redundant gates

* fix addition of gates when they get nested in stores and loads

* patch tests

* smarter init of IF srcs when adding gate to STORE

* make ruff happy

* Resp to comment

* include all src[2]'s srcs in IF for gated store

* add reference of the storing value to the gate's src

* minor patch after rebasing

* change ptx renderer

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-09-06 01:05:30 +08:00
George Hotz
e882294c02 uops touchups [run_process_replay] (#6368)
* uops touchups [run_process_replay]

* those are classmethods

* oops, kwargs

* no kwargs there
2024-09-05 17:22:32 +08:00
George Hotz
4a51c28ee7 switch const to const_like [run_process_replay] (#6356)
* const like

* no more _const

* missed one

* mypy ops.py

* file missing

* const_like

* fix image and test uop graph [run_process_replay]

* fix ptx
2024-09-05 13:57:54 +08:00
chenyu
6fd24561d1 distribute MUL const into ADD for int (#6361)
pre-req for real_stride
2024-09-05 01:36:57 -04:00
George Hotz
72be31cb56 remove mla [run_process_replay] (#6357)
* remove mla

* other bad uses of const
2024-09-05 10:37:46 +08:00
George Hotz
365babe391 precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00