25 Commits

Author SHA1 Message Date
chenyu
720f20865b remove required_optimizations (#9848) 2025-04-19 16:51:16 -04:00
Sieds Lykles
07d1aefaf4 fast idiv (#9755)
* fast idiv with tests and fuzzer

* Add todo comment

* Add env variable to toggle fast_idiv

* Move env check

* Add fuzz fast_idiv to ci

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
George Hotz
28e06d2d44 minor cleanups from patternmatcher [pr] (#9756) 2025-04-07 11:28:14 +08:00
George Hotz
b719aa1fb0 only check once for divisible fold lengths (#9732) 2025-04-04 11:27:34 +08:00
George Hotz
8206c7281e move const multiply after REDUCE (#9730) 2025-04-04 11:07:46 +08:00
George Hotz
cac8bcf8b5 use Ops.REDUCE (#9721)
* decrease bert python time [pr]

* order copies

* Revert "order copies"

This reverts commit 3f62c8693b.

* rewrite count

* Ops.REDUCE

* acc first in the add chain

* Fix tensor core acc

* arange patterns look good

* fix multireduce gate

* reduce rewrite rule

* bump that to 15 minutes

* multiwmma isn't fusing

* gep through wmma is gep pushing

* bump that timeout too, it's all env setup

* add failing test
2025-04-04 10:14:34 +08:00
George Hotz
5c7b549eab use functools.cache instead of lru_cache(None) [pr] (#9714)
* use functools.cache instead of lru_cache(None) [pr]

* more cache
2025-04-03 11:47:13 +08:00
George Hotz
1714fc3ba4 start work on speed [pr] (#9707)
* fix get_location

* fix get_location try 2

* clean up split_load_store [pr]

* SHR fixup [pr]
2025-04-03 10:39:01 +08:00
George Hotz
6f812d3f2f fixes from the dsp branch + 12500 lines (#9683)
* fixes from the dsp branch

* more changes

* those are gep pushing
2025-04-02 13:07:17 +08:00
George Hotz
49b1c46d16 good changes from the dsp branch (#9638) 2025-03-31 13:02:53 +08:00
George Hotz
d62ced8981 symbolic -> symbolic_flat (#9588) 2025-03-26 23:34:43 +08:00
George Hotz
8aaa5e1ec5 generate the individual indexes (#9587) 2025-03-26 22:32:06 +08:00
George Hotz
5c6cd884e3 multiple simplifies is faster [pr] (#9586)
* multiple simplifies is faster [pr]

* cleanup

* cleanup
2025-03-26 21:42:52 +08:00
qazal
0b20f91ce7 remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
qazal
2223b93338 add UPat.or_casted [pr] (#9513) 2025-03-20 10:08:32 +08:00
George Hotz
824c5f41ac dsp work try 3 (#9475)
* dsp work try 3

* padding
2025-03-17 16:42:12 +08:00
George Hotz
242daa4f9a ptrcat (#9473) 2025-03-17 16:06:37 +08:00
George Hotz
52ae9af4dd Fast DSP for MobileNetV2 (try 2) (#9467)
* Fast DSP for MobileNetV2 (try 2)

* enable fast path on uchar

* fix tests
2025-03-17 15:10:36 +08:00
George Hotz
bfc68d1953 add gep rules to simplify (#9419)
* add gep rules to simplify

* ws

* flipped direction
2025-03-13 09:46:25 +08:00
George Hotz
5f6d5b057d expand index isn't grouping by access size (#9418)
* expand index isn't grouping by access size

* split_load_store

* scalar vec

* +correct_load_store

* vectorized and

* correct_load_store always

* simplify before divides
2025-03-12 17:24:10 +08:00
George Hotz
815ad0b7a8 support load/store grouping in DEVECTORIZE=0 (#9409) 2025-03-12 11:34:37 +08:00
George Hotz
e174c6c3bc new devectorizer (#9331)
* new devectorizer

* lidx

* test linearizer passes

* fix images

* fix unfoldable image load

* delete unused

* improve fix_unfoldable_image_load

* working for image

* fixup types

* fixup transcendental

* cast_vec

* cleaner transcendental

* skip failing test

* err, flip that

* not devec

* sqrt
2025-03-11 18:47:56 +08:00
Eitan Turok
d657d5f754 [Bounty] Vectorize Transcendental (#9058)
* init

* cast everythig right

* more casting

* install pillow in test

* quick tests

* simplify

* quick tests

* delete test

* tests

* fix import error

* add vec to ldexp3k

* vec for bitcast

* some helper tests

* high level tests

* clean tests

* change tolerance so cuda passes

* ruff passes

* remove tests for transcendental helpers

* ruff passes

* make exponent in power vectorized

* fix pow test

* add newline

* add vec dtype to ilogb2k

* comment + clean up

* ruff

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-28 15:47:25 +08:00
Sieds Lykles
9c4d9d9f10 Acc first (#9232)
* put acc in front of the add chain

* handle the other case

* Make loop collapse more generic

* Remove mulacc_unrolled

* Actually remove it

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 22:10:15 -05:00
George Hotz
df3b320f46 rewriter -> devectorizer [pr] (#9147) 2025-02-18 12:42:08 +08:00