chenyu
720f20865b
remove required_optimizations (#9848)
2025-04-19 16:51:16 -04:00
Sieds Lykles
07d1aefaf4
fast idiv (#9755)
...
* fast idiv with tests and fuzzer
* Add todo comment
* Add env variable to toggle fast_idiv
* Move env check
* Add fuzz fast_idiv to ci
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
George Hotz
28e06d2d44
minor cleanups from patternmatcher [pr] (#9756)
2025-04-07 11:28:14 +08:00
George Hotz
b719aa1fb0
only check once for divisible fold lengths (#9732)
2025-04-04 11:27:34 +08:00
George Hotz
8206c7281e
move const multiply after REDUCE (#9730)
2025-04-04 11:07:46 +08:00
George Hotz
cac8bcf8b5
use Ops.REDUCE (#9721)
...
* decrease bert python time [pr]
* order copies
* Revert "order copies"
This reverts commit 3f62c8693b.
* rewrite count
* Ops.REDUCE
* acc first in the add chain
* Fix tensor core acc
* arange patterns look good
* fix multireduce gate
* reduce rewrite rule
* bump that to 15 minutes
* multiwmma isn't fusing
* gep through wmma is gep pushing
* bump that timeout too, it's all env setup
* add failing test
2025-04-04 10:14:34 +08:00
George Hotz
5c7b549eab
use functools.cache instead of lru_cache(None) [pr] (#9714)
...
* use functools.cache instead of lru_cache(None) [pr]
* more cache
2025-04-03 11:47:13 +08:00
George Hotz
1714fc3ba4
start work on speed [pr] (#9707)
...
* fix get_location
* fix get_location try 2
* clean up split_load_store [pr]
* SHR fixup [pr]
2025-04-03 10:39:01 +08:00
George Hotz
6f812d3f2f
fixes from the dsp branch + 12500 lines (#9683)
...
* fixes from the dsp branch
* more changes
* those are gep pushing
2025-04-02 13:07:17 +08:00
George Hotz
49b1c46d16
good changes from the dsp branch (#9638)
2025-03-31 13:02:53 +08:00
George Hotz
d62ced8981
symbolic -> symbolic_flat (#9588)
2025-03-26 23:34:43 +08:00
George Hotz
8aaa5e1ec5
generate the individual indexes (#9587)
2025-03-26 22:32:06 +08:00
George Hotz
5c6cd884e3
multiple simplifies is faster [pr] (#9586)
...
* multiple simplifies is faster [pr]
* cleanup
* cleanup
2025-03-26 21:42:52 +08:00
qazal
0b20f91ce7
remove move_mask from the devectorizer (#9511)
...
* remove move_mask from the devectorizer
* add (wrong) ptx
* reason
* enable index addition in PTX, we won't have the INDEX anyways
* space
2025-03-20 11:53:12 +08:00
qazal
2223b93338
add UPat.or_casted [pr] (#9513)
2025-03-20 10:08:32 +08:00
George Hotz
824c5f41ac
dsp work try 3 (#9475)
...
* dsp work try 3
* padding
2025-03-17 16:42:12 +08:00
George Hotz
242daa4f9a
ptrcat (#9473)
2025-03-17 16:06:37 +08:00
George Hotz
52ae9af4dd
Fast DSP for MobileNetV2 (try 2) (#9467)
...
* Fast DSP for MobileNetV2 (try 2)
* enable fast path on uchar
* fix tests
2025-03-17 15:10:36 +08:00
George Hotz
bfc68d1953
add gep rules to simplify (#9419)
...
* add gep rules to simplify
* ws
* flipped direction
2025-03-13 09:46:25 +08:00
George Hotz
5f6d5b057d
expand index isn't grouping by access size (#9418)
...
* expand index isn't grouping by access size
* split_load_store
* scalar vec
* +correct_load_store
* vectorized and
* correct_load_store always
* simplify before divides
2025-03-12 17:24:10 +08:00
George Hotz
815ad0b7a8
support load/store grouping in DEVECTORIZE=0 (#9409)
2025-03-12 11:34:37 +08:00
George Hotz
e174c6c3bc
new devectorizer (#9331)
...
* new devectorizer
* lidx
* test linearizer passes
* fix images
* fix unfoldable image load
* delete unused
* improve fix_unfoldable_image_load
* working for image
* fixup types
* fixup transcendental
* cast_vec
* cleaner transcendental
* skip failing test
* err, flip that
* not devec
* sqrt
2025-03-11 18:47:56 +08:00
Eitan Turok
d657d5f754
[Bounty] Vectorize Transcendental (#9058)
...
* init
* cast everything right
* more casting
* install pillow in test
* quick tests
* simplify
* quick tests
* delete test
* tests
* fix import error
* add vec to ldexp3k
* vec for bitcast
* some helper tests
* high level tests
* clean tests
* change tolerance so cuda passes
* ruff passes
* remove tests for transcendental helpers
* ruff passes
* make exponent in power vectorized
* fix pow test
* add newline
* add vec dtype to ilogb2k
* comment + clean up
* ruff
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-28 15:47:25 +08:00
Sieds Lykles
9c4d9d9f10
Acc first (#9232)
...
* put acc in front of the add chain
* handle the other case
* Make loop collapse more generic
* Remove mulacc_unrolled
* Actually remove it
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 22:10:15 -05:00
George Hotz
df3b320f46
rewriter -> devectorizer [pr] (#9147)
2025-02-18 12:42:08 +08:00