* failing test
* cleaner failing tests
* assign and read of same slice shouldn't create copies
* err in the changes
* shrink with no overlapping regions in dest is fine
The memory planner was suballocating BUFFERs created during JIT capture that are still referenced by external lazy tensor graphs, like the .grad tensors assigned by backward(). The replay then only writes the arena slices, so realizing such a tensor after the call reads freshly allocated memory and silently returns zeros. Hold every BUFFER reachable from a live Tensor instead of only the parameters of the return value; true internals are still planned. Fixes #16571.
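
The hold-vs-plan split described above can be pictured with a toy model. This is only an illustrative sketch, not the planner's actual code: `Buf`, `plan`, and the buffer names are hypothetical, and a simple bump allocator stands in for the real planner, which presumably reuses arena space based on liveness.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buf:
    # Hypothetical stand-in for a BUFFER created during JIT capture.
    name: str
    size: int

def plan(captured, live):
    """Split captured buffers into arena-planned internals and held buffers.

    captured: buffers created during capture, in creation order.
    live: buffers reachable from a live Tensor (e.g. a .grad), which must
          keep their own allocation so later realizes read the right memory.
    Returns (offsets, held): arena offsets for internals, and the held list.
    """
    offsets, held, cursor = {}, [], 0
    for buf in captured:
        if buf in live:
            held.append(buf)       # externally referenced: do not suballocate
        else:
            offsets[buf] = cursor  # true internal: place it in the arena
            cursor += buf.size     # bump allocation, no reuse in this sketch
    return offsets, held

captured = [Buf("tmp0", 64), Buf("w.grad", 128), Buf("tmp1", 32), Buf("out", 16)]
live = {Buf("w.grad", 128), Buf("out", 16)}
offsets, held = plan(captured, live)
```

In this sketch, holding only `out` (the return value) would place `w.grad` in the arena, which is the situation the fix avoids.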
* draft
* cleanup test_encodings
* cleanup test_isel
* model flag state and support rematerialization
* woops
* add vbroadcastss instruction
* don't fuse load if used multiple times in src
* add movabs instruction and fix idiv
* fixes
* add x86 backend to tests
* float16 fix
* rm TwoAddress2nd
* add BARRIER
* test windows ci
* yup isel fixes the mask stuff too and its beautiful
* add cmoves to the spec
* support storing imms
* no TUPLE_ORDER, breaks tests
* fix remaining seg faults
* add float max
* always fuse index
* minor
* fix DEFINE_VAR/SPECIAL and enable multithreading
* linter
* more linter
* more
* more
* more
* let's try this
* perhaps
* start new scheduler
* more scheduling info
* cleaner shuffle functions
* fixup isel tests
* skip bounds check when NOOPs exist
* skip inf rewrite tests
* fix const tag hack and add x86ops to _shape
* fix
* skip a few tests
* func arg order independent from op value
* x86 goes in own linearize
* switch to PARAM
* more
* add min x86op and neg in decomps
* do mulacc in isel
* use def_reg in test_encodings
* enable emulated int64 tests
* how much does this fix
* Ops becomes OpType
* fix
* rm noqa
* rm machine scheduler stuff
* and this
* allow for extending enums and move X86Ops out of uop
* fix imports
* rm X86GroupOp from ops.py
* spacing
* tell mypy to shut up
* more linter
* add x86op test
* allow set[X86Ops] in upat
* move NOOPs to pre_isel_matcher and rm NOOP from spec
* more asserts
* also this
* cleanup encode
* simplify live range
* fix idiv
* add Ops.INS to x86
* more changes
* more changes
* more changes
* fix
* fix
* fix
* fix
* print formatted assembly
* fix 8bit idiv?
* oops
* enable float16 and unaligned vector load/store
* actually no
* move x86 tests
* no more bool cast
* fix
* linter
* linter
* move X86Ops to x86.py
* fix vpbroadcast
* cleanups
* linter
* print correct reg names
* canonical max
* move max/min and add test
* support float16 vector load/store
* rm bad rewrite
* vpsrldq can't access memory
* regalloc takes renderer
* enable vector load/store on all dtypes
* more isel tests
* rm this for now
* a lot better
* fix
* fix
* fix
* deal with flags correctly
* fix
* enable gep noop rule
* fix
* fix
* fix
* add callee saved registers
* use Ops.CONST instead of X86Ops.IMM
* fix
* enable TUPLE_ORDER
* fix
* rm x86 code in linearizer
* fix
* fix
* fix
* move isa rewrites to codegen
* fix
* fix
* skip test_linearizer.py
* skip more tests
* fix
* fix for idiv/mod changes
* fix
* don't use fmadd if it duplicates fused op
* hacky
* fix
* cleanups
* cleanups
* fix
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>