Releases: apache/tvm

Apache TVM v0.10.0

17 Oct 17:44

Introduction

The TVM community has worked since the v0.9 release to deliver the following exciting new improvements!

  • Metaschedule
    • Software pipelining and padding for irregular shapes for auto tensorization
    • Stabilized and polished user-interfaces (e.g. database changes, tune_relay)
    • A new MLP-based cost model
  • TIR
    • New schedule primitive for PadEinsum
    • A new TIR node: DeclBuffer
    • INT8 Intrinsics for TensorCores for CUDA!
  • microTVM
    • Improved schedule primitives for ARM v8-m ISA

And many other general improvements to code quality, TVMScript, and more! Please visit the full listing of commits for a complete view: v0.9.0...v0.10.0rc0.

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Please visit the full listing of commits for a complete view: v0.9.0...v0.10.0rc0.

Note that this list is not comprehensive of all PRs and discussions since v0.9. A non-truncated summary can be found here: #12979

TIR

  • #12720 - [TIR] Implement API for padded layout transformations
  • #12797 - [TIR] Construct the inverse in SuggestIndexMap
  • #12827 - [TIR] Support pattern matching argmax/argmin generated by TOPI
  • #12750 - [TIR, Schedule] Add schedule primitive PadEinsum
  • #11639 - [TIR][Meta-Schedule] Tuple-reduction scheduling support
  • #12515 - [TIR][Arith] Add more strict checking in imm construction and folding.
  • #12717 - [TIR, Schedule] Check consumer in-bound and covered in reverse_compute_inline
  • #12652 - [TIR] Handle axis_separators during FlattenBuffer
  • #12623 - [TIR] Expose MMA-related PTX builtins
  • #12607 - [TIR][Schedule] enhance compute_at and reverse_compute_at primitive to choose possible position
    ...

Apache TVM v0.9.0

14 Jul 22:33
d361585

Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

  • MetaSchedule's full implementation
  • ARM cascading scheduler for Arm Ethos(TM)-U NPUs
  • Collage which brings tuning to BYOC
  • Several microTVM improvements
  • New tvm.relay.build parameters: runtime= and executor=
  • AOT - Support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime
  • Hexagon RPC support
    • Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
    • AOT and USMP support
    • Threading
    • Initial op support
  • MLF - Support for multiple modules in a single MLF artifact
  • Several TIR schedule primitives and transforms including (abridged):
    • schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.
    • schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.
    • schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
    • transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.
    • transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
    • transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA with async copy when annotated tir::attr::async_scope.
    • transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
  • And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
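
To make the "prologue, body and epilogue" wording for transform.InjectSoftwarePipeline more concrete, here is a minimal pure-Python sketch. It is not TVM code and does not call any TVM API; the names `load`, `compute`, `run_sequential`, and `run_pipelined` are hypothetical. It only shows how a two-stage producer/consumer loop can be reordered so that the load for iteration i + 1 is issued next to the compute for iteration i.

```python
def load(src, i):
    """Producer stage: fetch element i (stands in for a memory copy)."""
    return src[i]


def compute(x):
    """Consumer stage: do some work on a previously loaded value."""
    return x * 2


def run_sequential(src):
    """Original loop: each iteration loads, then computes."""
    return [compute(load(src, i)) for i in range(len(src))]


def run_pipelined(src):
    """The same loop split into a prologue, a body, and an epilogue."""
    n = len(src)
    out = []
    if n == 0:
        return out
    # Prologue: issue the first load before any compute happens.
    staged = load(src, 0)
    # Body: the load for iteration i + 1 is paired with the compute for i.
    for i in range(n - 1):
        next_staged = load(src, i + 1)
        out.append(compute(staged))
        staged = next_staged
    # Epilogue: drain the last staged value.
    out.append(compute(staged))
    return out
```

Both functions return the same result; the pipelined form simply exposes the overlap that hardware with asynchronous copies could exploit.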

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list is not comprehensive of all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: v0.8.0...v0.9.0.rc0.

AOT

  • #11208 - Calculate used memory at the callsite of primitive functions
  • #11365 - Fix function number datatype from char to uint16_t
  • #11091 - Enable A-Normal Form in the AOT executor
  • #10753 - Support LLVM backend with C++ runtime
  • #10518 - Use python temporary directory for AOT tests
  • #10337 - BugFix of workspace calculation
  • #10282 - [runtime] Add Metadata classes for AOTExecutor
  • #9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
  • #9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
  • #9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

Apache TVM v0.8 Release Note

24 Nov 17:14
7b3a22e

Overview

Apache TVM v0.8 brings several major exciting experimental features, including:

  • PaddlePaddle frontend
  • TVMScript: round-trippable python-based syntax for TIR
  • TorchScript integration
  • TensorIR scheduling language
  • TensorRT and CUTLASS integration via BYOC
  • Int4 TensorCore support in AutoTVM
  • MicroTVM Project API and Zephyr, Arduino support
  • AOT executor
  • Robust Windows support
  • Affine analysis infra: iter-affine-map
  • Improved Vulkan backend
  • CUDA graph support in TVM runtime

In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

  • Relay compilation engine
  • Relay pattern language
  • CI and build process
  • Refactoring documentation and tutorials
  • Stabilizing AutoScheduler
  • Stabilizing TVMC command line driver interface
  • Stabilizing target system
  • Frontend coverage, quantization, dynamic shape, training

Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

Accepted RFCs

The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted by the community since then:

  • [RFC-0005] Meta schedule (AutoTIR)
  • [RFC-0006] Automatic mixed-precision pass and support
  • [RFC-0007] Parametrized unit tests
  • [RFC-0008] MicroTVM Project API
  • [RFC-0009] Unified static memory planner
  • [RFC-0010] Target-registered compiler flow customisation
  • [RFC-0011] Arm® Ethos-U integration
  • [RFC-0014] Pipeline executor
  • [RFC-0015] Use CMSIS-NN with TVM
  • [RFC-0019] Add PaddlePaddle frontend
  • [RFC-0020] Extend metadata in project option
  • [RFC-0022] TIR non-scalar constants
  • [RFC-0023] Adding annotation field to tir.allocate nodes
  • [RFC-0025] PyTorchTVM
  • [RFC-0027] Formalize TVM documentation organization
  • [RFC-0028] Command line composition from internal registry
  • [RFC-0029] Migrating target attributes to IRModule
  • [RFC-0030] Command line configuration files
  • [RFC-0031] C Device API
  • [RFC-0036] TVMScript namespace
  • [RFC-0041] Update TVMScript block syntax

Features and Improvements

TE, TIR, TVMScript

AutoTVM, AutoScheduler, Meta Schedule

Operator Coverage

Apache TVM (incubating) v0.7.0

02 Oct 18:30
728b829

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Introduction

v0.7 brings many major features. The community worked together to refactor the internal code base into a unified IR code structure with a unified IRModule, type system, and pass infrastructure. We have also brought many exciting new features; some highlights include:

  • Initial automatic scheduling support
  • Initial command line driver interface
  • WebGPU and WebAssembly support
  • Better first-class Rust support in the codebase
  • Initial Hexagon support
  • Bring your own codegen (BYOC) support

The community also continues to bring high quality improvements to the existing modules including, but not limited to: better frontend coverage, performance, quantization, uTVM and dynamic shape support.

New Features

Automatic Scheduling (Experimental)

  • Phase 0: Ansor minimum system for auto schedule generating #5962
  • Phase 1: Access Analyzer #6103
  • Phase 1: Add follow_split and follow_fused_split steps #6142
  • Phase 1: Add pragma/storage_align/rfactor steps #6141
  • Phase 1: Add RPC Runner #6077
  • Phase 1: Add annotation/compute_at/compute_root/compute_inline steps #6073
  • Phase 1: Add cache_read/cache_write steps #6107
  • Phase 1: Rename namespace from auto_schedule to auto_scheduler #6059
  • Phase 1: The base class for cost models #6187
  • Phase 1: feature extraction for cost models #6190
  • Phase 1: XGBoost Cost Model #6270
  • Phase 2: Basic GPU Sketch Search Policy #6269
  • Phase 2: Evolutionary Search #6310
  • Phase 2: Update heavy operations with parallel_for #6348
  • Parallel the InitPopulation (#6512)
  • Tutorial: Using the template-free auto-scheduler on CPU (#6488)

BYOC

  • External codegen support in Relay (#4482),(#4544)
  • Bring Your Own Codegen Guide -- Part 1 #4602
  • Bring Your Own Codegen Guide -- Part 2 #4718
  • Relay annotation and partitioning for external compilers #4570
  • JSON Runtime with DNNL End-to-End Flow #5919
  • Handle one symbol for each runtime #5989
  • Run accelerator specific optimizations #6068
  • Arm Compute Library integration #5915
  • Retire the example json runtime #6177
  • json_node.h should include data_type.h #6224
  • Improve installation tutorial #6170
  • Add support for dense (fully connected) layer #6254
  • Introduce the Ethos-N BYOC integration #6222
  • Enable remote device via environment variables #6279
  • Improved pooling support #6248
  • Add support for quantized convolution #6335
  • CoreML codegen #5634

Operator Coverage

  • Add strided_set operation (#4303)
  • Add support for conv3d (#4400), pool3d (#4478), 3d upsampling ops (#4584)
  • Add group convolution for VTA (#4421)
  • Add 1d deconvolution op (#4476)
  • Allow batch matmul to be fused into injective ops (#4537)
  • Add native depthtospace and spacetodepth operators (#4566)
  • Add CUDNN conv3d support (#4418)
  • Dilation2D operator support #5033
  • Isfinite operator #4981
  • Unravel Index operator #5082
  • Add thrust support for nms #5116
  • Resize3d, Upsample3d op support #5633
  • Add operator Correlation #5628
  • affine_grid and grid_sample #5657
  • Sparse to dense operator #5447
  • Conv3d_transpose op support added #5737
  • add op crop_and_resize #4417
  • Add bitwise ops #4815
  • support dynamic NMS(Non Maximum Suppression), symbolic begin, end, and strides for strided_slice #4312
  • ReverseSequence operator #5495
  • Conv1D #4639
  • 1D Pooling #4663

Quantization

  • Channel wise quantization - Quantize & Requantize #4629
  • Support QNN ops. #5066
  • Adding support for QNN subtract op #5153
  • TFLite QNN Tutorial #5595
  • Tutorial: Deploy Quantized Model on CUDA #4667
  • Support asymmetric per-layer quantized operators #6109

Relay

  • Add convertlayout pass in Relay (#4335, #4600)
  • Added Merge Composite pass #4771
  • Call graph for relay #4922
  • Add inline pass #4927
  • Target annotation for external codegen #4933
  • GradientCell Relay Pass #5039
  • Add MergeCompilerRegions pass #5134
  • Non-recursive Graph Visitor and Rewriter (#4886)
  • [Blocksparse] Pipeline for lowering dense model to sparse-dense (#5377)
  • Relay op strategy #4644
  • Static Tensor Array (#5103)
  • Memory planner (part 1) #5144
  • ONNX codegen #5052
  • Add Parser 2.0 #5932, part 2 #6162
  • Basic block normal form #6152
  • Convert Layout pass. #4664
  • Pattern Language, Matcher, Rewriter, and Function Partitioner #5231

Runtime and Backend

  • Add ADTObject POD container type (#4346)
  • TFLite RPC runtime (#4439)
  • Standardized graph runtime export (#4532)
  • MISRA-C compliant TVM runtime #3934
  • Add String container #4628
  • Introduce Virtual Memory Allocator to CRT (#5124)
  • Initial implementation of Hexagon runtime support (#5252)
  • FastRPC interface for Hexagon runtime (#5353)
  • CoreML Runtime (#5283)
  • AutoTVM + uTVM for Cortex-M7 (#5417)
  • Windows Support for cpp_rpc (#4857)
  • Implement TVMDSOOp(TensorFlow custom op) for TVM runtime (#4459)
  • WebGPU support #5545
  • TVM WebAssembly JS Runtime #5506
  • Hexagon driver for offloading kernels to simulator #5492
  • Introduce runtime::Array #5585
  • Allow non-nullable ObjectRef, introduce Optional. (#5314)
  • Introduce static slots for common objects. (#5423)
  • Introduce RValue reference(move) support to TypedPackedFunc (#5271)
  • Introduce MetadataModule to separate code compilation/interpretation and weight initialization #5770
  • Support module based interface runtime #5753
  • Add TVM application extension with WASM runtime #5892
  • Provide a guide for users who have difficulty registering SEqualReduce (#5300)

Rust Support

  • Revive the Rust + SGX refactor #4976
  • Improve Rust bindings: Map, Array, String, various IR nodes #6339
  • Rust Refactor Stage 4: Rewrite Rust graph runtime to use new APIs #5830
  • Second stage of Rust Refactor #5527
  • tvm crate stage 3 of Rust refactor #5769
  • Add first stage of updating and rewriting Rust bindings. #5526

TIR

  • Introduce StructuralHash for the Unified IR. #5160
  • Introduce StructuralEqual Infra for the unified IR. #5154
  • Introduce ExprDeepEqual, Remove IRDeepCompare #5206
  • [TIR] Introduce BufferLoad/Store (#5205)
  • Improved massive build times caused by tir.floormod and tir.floordiv. Fixed Topi testcase. #5666
  • Buffer logger assert removed #6147
  • Enhance VerifyGPUCode #6194
  • HoistIfThenElse added #6066
  • Hybrid Script Support for TIR #6227
  • Migrate Low-level Passes to Pass Manager #5198
  • Block scope hoisting added #6238

TE

  • reverse-mode autodiff without any optimization #5121
  • Tensor Expression Debug Display (TEDD) #4651
  • Optimize and eliminate the Jacobian tensor for te.autodiff #6078

TVMC(Experimental)

  • TVMC - A command line driver for TVM (Part 1) #6112
  • TVMC - Linting error on onnx command line driver frontend #6536
  • TVMC - Command line driver 'compile' (part 2/4) #6302
  • TVMC - Introduce 'tune' subcommand (part 3/4) #6537
  • TVMC - Introduce 'run' subcommand (part 4/4) #6578
  • TVMC - Getting started tutorial for TVMC #6597

Feature Improvement

Accelerator and Microcontroller Support

  • Cleanup legacy verilog code (#4576)
  • uTVM support for ARM STM32F746XX boards (#4274)
  • Add --runtime=c, remove micro_dev target, enable LLVM backend #6145

Arithmetic Analysis

  • Linear system and equation solver (#5171)
  • Inequalities solver #5618
  • Improve IntervalSet's floormod (#5367)
  • Remove legacy const pattern functions (#5387)
  • Handle likely in IRMutatorWithAnalyzer #5665
  • ExtendedEuclidean merge impl to int_operator #5625
  • Rewrite simplify fix for Vectorized Cooperative Fetching #5924

AutoTVM and Graph Tuner

  • Adding ROCM schedules for TOPI (#4507)
  • NHWC conv2d schedule templates for ARM (#3859)
  • Use VM compile to extract autotvm tasks #4328
  • Download fallback schedule file if it does not exist #4671
  • Ignore error when removing tmpdir #4781
  • Fix a bug in generating the search space #4779
  • Minor bug fixes in AutoTVM for QNN graphs #4797
  • Fix autotvm customized template #5034
  • Add opt out operator for has_multiple_inputs for graph tuner #5000
  • Customize SI prefix in logging (#5411)
  • Update XGBoost verbosity option #5649
  • Support range in index based tuners #4870
  • Enable random fill and CPU cache flush for AutoTVM and Ansor (#6391)
  • Auto-scheduler tutorial for GPU and necessary refactor/fix (#6512)

BYOC

  • [BYOC] Bind constant tuples in graph partitioner (#5476)
  • [BYOC] Add support for composite functions in BYOC (#5261)
  • [BYOC] Register pattern tables from external codegens (#5262)
  • [BYOC] Enhance partitioning and external codegen (#5310)
  • [BYOC] Refine AnnotateTarget and MergeCompilerRegion Passes (#5277)
  • [BYOC] Use Non-Recursive Visitor/Mutator (#5410)
  • [BYOC] Refine DNNL Codegen (#5288)
  • [BYOC] Add example of Composite + Annotate for DNNL fused op (#5272)
  • [BYOC] Prevent duplicate outputs in subgraph Tuple (#5320)
  • [BYOC] Introduce further operator support (#6355)
  • [BYOC] Support input nodes with multiple entries (#6368)
  • [BYOC] Add maximum support for float32 (#6506)

Codegen

  • Intrinsic dispatching with OCML instead of LLVM for ROCm (#4499)
  • Make target codegen take IRModule and PrimFunc. #5107
  • Enhance CUDA codegen for SelectNode #4983
  • Vectorization for intrinsics #5101
  • [LLVM] Do not...

Apache TVM (incubating) v0.6.1

10 Jul 19:29
0d0d515

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Apache TVM (incubating) 0.6.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache TVM (incubating) 0.6.0 are advised to upgrade. Please review the following release notes to learn about the bug fixes.

Bug Fixes

  • Fixed process termination routine in windows #4844
  • [Runtime] Fix NDArray SaveDLTensor declaration and implementation signature different #4586
  • [NODE][Serialization]fix serialization precision loss in float #4503
  • [Relay][Frontend][TF] fix _parse_param bug #4711
  • Fix bias_add gradient #4516
  • Make sure to visit the arguments of inlined functions #4783
  • Fix Python syntax error in start_rpc_server_to_tracker.py #4682
  • [Bugfix] Fixed crash caused by reversing bitwise operations #4852
  • [Fix][VM] Fix copy constructor #5237
  • fix small bug about dense_grad #5695
  • [Fix] Fix conv2d alter op for arm cpu #5532
  • [Fix] Fix dense x86 schedule #4728
  • [Relay][Fix] Fix alter op layout when calling a global var #4454
  • [Relay][Pass] Fix lambda lift pass for recursive call #4432
  • [BUGFIX] Fix search path for libtvm_topi.so #4467
  • [Bugfix] Fix Python debugger segfaults with TVM built with LLVM #5685
  • [RUNTIME] Fix compile errors of OpenCL FPGA backend #4492
  • [BUGFIX][BACKPORT-0.6][ARITH] Fix FloorMod Simplifier #5509
  • Some Windows and MSVC fixes #4569
  • [Chisel][VTA] Fix multiple transfer issue in LoadUop module #4442
  • [VTA] Fix an issue in updating uop_idx in the TensorGemm module #4694
  • [VTA] Fixed a crash issue in TSIM driver #4527
  • [VTA] Enable streamlined GEMM execution #4392
  • [VTA][Chisel] End-to-end Inference with Chisel VTA #4574
  • Added declare of aluBits for TensorAlu #4624
  • [Quantization] Fix annotation for multiply op #4458
  • LRN only supports 4D tensors, remove it from alter_op_layout #5520
  • fix topi.nn.global_pool layout="NHWC" #4656
  • [FFI][Windows] Fix hasattr by extracting Python error type from Windows error message #4780
  • [Runtime] Export GraphRuntime in tvm_runtime.dll #5002
  • Fix Base64OutStream portability issue #4668
  • [AUTOTVM] Fix a bug in generating the search space #4779
  • [Relay][VM] Fix compilation of If-Elses #5040
  • [RELAY][FRONTEND][TENSORFLOW] Fix FuseBatchNorm output cast error if need_cast is True #4894
  • [Bugfix] fskip of EliminateCommonSubexpr cannot always return false #4620
  • [Fix] Add ConstantNode to IsAtomic #5457
  • [Fix] Fix RemoveUnusedFunctions pass #4700
  • [Relay][fix] Fix alpha_equal bug for attribute check #4897
  • [Arith] keep div_mode during floordiv simplify #5922
  • [ARITH][BACKPORT-0.6] fix a min/max simplify bug #5761
  • [0.6-BACKPORT] Improve robustness of the docs build #5583

Apache TVM (incubating) v0.6.0

05 Dec 06:47

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

New Features

Relay in Production

Relay is a functional, differentiable programming language designed to be an expressive intermediate representation for machine learning systems. Relay supports algebraic data types, closures, control flow, and recursion, allowing it to directly represent more complex models than computation graph-based IRs (e.g., NNVM) can. In TVM v0.6, Relay is in a stable phase and is ready for production.

  • Algebraic Data Types (ADT) support (#2442, #2575). ADT provides an expressive, efficient, and safe way to realize recursive computation (e.g., RNN). Refer to https://docs.tvm.ai/langref/relay_adt.html for more information.
  • Pass manager for Relay (#2546, #3226, #3234, #3191)
  • Most frameworks have been supported in Relay, including ONNX, Keras, Tensorflow, Caffe2, CoreML, NNVMv1, MXNet (#2246).
  • Explicitly manifest memory and tensor allocations in Relay. (#3560)
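
As a rough illustration of why algebraic data types suit recursive computation, here is a small Python sketch. It is not Relay syntax and uses no TVM API; the `Nil` and `Cons` classes and the `from_python`, `foldl`, and `rnn_like_step` helpers are hypothetical names. They mimic a list ADT and a recursive fold, the same shape of computation as applying a recurrent cell over a sequence.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Nil:
    """Empty-list constructor."""


@dataclass(frozen=True)
class Cons:
    """Non-empty-list constructor: a head value and the rest of the list."""
    head: float
    tail: "ListADT"


ListADT = Union[Nil, Cons]


def from_python(values) -> ListADT:
    """Build the ADT list from an ordinary Python sequence."""
    result: ListADT = Nil()
    for v in reversed(list(values)):
        result = Cons(v, result)
    return result


def foldl(step: Callable[[float, float], float], state: float, xs: ListADT) -> float:
    """Recursively thread a state through the list, one constructor at a time."""
    if isinstance(xs, Nil):
        return state
    return foldl(step, step(state, xs.head), xs.tail)


def rnn_like_step(state: float, x: float) -> float:
    """Toy recurrent update standing in for an RNN cell."""
    return 0.5 * state + x
```

For example, `foldl(rnn_like_step, 0.0, from_python([1.0, 2.0, 3.0]))` threads the state through three elements without the sequence length being fixed ahead of time.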

Relay Virtual Machine

The Relay Virtual Machine (Relay VM) is a new generation of runtime that strikes a balance between performance and flexibility when deploying and executing Relay programs. Previously, the graph runtime was able to use the fully static nature of the input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. Once we introduce models that use control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.

Relay VM is now usable and is able to achieve decent performance for a variety of models and targets.

  • Design (#2810 #2915) and a first version of implementation (#2889),
  • Add VM runtime for Relay and compiler support (#3120, #3121, #2889, #3139)
  • Relay VM (pattern matching #3470, port to python #3391, serialization #3647)
  • Relay VM Profiler (#3727)
  • Support execution on devices for Relay VM (#3678)
  • [Relay][VM] Add more passes to VMCompiler (#4058)
  • [relay][vm] Separate VM runtime with executable (#4100)
  • Port VM, VM compiler, and Object into Python (#3391)
  • VM: Add AllocTensor instruction and better instruction printer (#3306)
  • [Relay][VM][Interpreter] Enable first-class constructors in VM and interpreter via eta expansion. (#4218)
  • [Relay][VM] Clean up the VM and VM profiler code (#4391)
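
The contrast between a static graph and a virtual machine can be sketched with a toy interpreter. This is an illustration only: the opcodes `add`, `jump_if_zero`, `jump`, and `ret`, and the functions `run_vm` and `sum_to`, are hypothetical and are not the Relay VM instruction set. The point is that the loop trip count depends on a runtime value, which a fully static execution plan cannot express.

```python
def run_vm(program, registers):
    """Interpret (opcode, *operands) tuples over a dict of named registers."""
    pc = 0  # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "add":  # dst = a + b
            dst, a, b = args
            registers[dst] = registers[a] + registers[b]
        elif op == "jump_if_zero":  # branch on a runtime value
            reg, target = args
            if registers[reg] == 0:
                pc = target
                continue
        elif op == "jump":  # unconditional branch
            (pc,) = args
            continue
        elif op == "ret":
            return registers[args[0]]
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return None


# Sum n + (n - 1) + ... + 1; the trip count depends on the runtime value of n.
SUM_TO_N = [
    ("jump_if_zero", "n", 4),
    ("add", "acc", "acc", "n"),
    ("add", "n", "n", "minus_one"),
    ("jump", 0),
    ("ret", "acc"),
]


def sum_to(n):
    """Run the toy program for a non-negative integer n."""
    return run_vm(SUM_TO_N, {"n": n, "acc": 0, "minus_one": -1})
```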

Training

Relay is designed to natively support first-order and higher-order differentiation. The automatic differentiation infrastructure is now usable, and a number of operators with gradient support are available in the v0.6 release.

  • Higher order reverse mode automatic differentiation that works with control flow (#2496)
  • Higher order continuation passing style (#3456, #3485)
  • Relay gradient registration (clip #3509, max_pool2d and avg_pool2d #3601)
  • Relay AD algorithm (#3585)
  • Relay Training - allow gradient to return a tuple (#3600), numerical gradient check (#3630)
  • Improve AD for concatenate (#3729)
  • [Relay][Training] Add missing gradient check to gradient pass (#4169)
  • As a part of Relay's automatic differentiation system, we are adding primal gradients for Relay operators. Please refer to #2562 for tracking the progress.
  • Gradient for Conv2d (#3636)
  • Add gradient operators (#3857, #3894, #3901, #3915)
  • Add gradient for log-softmax (#4069)
  • [Relay][Training] Add gradient for Crossentropy (#3925)
  • [Relay][Training] Add and fix gradients (#4126)
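
For readers unfamiliar with reverse-mode automatic differentiation, the following is a minimal first-order sketch in plain Python. It is not Relay's AD algorithm and calls no TVM API; the `Var` class and `backward` function are hypothetical. Each operation records its local gradients, which plays the role of the per-operator gradient registrations listed above, and `backward` applies the chain rule in reverse topological order.

```python
class Var:
    """A scalar value that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every reachable node."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS: inputs are appended before the nodes that use them.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back toward the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad
```

With `x = Var(3.0)`, `y = Var(4.0)`, and `z = x * y + x`, calling `backward(z)` leaves the partial derivatives of `z` in `x.grad` and `y.grad`.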

Quantization

Low-bit inference is becoming more popular because it benefits both performance and storage usage. TVM now supports two types of quantization: (1) automatic quantization takes a floating-point model, performs per-layer calibration, and generates a low-bit model; (2) TVM also imports pre-quantized models from TensorFlow and MXNet, with a new dialect, QNN, introduced to handle further lowering to normal operators.

  • Automatic Quantization
    • Low-bit automatic quantization supported. (#2116). The workflow includes annotation, calibration and transformation.
    • Refactor quantization codebase and fix model accuracy. (#3543)
    • KL-divergence-based per-layer calibration. (#3538)
    • Add option to select which convolution layers are quantized. (#3173)
    • [Relay][Quantize] Integrate data-aware calibration into quantization. (#4295)
  • Pre-quantized model support (QNN operators and legalize pass).
    • Add a legalize pass to Relay (#3672)
    • Qnn Concatenate, quantize, dequantize and requantize operators (#3819, #3730, #3745, #3531)
    • QNNtoRelay & QNNLegalize Pass utility (#3838, #3782)
    • Requantize: Optimize lowering for some corner cases. (#3864)
    • New quantized operator support: conv2d, add, dense (#3580, #3736, #3896, #3910)
    • Do type checking for the input and kernel in the qnn conv2d (#3904)
    • Legalize and AlterOpLayout for Intel int8. (#3961)
    • Renaming tests to follow the Relay nomenclature. (#3975)
    • Fix padding changes due to #3739 (#3989)
    • Memorizing quantize node mapping to avoid duplicated simulated quantization (#3233)
    • Infrastructure to support pre-quantized models (QNN) (#3971).
    • [Relay][AlterOp] NHWC to NCHWc support for Pool, concatenate, sum. (#4059)
    • [TOPI][x86] Cascade lake support. (#4123)
    • [TOPI][x86] Legalize - Support int8xint8 convolution to use VNNI inst (#4196)
    • Qnn dequantize with min max using Mxnet flavor to support Mxnet prequantized models. (#3945)
    • Improve the lowering of Qnn Dense (#4213)
    • Adding support for dequantizing from int32 to float32. (#4130)
    • [QNN] Refactor fixed point multiplicat...
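
The quantize, dequantize, and requantize operators mentioned above can be illustrated with a small symmetric int8 sketch in plain Python. This is an assumption-laden illustration, not TVM's implementation: the function names are hypothetical, calibration here is a simple max-abs rule rather than KL divergence, and requantization goes through floating point rather than fixed-point multiplication.

```python
def calibrate_scale(values, num_bits=8):
    """Pick a symmetric scale so the largest magnitude maps to the int range edge."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers: round(v / scale), clamped to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


def dequantize(qvalues, scale):
    """Map integers back to approximate floats."""
    return [q * scale for q in qvalues]


def requantize(qvalues, in_scale, out_scale, num_bits=8):
    """Re-express quantized values under a different scale (via floats here)."""
    return quantize(dequantize(qvalues, in_scale), out_scale, num_bits)
```

A typical flow is to call `calibrate_scale` on sample activations, `quantize` with the resulting scale, and `requantize` when two layers use different scales.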

v0.5-pre-apache-incubation

18 Feb 22:49

NOTE: This is a release pre apache incubation

This release features several major improvements. Some of the highlights are: Arbitrary bits quantization algorithm; High-level auto-differentiable programming IR--Relay(NNVMv2).

The community welcomes new reviewers @nishi-t @were @siju-samuel @jroesch @xqdan @zhiics @grwlf @ajtulloch @vinx13 @junrushao1994 @FrozenGene @liangfu, and new committers @srkreddy1238 @eqy @masahi @nhynes @phisiart @merrymercy @Laurawly @adityaatluri @Huyuwei

Change List

  • Fully featured 8-bit network support
    • 8bit quantizer
    • Arbitrary bits quantization algorithm
    • Intel cpu support
  • NVidia GPU 8-bit kernel
    • int8 gemm recipe
    • int8 conv2d
    • Autotvm integration
  • Automated tuning and scheduling
    • AutoTVM optimizations for mobile GPUs
    • AutoTVM optimizations for CUDA
    • AutoTVM optimizations for x86
  • Initial release of the differentiable programming IR, Relay
    • Generic & informative Relay error reporting #2408
    • Relay IR text format support #1781
    • Support control flows
    • A Normal Form Canonicalization #2251
    • Type system support
    • End to end compilation
    • FoldScaleAxis #2020
    • SimplifyInference #2033
    • CombineParallelConv2D #2089
    • InstrumentBoundCheckers pass #2079
    • Bind & FoldConstant #2100
    • Alter Op Layout #2150
    • General OpFusion #2090
  • CodeGen
    • Gcc / g++ compatible C code generator for TVM #2161
    • Device type annotation for heterogeneous compilation #2361
    • Cache packed func ptr, lift alloca #2070
    • Generalize compute to tensor region #1476
  • Runtime
    • Relay interpreter and compiler #1954
    • Heterogeneous runtime #1695
    • Language bindings: Golang runtime #1470, Rust runtime #1597
    • Add min_repeat_ms to time_evaluator #2200
    • Bundled interpreter demonstration #2297
    • Enable PlanMemory in the graph runtime #2120
  • Language Binding
  • VTA
    • Improved RPC for VTA #2043
  • Hybrid python programming model
    • Support for scheduling #2416
    • Support for Inter-function call #2287
    • Backend support #2477
  • TOPI
    • Initial support for sparse tensor computation
    • Improve ARM CPU depthwise convolution performance #2345
    • Port winograd ops to relay #2356
  • Tutorials and docs
    • Relay language docs #2232
    • Tutorials on how to use SGX backend
    • How to write a pass in python
    • General lowering flow of TVM
    • How to do tensorize
    • TFLite frontend tutorial #2508
    • Keras seq2seq model for translation tutorial #1815
    • Committer guide and tips #2468
    • Code review guideline on API designs #2459

Contributors

Code reviewers

Code contributions

v0.4-pre-apache-incubation

03 Sep 19:25

NOTE: This is a release pre apache incubation

This release features several major improvements. The high-level graph optimizer is now part of TVM repo. Some of the highlights are: Initial support of AutoTVM for automated optimization; customized accelerator backend VTA. Please also check out tvm.ai for latest blogposts.

The community welcomes new reviewers @kazum @alex-weaver @masahi @zhreshold @PariksheetPinjari909 @srkreddy1238 @eqy, new code owner @merrymercy, and new committer @yzhliu

Change List

Tensor Expression and Optimization

  • Tensor operator primitives
    • Introduce attrs field to operator primitives(e.g. compute) to store additional metadata, the attrs can be used as hint for scheduling
  • Enable embedding of asm micro-kernels
  • Hybrid python programming model
    • python AST based IR builder interface
    • support GPU programs
  • AutoTVM, Automated tuning, and scheduling
    • basic autotvm infra
    • GPU IR verifier
    • basic autotuning tutorial
    • topi integration
  • ARM support
    • winograd support
    • initial support of ARM autotuning records
  • TOPI Vision
    • Generic GPU sort support(useful for vision)
    • SSD operator support
  • TOPI numpy consistency
    • Rename all binary operators for numpy consistency: broadcast_add -> add, broadcast_sub -> subtract, broadcast_mul -> multiply, broadcast_div -> divide
    • New operators: slice, LRN, equal, not_equal, less, greater
    • tutorials on topi
  • Initial low-bit operator support
    • Optimized popcount generation on ARM
    • general bit-serial convolution and GEMM
    • optimized low bit kernels
    • parallel optimization
  • New topi backend optimization for intel graphics
  • Adapt AVX schedules for SSE target

Backend

  • VTA: customized accelerator backend
    • custom hardware backend example
    • tutorials on how to use customized accelerator
  • Initial experimental support for HLS backend
  • Bugfix in SPIRV code generator for vulkan
  • libdevice support, enable NVPTX backend

Runtime

  • Introduce NDArrayContainer for managed NDArray
  • RPC and Device API
    • Support communication between big/small endian machines.
    • RPC and device API protocol upgrade to support big/small endian communication. This is a non-backward-compatible change: the latest version of the TVM runtime must be used with the RPC.
    • graduate rpc from contrib, tvm.contrib.rpc->tvm.rpc
    • Support tracker in Android RPC, add fault tolerance for AutoTVM
  • BIG.LITTLE aware threadpool
  • tvm4j graph runtime that runs end to end workload in java
  • DLPack support
    • Support from_dlpack and to_dlpack
    • Enables bridges to pytorch
  • Enable link of stackvm in runtime
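
The endianness item above comes down to fixing one byte order on the wire, so that hosts with different native orders read the same values. The following is a minimal sketch of that idea using Python's standard struct module. The header layout (a 4-byte code followed by an 8-byte payload length, little-endian) and the function names are assumptions made for illustration, not the actual TVM RPC protocol.

```python
import struct

# Hypothetical header: uint32 message code + uint64 payload length.
# "<" forces little-endian regardless of the host's native byte order.
HEADER_FORMAT = "<IQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def encode_message(code, payload):
    """Serialize a message with a fixed-endianness header."""
    return struct.pack(HEADER_FORMAT, code, len(payload)) + payload


def decode_message(data):
    """Parse a message produced by encode_message on any host."""
    code, length = struct.unpack_from(HEADER_FORMAT, data, 0)
    payload = data[HEADER_SIZE:HEADER_SIZE + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return code, payload
```

Because both sides agree on the explicit "<" prefix, neither needs to know the other's native byte order. Changing such a header is also why a protocol upgrade of this kind breaks compatibility with older runtimes.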

NNVM

  • Tensorflow graphdef frontend
  • Keras frontend
    • Improved to support reused layers; added activations
  • ONNX
    • gather, LRN
  • CoreML frontend
    • Support C-RNN and activation functions
  • Fix grads for sum and expand_like
  • Enhanced operator fusion for multiple elemwise branches
  • Separate nnvm fusion and compilation pass

Misc

  • Unified build system to cmake, customizable cmake path for vulkan, rocm, cuda

Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code reviewers

Compiler

TOPI, graph optimization

Frontends

Deploy

  • @eqy rpc, thread runtime
  • @dayanandasiet android tutorials

v0.3-pre-apache-incubation

03 Sep 19:21

NOTE: This release predates Apache incubation.

This release features numerous improvements in TOPI and backends. We take the first step toward object detection support in TOPI, featuring the operators necessary for YOLO and SSD. TOPI now supports a numpy-style API and operator overloading. RPC is significantly improved to support resource allocation and the use of a pool of devices. We are adding two new backends: WebGL for running on GPUs in the browser, and Vulkan for running on the next-generation graphics API. Please also check out the TVM blogs for the latest posts.

Change List

  • TOPI Vision operators
    • SSD support
    • YOLO support
    • NMS operator support in vision
  • TOPI general numpy-style operators
    • numpy style operator overload in topi
    • more operators: flip, take
    • dilation support on conv2d and depthwise
  • 8bit support
    • ARM 8bit gemm
    • ARM 8bit conv
  • Low bit operator support
    • popcount intrinsics
    • 1-bit fully connected
  • Contrib: MPSDNN fully-connected and conv2d support
  • Better RPC support
    • RPC Tracker support to allow centralized resource management
    • RPC protocol upgrade to support timeout in the proxy
      • This is a breaking (non-backward-compatible) change: the latest version of the TVM runtime must be used with the RPC
    • Fault-tolerant to early server termination with correct exception propagated
    • RPC support enabled for ROCm AMDGPUs
  • Tutorials and docs
    • How to deploy to android devices.
  • Optimizations for hardware backends
    • intel CPU (AVX and AVX512)
  • Schedule Primitives
    • rfactor now supports factor_axis to specify the factored dimension in the result
    • cache_write now supports multiple output operators
    • enable warp memory which generates shuffle instructions
  • Framework bridge
    • MXNet bridge supported
  • C++ compiler API support
    • build migration
    • topi migration to c++
    • Target system in c++
  • WebGL backend
    • runtime and codegen
    • topi integration
    • end to end pipeline on the browser
  • Vulkan backend
    • vulkan runtime
    • spirv code generator
  • Security
    • intel SGX runtime support
    • multi-threaded SGX runtime
  • LLVM 7.0 support
  • Robustness
    • VerifyMemory to detect incorrect GPU schedules that write into GPU memory from the CPU
    • Verify compute formulas
  • Better CPU parallel runtime
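
As an illustration of the rfactor item in the list above, the sketch below imitates in plain Python what factoring a reduction means for a row sum B[i] = sum over k of A[i][k]. The reduction axis is split into (ko, ki), partial sums over ko are stored in an intermediate tensor, and a second reduction over ki produces the result. The factor_axis argument chooses where the ki dimension sits in the intermediate. The function name and the restriction to factor_axis 0 or 1 are assumptions for this example; it is not the TVM schedule API.

```python
def rfactor_rowsum(A, factor, factor_axis=0):
    """Return (intermediate, result) for a row sum factored over ki.

    intermediate has shape (factor, n) when factor_axis == 0
    and shape (n, factor) when factor_axis == 1.
    """
    n, m = len(A), len(A[0])
    if m % factor != 0:
        raise ValueError("factor must divide the reduction extent")
    if factor_axis not in (0, 1):
        raise ValueError("factor_axis must be 0 or 1 in this sketch")

    # partial[i][ki] = sum over ko of A[i][ko * factor + ki]
    partial = [
        [sum(A[i][ko * factor + ki] for ko in range(m // factor))
         for ki in range(factor)]
        for i in range(n)
    ]

    if factor_axis == 0:
        # Put the factored dimension first: intermediate[ki][i]
        intermediate = [[partial[i][ki] for i in range(n)]
                        for ki in range(factor)]
    else:
        # Put the factored dimension last: intermediate[i][ki]
        intermediate = partial

    # Final reduction over the factored axis.
    result = [sum(partial[i]) for i in range(n)]
    return intermediate, result
```

The final result is the same for either layout. Only the shape of the intermediate tensor changes, which matters when that tensor is later bound to threads or cached.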

Main Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code Reviewers

TOPI:

Compiler:

v0.2-pre-apache-incubation

31 Jan 20:00
9e67577

NOTE: This release predates Apache incubation.

This release comes with a complete set of TOPI support for the NNVM compiler, which allows compilation of end-to-end workloads. We also make major improvements in supporting new backends: ROCm for AMD GPUs and ARM GPUs. Check out the previous blogs that describe these major improvements in detail!

  • Backend support
    • Support LLVM mainline(4.0, 5.0, 6.0)
    • Support ROCM stack for AMD GPUs
    • More robust OpenCL support for ARM GPUs
  • Android RPC runtime
  • Multi-threading optimization for ARM
    • multi-threaded depthwise
    • multi-threaded conv2d
  • New schedule primitives
    • storage_align for shared memory alignment
    • double_buffer
  • UnrollLoop: a more robust version of loop unrolling that counts the maximum number of steps that can be unrolled
  • Full set of TOPI operators
    • Introduce tvm.target to specify target options for compilation better.
    • broadcast/ reduction operators
    • pooling and global pooling
    • Generic target support for topi
    • schedule with external libraries
  • End to end deep learning pipelines for CPU, GPU, ARM GPU
  • Tutorials
    • How to load compiled module in any language runtime
    • How to use java runtime
  • Contrib library: MIOpen, CuDNN
  • Ongoing items that contain functioning pieces
    • WebGL backend
    • C++ compiler support
    • MPS DNN
    • low bit support, introduced popcount
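
The storage_align primitive listed above pads a buffer's stride so that it satisfies an alignment constraint, which is commonly used to avoid shared-memory bank conflicts. A minimal sketch of the arithmetic, assuming the constraint is stride == k * factor + offset for some integer k, follows. The helper name aligned_stride is hypothetical and is not part of TVM.

```python
def aligned_stride(extent, factor, offset):
    """Smallest stride >= extent with stride % factor == offset."""
    if factor <= 0 or not 0 <= offset < factor:
        raise ValueError("need factor > 0 and 0 <= offset < factor")
    remainder = (extent - offset) % factor
    if remainder == 0:
        return extent
    return extent + (factor - remainder)
```

For a row of 32 elements with factor 32 and offset 8, the padded stride is 40, so consecutive rows start in different banks rather than the same one.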