Version v0.1.8 Released Today!
What's Changed
Hotfix
- [hotfix] torchvision fx unittests missing pytest import (#1277) by Jiarui Fang
- [hotfix] fix an assertion bug in base schedule (#1250) by YuliangLiu0306
- [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
- [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
- [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
- [hotfix] fixed p2p process send getting stuck (#1181) by YuliangLiu0306
- [hotfix] different overflow statuses lead to communication getting stuck (#1175) by YuliangLiu0306
- [hotfix] fix some bugs caused by refactored schedule (#1148) by YuliangLiu0306
Tensor
- [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
- [tensor] redistribute among different process groups (#1247) by Jiarui Fang
- [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
- [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
- [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
- [tensor] fix some unittests (#1234) by Jiarui Fang
- [tensor] fix an assertion in colo_tensor cross_entropy (#1232) by HELSON
- [tensor] add unit test for colo_tensor 1DTP cross_entropy (#1230) by HELSON
- [tensor] torch function return colotensor (#1229) by Jiarui Fang
- [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
- [tensor] sharded global process group (#1219) by Jiarui Fang
- [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
- [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
- [tensor] revert local view back (#1178) by Jiarui Fang
- [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
- [Tensor] rename parallel_action (#1174) by Ziyue Jiang
- [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
- [Tensor] remove ParallelAction, use ComputeSpec instead (#1166) by Jiarui Fang
- [tensor] add embedding bag op (#1156) by ver217
- [tensor] add more element-wise ops (#1155) by ver217
- [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
- [tensor] dist spec s2s uses all-to-all (#1136) by ver217
- [tensor] added repr to spec (#1147) by Frank Lee
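Several of the tensor changes above revolve around distribution specs: a tensor is either sharded across the ranks of a process group or replicated on each of them, and `redistribute` (renamed from `convert_to_dist` in #1243) converts between layouts, with shard-to-shard conversion done via an all-to-all (#1136). A minimal pure-Python sketch of the idea, with illustrative names rather than the actual ColoTensor API:

```python
# Toy model of shard/replicate specs and redistribution across a
# process group. Purely illustrative -- not the ColoTensor API.

def shard(data, world_size):
    """Split a flat list into one contiguous shard per rank."""
    n = len(data) // world_size
    return [data[r * n:(r + 1) * n] for r in range(world_size)]

def replicate(shards):
    """Gather every shard so each rank holds the full tensor."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

def redistribute(shards, new_world_size):
    """Shard-to-shard: regroup elements for a different group size
    (the real implementation exchanges slices with an all-to-all)."""
    full = [x for s in shards for x in s]
    return shard(full, new_world_size)

data = list(range(8))
shards = shard(data, 4)             # [[0, 1], [2, 3], [4, 5], [6, 7]]
reshards = redistribute(shards, 2)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
replicas = replicate(reshards)      # each "rank" now holds 0..7
```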
Fx
- [fx] added ndim property to proxy (#1253) by Frank Lee
- [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
- [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
- [fx] methods to get fx graph properties (#1246) by YuliangLiu0306
- [fx] add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
- [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
- [fx] get communication size between partitions (#1224) by YuliangLiu0306
- [fx] added patches for tracing swin transformer (#1228) by Frank Lee
- [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
- [fx] added timm model tracing testing (#1221) by Frank Lee
- [fx] added torchvision model tracing testing (#1216) by Frank Lee
- [fx] temporarily used (#1215) by XYE
- [fx] added testing for all albert variants (#1211) by Frank Lee
- [fx] added testing for all gpt variants (#1210) by Frank Lee
- [fx] add uniform policy (#1208) by YuliangLiu0306
- [fx] added testing for all bert variants (#1207) by Frank Lee
- [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
- [fx] added module patch for pooling layers (#1197) by Frank Lee
- [fx] patched conv and normalization (#1188) by Frank Lee
- [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee
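The FX entries above center on symbolic tracing: the model is executed once with proxy objects that record every operation into a graph instead of computing values, which is also why data-dependent control flow (#1185) and some library modules need patching. A toy tracer sketch that captures the mechanism; hypothetical names, not torch.fx itself:

```python
# Minimal sketch of symbolic tracing: a Proxy records each operation
# into a graph instead of computing it. Toy code, not torch.fx.

class Proxy:
    def __init__(self, name, graph):
        self.name, self.graph = name, graph

    def _record(self, op, other):
        node = f"v{len(self.graph)}"
        rhs = other.name if isinstance(other, Proxy) else repr(other)
        self.graph.append((node, op, self.name, rhs))
        return Proxy(node, self.graph)

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

def symbolic_trace(fn):
    """Run fn once on a proxy input and return the recorded graph."""
    graph = []
    fn(Proxy("x", graph))
    return graph

def model(x):
    return x * 2 + 1

graph = symbolic_trace(model)
# graph == [('v0', 'mul', 'x', '2'), ('v1', 'add', 'v0', '1')]
```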
Rename
- [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang
Checkpoint
- [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
- [checkpoint] support generalized scheduler (#1222) by Yi Zhao
- [checkpoint] make unit test faster (#1217) by Jiarui Fang
- [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang
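The sharded checkpointing work above (#1237, #1240) follows a common pattern: each rank persists only the slice of parameters/optimizer state it owns, and loading reassembles the shards. A toy sketch of that pattern, not the actual Colossal-AI checkpoint API:

```python
# Toy sketch of sharded checkpointing: each rank saves only its own
# slice of state, tagged with enough metadata to reassemble it later.
# Illustrative only -- not the actual Colossal-AI checkpoint API.

def save_sharded(state, rank, world_size):
    """Return this rank's shard of a flat state list."""
    n = (len(state) + world_size - 1) // world_size
    return {"rank": rank, "shard": state[rank * n:(rank + 1) * n]}

def load_sharded(checkpoints):
    """Reassemble the full state from per-rank checkpoint shards."""
    ordered = sorted(checkpoints, key=lambda c: c["rank"])
    return [x for c in ordered for x in c["shard"]]

state = [0.1, 0.2, 0.3, 0.4]
ckpts = [save_sharded(state, r, 2) for r in range(2)]
assert load_sharded(ckpts) == state
```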
Refactor
- [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
- [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
- [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang
Context
- [context] support arbitrary module materialization (#1193) by YuliangLiu0306
- [context] use meta tensor to init model lazily (#1187) by YuliangLiu0306
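The lazy-init changes above rely on meta tensors: the model is first built with shape-only placeholders that allocate no storage, and each module's parameters are materialized only when actually needed. A toy sketch of the idea with illustrative names, not the Colossal-AI context manager:

```python
# Toy sketch of lazy materialization: a parameter starts as a
# shape-only placeholder (a "meta tensor") and gets real storage
# only on first use. Illustrative names, not the Colossal-AI API.

class LazyParam:
    def __init__(self, shape):
        self.shape = shape      # metadata only, no storage yet
        self.data = None

    def materialize(self):
        """Allocate storage (zeros here) on first access."""
        if self.data is None:
            size = 1
            for d in self.shape:
                size *= d
            self.data = [0.0] * size
        return self.data

w = LazyParam((4, 8))
assert w.data is None           # nothing allocated at "init" time
buf = w.materialize()           # allocated only when needed
assert len(buf) == 32
```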
Ddp
- [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
- [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217
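Bucket all-reduce (#1177) amortizes communication latency: instead of launching one all-reduce per gradient tensor, gradients are packed into fixed-size buckets and each bucket is reduced in a single call. A single-process simulation of the bucketing idea, not the ColoDDP implementation:

```python
# Toy sketch of bucketed gradient all-reduce: flat gradients are
# packed into fixed-size buckets and each bucket is reduced in one
# call. Single-process simulation -- not DDP or ColoDDP itself.

def make_buckets(grads, bucket_size):
    """Greedily pack flat gradient lists into buckets."""
    buckets, cur = [], []
    for g in grads:
        cur.extend(g)
        if len(cur) >= bucket_size:
            buckets.append(cur)
            cur = []
    if cur:
        buckets.append(cur)
    return buckets

def allreduce_sum(bucket_per_rank):
    """Element-wise sum of one bucket across simulated ranks."""
    return [sum(vals) for vals in zip(*bucket_per_rank)]

# Two simulated ranks with an identical bucket layout.
rank0 = make_buckets([[1.0, 2.0], [3.0], [4.0, 5.0]], bucket_size=3)
rank1 = make_buckets([[10.0, 20.0], [30.0], [40.0, 50.0]], bucket_size=3)
reduced = [allreduce_sum(pair) for pair in zip(rank0, rank1)]
# reduced == [[11.0, 22.0, 33.0], [44.0, 55.0]]
```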
ColoTensor
- [ColoTensor] add independent process group (#1179) by Jiarui Fang
- [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
- [ColoTensor] improves init functions (#1150) by Jiarui Fang
Zero
- [zero] sharded optim supports loading local state dict (#1170) by ver217
- [zero] zero optim supports loading local state dict (#1171) by ver217
Workflow
- [workflow] polish readme and dockerfile (#1165) by Frank Lee
- [workflow] auto-publish docker image upon release (#1164) by Frank Lee
- [workflow] fixed release post workflow (#1154) by Frank Lee
- [workflow] fixed format error in yaml file (#1145) by Frank Lee
- [workflow] added workflow to auto draft the release post (#1144) by Frank Lee
Pipeline
- [pipeline] add customized policy (#1139) by YuliangLiu0306
- [pipeline] support more flexible pipeline (#1138) by YuliangLiu0306
Full Changelog: v0.1.7...v0.1.8