Table of contents
- Breaking changes: removed reinforce()
- New features
  - Unreduced losses
  - A profiler for the autograd engine
  - More functions support higher order gradients
  - New features in Optimizers
  - New layers and nn functionality
  - New Tensor functions and features
  - Other additions
- API changes
- Performance improvements
  - Big reduction in framework overhead (helps small models)
  - 4x to 256x faster Softmax/LogSoftmax
  - More…
- Framework Interoperability
  - DLPack Interoperability
  - Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
- Bug Fixes (a lot of them)
Breaking changes
Stochastic functions, i.e. Variable.reinforce(), were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice,
users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.
We introduce the torch.distributions package to replace Stochastic functions.
Your previous code typically looked like this:
```python
probs = policy_network(state)
```
This is the new equivalent code:
```python
probs = policy_network(state)
```
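Only the first line of the original snippet survives above; the rest of the pattern can be sketched with the new torch.distributions API. The toy policy network, state, and reward below are placeholders rather than part of the original example:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Placeholder policy network and state, purely for illustration.
policy_network = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=1))
state = Variable(torch.randn(1, 4))

probs = policy_network(state)
# Categorical replaces the removed stochastic multinomial node.
m = torch.distributions.Categorical(probs)
action = m.sample()
reward = 1.0  # would normally come from the environment

# REINFORCE-style surrogate loss: scale the log-probability of the action by the reward.
loss = -m.log_prob(action) * reward
loss.backward()
```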
New features
Unreduced losses
Now, some loss functions can compute per-sample losses in a mini-batch:
- By default PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting to users.
- Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch
- Example: loss = nn.CrossEntropyLoss(..., reduce=False) (a short sketch follows this list)
- Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
- More loss functions will be covered in the next release
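A short, self-contained sketch of the new option; the shapes and class count below are arbitrary, chosen only for illustration:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

input = Variable(torch.randn(4, 10))                # mini-batch of 4 samples, 10 classes
target = Variable(torch.LongTensor([1, 0, 4, 9]))

# Default behavior: a single reduced loss over the mini-batch
loss_fn = nn.CrossEntropyLoss()
print(loss_fn(input, target))

# reduce=False: one loss value per sample, shape (4,)
loss_fn = nn.CrossEntropyLoss(reduce=False)
print(loss_fn(input, target))
```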
An in-built Profiler in the autograd engine
We built a low-level profiler to help you identify bottlenecks in your models
Let us start with an example:
```python
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
```
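Only the first line of the original example survives above; a minimal sketch of typical usage with the torch.autograd.profiler.profile context manager looks like this (the tiny graph is illustrative):

```python
import torch
import torch.autograd.profiler
from torch.autograd import Variable

x = Variable(torch.randn(1, 1), requires_grad=True)

# Record autograd events for everything executed inside the context manager.
with torch.autograd.profiler.profile() as prof:
    y = x ** 2
    y.backward()

# Print a table of the recorded operations with their CPU (and CUDA) times.
print(prof)
```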
The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special nvprof prefix. For example:
```
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>
```
Then, you can load trace_name.prof in PyTorch and print a summary profile report.
```python
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
```
Read additional documentation here
Higher order gradients
Added higher-order gradients support for the following layers
- ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
- PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
- MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
- DataParallel
Optimizers
- optim.SparseAdam: Implements a lazy version of Adam algorithm suitable for sparse tensors.
- In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
- Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
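A minimal sketch of add_param_group; the modules and hyper-parameters here are placeholders:

```python
import torch.nn as nn
import torch.optim as optim

base = nn.Linear(10, 10)
optimizer = optim.SGD(base.parameters(), lr=0.1)

# Parameters created later can be registered with the same optimizer,
# optionally with their own hyper-parameters.
head = nn.Linear(10, 2)
optimizer.add_param_group({'params': head.parameters(), 'lr': 0.01})
```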
New layers and nn functionality
- Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
- Added LPPool1d
- F.pad now has support for:
  - ‘reflection’ and ‘replication’ padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
  - constant padding on n-d signals
- nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
- grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
- Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters (see the sketch after this list)
  - parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
  - vector_to_parameters takes a vector of flattened parameters and copies the values over to a network’s parameters
  - Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
- Allow user to not specify certain input dimensions for AdaptivePool*d and infer them at runtime. For example:

```python
# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))
```

- DataParallel container on CPU is now a no-op (instead of erroring out)
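A minimal sketch of the parameters_to_vector / vector_to_parameters round trip; the Linear module and the scaling step are placeholders for a real update:

```python
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = nn.Linear(4, 2)

# Flatten all parameters into a single 1D vector.
flat = parameters_to_vector(net.parameters())

# ... modify the vector (e.g. add exploration noise, take a TRPO step) ...
flat = flat * 0.9

# Copy the modified values back into the network's parameters.
vector_to_parameters(flat, net.parameters())
```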
New Tensor functions and features
- Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
- adds broadcasting support to bitwise operators
- Added Tensor.put_ and torch.take similar to numpy.take and numpy.put (see the sketch after this list).
  - The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
  - The put function copies value into a tensor also using linear indices.
  - Differences from numpy equivalents:
    - numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
    - numpy.put repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
- add zeros and zeros_like for sparse Tensors.
- 1-element Tensors can now be casted to Python scalars. For example: int(torch.Tensor([5])) works now.
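A short sketch of take and put_ with linear indices; the values are arbitrary:

```python
import torch

src = torch.Tensor([[1, 2, 3],
                    [4, 5, 6]])

# take: linear indices into the flattened tensor; the output has the indices' shape.
idx = torch.LongTensor([0, 2, 5])
print(torch.take(src, idx))   # 1, 3, 6

# put_: write values at linear indices, in place.
src.put_(torch.LongTensor([0, 5]), torch.Tensor([10, 60]))
print(src)                    # [[10, 2, 3], [4, 5, 60]]
```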
Other additions
- Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:

```python
>>> torch.cuda.get_device_name(0)
```

- If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
- torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
- torch.cuda.empty_cache() frees the cached memory blocks in PyTorch’s caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
API changes
- softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
- torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
- Remove all instances of device_id and replace it with device, to make things consistent
- torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True. This gets useful when using torch.autograd.grad in large graphs with lists of inputs / outputs (see the sketch after this list). For example:

```python
x, y = Variable(...), Variable(...)
```

- pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
- Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
- torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
- adds zero_() to Variable
- Variable.shape returns the size of the Tensor (now made consistent with Tensor)
- torch.version.cuda specifies the CUDA version that PyTorch was compiled with
- Add a missing function random_ for CUDA.
- torch.load and torch.save can now take a pathlib.Path object, which is a standard Python3 typed filepath object
- If you want to load a model’s state_dict into another model (for example to fine-tune a pre-trained network), load_state_dict was strict on matching the key names of the parameters. Now we provide a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys.
- added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag
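A small self-contained sketch of allow_unused=True; the tensors are arbitrary:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2), requires_grad=True)
y = Variable(torch.ones(2), requires_grad=True)

z = (x * 2).sum()  # y never participates in the graph

# Without allow_unused=True, asking for y's gradient would raise an error.
grad_x, grad_y = torch.autograd.grad(z, [x, y], allow_unused=True)
print(grad_x)  # gradient of z with respect to x
print(grad_y)  # None, because y was unused
```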
Performance Improvements
- The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
- softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the gpu kernels
- 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
- nn.Embedding’s renorm option is much faster on the GPU. For embedding dimensions of 100k x 128 and a batch size of 1024, it is 33x faster.
- All pointwise ops now use OpenMP and get multi-core CPU benefits
- Added dedicated CUDA kernels for group convolutions where groups == nInputPlane (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details as well as this table.
- Fixed optim.SGD’s memory usage for sparse gradients (for ex. nn.Embedding(..., sparse=True)), reducing the usage on a user-provided test script by 10x.
- Optional NNPack integration for faster CPU convolutions (not part of binaries)
- Reduce overhead of broadcasting if Tensors aren’t broadcastable
- torch.nn.utils.weight_norm over the right-most dimensions is faster
- Backward of torch.norm is sped up by ~1.5x
- Improve the performance of pack_padded_sequence
- Add a single-argument version of torch.arange. For example torch.arange(10)
Framework Interoperability
DLPack Interoperability
DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.dlpack.to_dlpack(x) and torch.utils.dlpack.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion
has zero memory copy and hence is very efficient.
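A minimal round-trip sketch through DLPack; the tensor is arbitrary:

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

t = torch.randn(3, 3)

# Wrap the tensor as a DLPack capsule; no data is copied.
capsule = to_dlpack(t)

# A consumer (another framework, or torch itself) can unwrap it, again without a copy.
t2 = from_dlpack(capsule)
```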
Model exporter to ONNX
ONNX is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX
format.
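A rough export sketch; the tiny Linear model and the output file name are placeholders, and the torch.onnx module is described in the list below:

```python
import torch
import torch.nn as nn
import torch.onnx
from torch.autograd import Variable

# Placeholder model and input; any model built from supported ops can be exported.
model = nn.Linear(3, 2)
dummy_input = Variable(torch.randn(1, 3))

# Trace the model with the dummy input and write the resulting ONNX graph to disk.
torch.onnx.export(model, dummy_input, "model.onnx")
```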
- There is a new module torch.onnx (pytorch.org/docs/0.3.0/…) which provides the API for exporting ONNX models.
- The operations supported in this release are:
+ add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
+ expand (only when used before a broadcasting ONNX operator; e.g., add)
+ prelu (single weight shared among input channels not supported)
+ threshold (non-zero threshold/non-zero value not supported)
+ Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
+ elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
+ unfold (experimental support with ATen-Caffe2 integration)
+ Embedding (no optional arguments supported)
+ RNN
+ FeatureDropout (training mode not supported)
+ Index (constant integer and tuple indices supported)
Usability Improvements
- More cogent error messages during indexing of Tensors / Variables
- Add proper error message for specifying dimension on a tensor with no dimensions
- better error messages for Conv*d input shape checking
- More user-friendly error messages for LongTensor indexing
- Better error messages and argument checking for Conv*d routines
- Trying to construct a Tensor from a Variable fails more appropriately
- If you are using a PyTorch binary with insufficient CUDA version, then a warning is printed to the user.
- Fixed incoherent error messages in load_state_dict
- Fix error message for type mismatches with sparse tensors
Bug fixes
torch
- Fix CUDA lazy initialization to not trigger on calls to torch.manual_seed (instead, the calls are queued and run when CUDA is initialized)
Tensor
- if x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]]
- x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
- Tensor constructors with numpy input: torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))
  - torch will now copy the contents of the array in a storage of appropriate type.
  - If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
  - On CUDA, torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)) will now work by making a copy.
- ones_like and zeros_like now create Tensors on the same device as the original Tensor
- torch.multinomial on the CPU would reshape the input prob_dist in-place. Fixed this to make sure the prob_dist input’s shape is unchanged after the call to multinomial
- expand and expand_as allow expanding an empty Tensor to another empty Tensor
- when [..., None, ...] was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
- Fix exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
- torch.HalfTensor supports numpy() and torch.from_numpy
- Add additional size checking for torch.scatter
- fix torch.tril and torch.triu on the GPU for storage-offset Tensors (would return incorrect result).
- Fix a memory leak in CUDA qr decomposition
- Fix stream-awareness issues in THCUNN kernels
- Fix kwargs parsing in torch.topk
- Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
- Fix ZeroDivisionError: float division by zero when printing certain Tensors
- torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
- Add a check in tensor.numpy() that checks if no positional arguments are passed
- Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is contiguous
- any and all work on empty Tensors on the cpu (previously errored out)
- Fix symeig on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
- Improved the numerical stability of torch.var and torch.std by using Welford’s algorithm
- The Random Number Generator returned uniform samples with inconsistent bounds (inconsistency in cpu implementation and running into a cublas bug).
  - Now, all uniform sampled numbers will return within the bounds [0, 1), across all types and devices
- Fix torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
- Allows empty index Tensor for index_select (instead of erroring out)
- Previously when eigenvector=False, symeig returns some unknown value for the eigenvectors. Now we zero them out.
sparse
- Fix bug with ‘coalesced’ calculation in sparse ‘cadd’
- Fixes .type() not converting indices tensor.
- Fixes sparse tensor coalesce on the GPU in corner cases
autograd
- Fixed crashes when calling backwards on leaf variable with requires_grad=False
- fix bug on Variable type() around non-default GPU input.
- when torch.norm returned 0.0, the gradient was NaN. We now use the subgradient at 0.0, so the gradient is 0.0.
- Fix a correctness issue with advanced indexing and higher-order gradients
- torch.prod’s backward was failing on the GPU due to a type error, fixed.
- Advanced Indexing on Variables now allows the index to be a LongTensor-backed Variable
- Variable.cuda() and Tensor.cuda() are consistent in kwargs options
optim
- torch.optim.lr_scheduler is now imported by default.
nn
- Returning a dictionary from a nn.Module’s forward function is now supported (used to throw an error)
- When register_buffer("foo", ...) is called, and self.foo already exists, then instead of silently failing, now raises a KeyError
- Fixed loading of older checkpoints of RNN/LSTM which were missing _data_ptrs attributes.
- nn.Embedding had a hard error when using the max_norm option. This is fixed now.
- when using the max_norm option, the passed-in indices are written upon (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
- F.affine_grid now can take non-contiguous inputs
- EmbeddingBag can accept both 1D and 2D inputs now.
- Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
- fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
- if BatchNorm has only 1 value per channel in total, raise an error in training mode.
- Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
- fix grid_sample backward when gradOutput is a zero-strided Tensor
- Fix a segmentation fault when reflection padding is out of Tensor bounds.
- If LogSoftmax has only 1 element, -inf was returned. Now this correctly returns 0.0
- Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
- Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
- Fixed ELU higher order gradients when applied in-place
- Workaround a CuDNN RNN bug for half-precision
- Prevent numerical issues with poisson_nll_loss when log_input=False by adding a small epsilon
distributed and multi-gpu
- Allow kwargs-only inputs to DataParallel. This used to fail: n = nn.DataParallel(Net()); out = n(input=i)
- DistributedDataParallel calculates num_samples correctly in python2
- Fix the case of DistributedDataParallel when 1-GPU per process is used.
- Allow some params to be requires_grad=False in DistributedDataParallel
- Fixed DataParallel to specify GPUs that don’t include GPU-0
- DistributedDataParallel’s exit doesn’t error out anymore, the daemon flag is set.
- Fix a bug in DistributedDataParallel in the case when model has no buffers (previously raised incoherent error)
- Fix __get_state__ to be functional in DistributedDataParallel (was returning nothing)
- Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other
Others
- model.zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib
- Fix error when default_collate is passed a collection of numpy.str_