Proto commits in ezyang/pytorch-unattached

These 58 commits changed the Protocol Buffers files:

Commit:9aa92bc
Author:Junjie Bai
Committer:Facebook Github Bot

Change the default value of DeviceOption.numa_node_id from -1 to 0 (#10877) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10877 change default value of DeviceOption.numa_node_id to 0 and use has_numa_node_id() to check existence Reviewed By: ilia-cher Differential Revision: D9473891 fbshipit-source-id: 91ac6a152f445644691023110c93d20a3ce80d43
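A minimal Python sketch of why the new default goes together with a presence check. It assumes proto2 semantics, where an unset optional scalar reads back as its declared default: once the default is 0, a sentinel test such as `numa_node_id == -1` can no longer tell "unset" from "node 0". `DeviceOptionSketch` and `pick_numa_node` are hypothetical stand-ins, not generated Caffe2 code. With the real generated Python classes, the equivalent presence check would be `HasField("numa_node_id")`.

```python
from typing import Optional


class DeviceOptionSketch:
    """Hypothetical stand-in for a proto2 message with one optional int field."""

    DEFAULT_NUMA_NODE_ID = 0  # the new default; it used to be -1

    def __init__(self) -> None:
        self._numa_node_id: Optional[int] = None  # None means "never set"

    @property
    def numa_node_id(self) -> int:
        # Like a generated getter: an unset field reads as the declared default.
        if self._numa_node_id is None:
            return self.DEFAULT_NUMA_NODE_ID
        return self._numa_node_id

    @numa_node_id.setter
    def numa_node_id(self, value: int) -> None:
        self._numa_node_id = value

    def has_numa_node_id(self) -> bool:
        # Like the generated has_*() accessor: reports presence, not value.
        return self._numa_node_id is not None


def pick_numa_node(opt: DeviceOptionSketch) -> Optional[int]:
    # Test presence instead of comparing with a sentinel value, because an
    # unset field and an explicit node 0 now read back identically.
    return opt.numa_node_id if opt.has_numa_node_id() else None
```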

Commit:40109b1
Author:Yangqing Jia
Committer:Facebook Github Bot

Remove caffe1-specific proto (#10380) Summary: This was used as a convenient way for us to convert c1 models. Now that conversion is more or less done, we should probably require any users who need to convert c1 models to explicitly install c1. This PR removes the explicit c1 proto (which was copied from c1) in favor of explicit installation. Note that caffe_translator still works properly; the only difference is that users now need to install c1 separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/10380 Differential Revision: D9267981 Pulled By: Yangqing fbshipit-source-id: a6ce5d9463e6567976da83f2d08b2c3d94d14390

Commit:3693941
Author:Edward Yang
Committer:Facebook Github Bot

Introduce at::DeviceType, which subsumes at::Device::Type and (partially) caffe2::DeviceType (#10175) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10175 Previously, we had at::Device::Type and caffe2::DeviceType (from protobuf), intended to help us distinguish between CPU, CUDA, etc. devices. This replaces at::Device::Type entirely with at::DeviceType, which in turn is a direct, 'enum class' version of the protobuf-generated caffe2::DeviceType 'enum'. We can't eliminate the 'enum' because this would be a pretty drastic API change (enum is interconvertible with integers, enum class is not), but we can make the two line up exactly and share code for, e.g., printing. Reviewed By: Yangqing Differential Revision: D9137156 fbshipit-source-id: 566385cd6efb1ed722b25e6f7849a910b50342ab
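The enum versus enum class distinction can be illustrated by analogy in Python. `IntEnum` members compare equal to plain integers, like a C++ `enum`, while plain `Enum` members do not, like an `enum class`. The sketch assumes both enumerations use the same numeric values, which is what lets conversion and shared printing code work by value. The class names are hypothetical and only CPU and CUDA are shown.

```python
from enum import Enum, IntEnum


class ProtoDeviceType(IntEnum):
    """Analogy for the protobuf-generated C++ 'enum': behaves like an int."""
    CPU = 0
    CUDA = 1


class DeviceType(Enum):
    """Analogy for a C++ 'enum class': a distinct type, not an int."""
    CPU = 0
    CUDA = 1


def to_device_type(t: ProtoDeviceType) -> DeviceType:
    # The two enumerations line up value for value, so conversion is a lookup.
    return DeviceType(int(t))


def device_name(t: DeviceType) -> str:
    # Shared printing logic keyed on the common enumerator names.
    return t.name
```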

Commit:553c41f
Author:Yavuz Yetim
Committer:Facebook Github Bot

Adds serialization path (#9035) Summary: Closes https://github.com/pytorch/pytorch/pull/9035 This diff builds on the structure in the stacked diff to add serialization/deserialization. It supports the old format and a new suggested format. Reviewed By: ilia-cher Differential Revision: D8415115 fbshipit-source-id: acaacce2b015f4c6ac0ae22625455290a3f30262

Commit:edb88b5
Author:Orion Reblitz-Richardson
Committer:GitHub

Update from Facebook (#8887)
* add opencl + fpga context. Adds an OpenCL context inside caffe2/fb which can be used for FPGA access.
* [Caffe2] Force tensor inference checks to be triggered during testing. We've started to rely more on TensorInference functions for different analyses. This diff ensures that the TensorInference function's result matches what is expected from the definition of the operator.
* Enable building //caffe2:torch with @mode/opt. In @mode/opt, Python runs out of a PAR, which breaks a lot of assumptions in the code about where templates/ folders live relative to __file__. Rather than introduce hacks with parutil, I simply turn template_path into a parameter for all the relevant functions and thread it through from the top level.
* [Caffe2] Fix cost models for DotProduct and Div. Update Tensor Inference for dot product. As title. DotProduct states that its output is a 1-D tensor (https://caffe2.ai/docs/operators-catalogue.html#dotproduct), though the code suggests it is either 0-D or 1-D depending on the inputs. TensorInference is defined to support the implementation.
* [SG-MoE] Add an option to make the experts NOT as components
* [nomnigraph] Rename and fix up convertToNeuralNetOperator API. This will make things a bit cleaner.
* No longer symlink THNN.h and THCUNN.h
* forced decoder network (onnx export). Closes https://github.com/pytorch/translate/pull/95. Add networks in ensemble_export.py to create a forced decoding network from PyTorch NMT checkpoints. This network takes an arbitrary numberized (source, target) pair and returns the model score for the translation, including penalties. Vocabulary reduction networks are also supported, but note that target indices which are not in the possible_translation_tokens generated for the source input will be trea
* Revert schema change to fix production models
* MockLogDeviceReader - rebase on FIX. Goal: (1) build a make_mock_log_device_reader using make_mock_reader; (2) replace the real log_device_reader here: https://fburl.com/raihwf1p. Log by D8151734. Real log_device_reader: ``` I0529 20:29:05.373108 954994 tensor.h:839] Tensor print_net/log of type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >. Dims: (): read_net/ParseOpenTrainingRow:0 I0529 20:29:05.373244 954994 tensor.h:839] Tensor read_net/ParseOpenTrainin
* [C2/D2][1/n]: Nonnegative-Constrained Optimization -- log barrier. Implement log barrier as a regularization method.
* Add teacher weight screening. Add teacher weight screening according to teacher labels. If the teacher label is zero, we do not use the distill loss in the objective function.
* Add NormalizerContext. See task for more detail. This implementation is a copy of what exists for RegularizerContext except for how the parameters are defined in the model_definition thrift file. I'll try an alternative implementation which instead overrides the default arguments of functions, like argscopes in TensorFlow. https://github.com/pytorch/pytorch/compare/master...MaximeBoucher:update-from-facebook-0939578c068c?expand=1
* Adding cosine similarity option in dot processor. Add pairwise cosine similarity option in dot product. Add an option to concatenate dot product and cosine similarity. Add test cases.
* [nomnigraph][redo] Concat elim for sparseNN. Same as D7962948, which was reverted because Operator Schema was not defined.
* [pytorch] Revert pytorch/pytorch#7918 'Release GIL when copying to shared memory', breaks ASAN. Revert this PyTorch diff that breaks ASAN when running Filament in dev mode; in opt mode it gives "bad file descriptor" errors. Looks like a race when copying tensors to shared memory in multiple mp.Queue's (which spawn separate threads). https://github.com/pytorch/pytorch/pull/7918/files
* [nomnigraph][mobile] Enable nomnigraph by default, use -Oz on nomnigraph-related code to reduce code size. Enables nomnigraph and reduces code size.
* [Warmup] Allow both offline incremental training and online training. Change plan name on saving side and reading side to support both training types. This diff depends on D8128530 and D8168651.
* Revert D7802642: [Warmup] Allow both offline incremental training and online training. This reverts commit afc213cf9b36cecf75333a788391c4d09f4afccc @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files
* Add legacy grad logic to fix div op on old graphs.
* Correctly propagate operator failures. Propagate errors from operators that throw exceptions and return false.
* Revert D8374829: [caffe2][nomnigraph][redo] Concat elim for sparseNN. This reverts commit 6dda028c463e54bb5c32188bbbe9202107e188a5 @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files
* [Caffe2] Added extra_info to core.DeviceOption(), enforced extra_info to be inherited in scope.DeviceScope. extra_info is a newly defined field in the DeviceOption proto. This diff added extra_info to core.DeviceOption(), and in scope.DeviceScope() it enforces that the new scope inherits the extra_info from the old scope.
* [opt] hgdirsync wasn't enabled, merge diverged code. Here's the damage: P59732616. Basically, xplat was left behind but had the change from assert to CAFFE_ENFORCE.
* OMP parallelism over RoIs for RoIAlign op. Simpler to parallelize over RoIs. Shouldn't affect other uses as it relies on the number of OMP threads set during startup. PR: https://github.com/pytorch/pytorch/pull/8562
* Use int64_t for shape in FillOps to avoid overflow of int32
* Implement Rotated RoIAlign op. Based on Rotated RPNs as explained in https://arxiv.org/abs/1703.01086. The idea is simple: orientation/angle is added as an RPN anchor parameter and then the angle is further regressed similar to bbox coords. There are some additional changes related to NMS and IoU, but besides that it's a direct extension to Faster-RCNN. Further details in https://fb.quip.com/sZHlA1iMfWPZ. RoIs are represented in [center_x, center_y, width, height, angle] format. `angle` repre
* Rotated RoIAlign op CUDA forward implementation. CUDA forward impl for D8415490.
* RoIAlignRotated op CUDA backward pass implementation. TSIA
* All remaining fixes to eliminate process_github.sh. Most of this diff has already been reviewed separately, except for the parts relating to _thnn/utils.py and _utils._internal.py: remove skipIf(True, 'Fbcode') line from process_github.sh; replace sed of cpp file with #ifdef to control cudnnDestroy use; undo sync-time deletion of .gitattributes; remove process_github.sh; switch to using _utils._internal rather than try-import-except. This diff also fixes the open-source bug where rebuilds have
* Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training". Original commit changeset: 7707d2efe60e. The original diff was backed out because the online trainer package was backed out. This code would only work with the new online trainer package.
* [easy] improve error log in adagrad op. As title.
* re-allow use of thnn_h_path. This fixes cffi usage in OSS.
* [4/4] [tum] paralyzing layerNorm for GPU full sync. As title.
* add compile=False to pytorch tests, remove hack with pyc
* Add shape and type inference for RowWiseArgMax operator. See title.
* Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training". This reverts commit 78167eeef0af16b60f72c82f9dcdda9b41b4dcbd @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files
* [fix-flaky-test] mock_hive_reader_test is flaky because GlobalCounter collects local counts at intervals. Problem: `MockHiveReader` uses `GlobalCounter` to limit `max_examples`. GlobalCounter on the server node collects local counts from worker nodes every 1 sec. This 1 sec delay makes it impossible to limit exactly to `max_examples`; it will definitely exceed `max_examples`. Plan: Given, ``` Expected num_examples = max_examples + num_examples/sec (Read Speed) x 1 sec (GlobalCounter Sync Int
* [Caffe2] Fix FCGradient cost inference. Prevent overflow in cost inference. FCGradient missed a factor 2 in the `num_outputs == 3` case. Overflow was occurring with flop calculation for FC. Changed types to `uint64_t` to prevent future problems.
* Fix binary ops with empty inputs
* Support the filling of input blob with provided data. As title, for Biz Integrity case.
* Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"". Original commit changeset: 30c55dd38816. The original diff was reverted due to introducing a bad integration test. Fixed the integration test.
* [c2][easy] improve pack ops error logging. As desc.
* Add ShapeTypeInference for LpNorm operator. As desc.
* Shard test_nn to reduce runtime for each test target. Closes https://github.com/pytorch/pytorch/pull/8793. The current test_nn would time out and be disabled in GreenWarden, and we need to have an option to split it up in order to pass the stress test. Right now GreenWarden roughly allows running 100 test cases in test_nn before timing out, and here we have an option to divide test_nn into 30 shards (with ~40 tests in each shard) to allow for some test suite growth in the future.
* Change default caffe2_streams_per_gpu to 1
* Remove IN_SANDCASTLE from common.py and test_nn.py. We prefer to disable the failing tests through Sandcastle UI instead.
* Add a new class for an updated prof_dag.proto. This diff contains: (1) an updated prof_dag.proto that contains blob profiles; (2) a class to deserialize this information (serialization is in a follow-up diff); (3) an update to separate profiling information from NeuralNet (and use it as part of the class above); (4) unit tests.
* Lambdarank for SparseNN. This diff adds a lambda_rank_layer for SparseNN. Changes include: (1) adds support for multi sessions in c2 op; (2) adds support for two different loss functions in c2 op; (3) unit tests for op.
* Revert D8586950: Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"". This reverts commit 012220ed63eccc35659a57b31d16a3625da6317b @bypass-lint An infra SEV is better than not reverting this diff. If you copy this password, see you in SEV Review! @cause_a_sev_many_files
* [easy] A few fixups to multithread predictor benchmark: (1) support perf on T6 server; (2) remove dead code.
* fix a bug about the map size. As title.
* Fix reduce sum on in-place case.
* [Warmup] Reland reverted diff: Allow both offline incremental training and online training. Closes https://github.com/pytorch/pytorch/pull/8827. Fix net transform integration test. Allow offline and online trainer to coexist D7802642.
* Add StoreHandlerNotAvailableException. Add an exception for a store that is not available or has been deleted.
* Use exception handling for fault tolerance, missing KV store. Remove status blobs from communication ops so that exceptions propagate on failure.
* [C2/D2][2/n]: Nonnegative-Constrained Optimization -- bounded grad proj for simple bounded constrained optimization, incl. non-negative box constraints.
* [GanH]: Adaptive Weighting with More Estimations. With positivity optimization implemented, we now learn adaptive weights with different parameterizations. This improves parameter estimation and training stability.
* Revert some changes for landing
* Remove AutoNoGIL in StorageSharing
* Temporarily disable net_tests
* Revert "[Caffe2] Force tensor inference checks to be triggered during testing". This reverts commit 67ef05c22b2f71b4a489695384932f968384a2a4.
* Revert "Fix reduce sum on in-place case." This reverts commit 6cb8a8e1b3db7b6d20941b0053e3f3836068eb64.
* Revert "Revert "Fix reduce sum on in-place case."" This reverts commit 130a257c0893dc09f4bd6e6a45d112261807fd2c.
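A small Python sketch of the overflow described in the FCGradient cost-inference item. It assumes a simple GEMM cost model: two flops per multiply-add, dW and the bias gradient always computed, and dX added only when `num_outputs == 3`. Caffe2's actual cost function may count differently. Python integers never overflow, so the hypothetical `wrap_int32` helper emulates 32-bit two's-complement arithmetic to show what a narrow counter would report.

```python
def wrap_int32(x: int) -> int:
    """Emulate two's-complement 32-bit wraparound (Python ints never overflow)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def fc_gradient_flops(m: int, k: int, n: int, num_outputs: int) -> int:
    """Rough FCGradient flop count under an assumed GEMM cost model.

    X is m x k, W is n x k, dY is m x n. Each multiply-add counts as 2 flops.
    """
    flops = 2 * m * n * k  # dW = dY^T * X
    flops += m * n         # db = column sums of dY (approximate count)
    if num_outputs == 3:
        flops += 2 * m * n * k  # dX = dY * W, with the same factor of 2
    return flops
```

For M = K = N = 1024 the sketch counts 2^32 + 2^20 flops when dX is requested, which a 32-bit counter would wrap to 2^20; even without dX, the count of 2^31 + 2^20 already turns negative in 32 bits. A 64-bit type such as `uint64_t` holds both values.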

Commit:7ca8e2f
Author:Yangqing Jia
Committer:Lu Fang

fix old comment to point to the right file (#8416)

The documentation is generated from this commit.

Commit:e5b9972
Author:bddppq
Committer:GitHub

[Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM" This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators" This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
* Resolve merge conflicts
* .
* Update GetAsyncNetHIPThreadPool
* Enable BUILD_CAFFE2 in pytorch build
* Unify USE_HIP and USE_ROCM
* always check USE_ROCM
* .
* remove unrelated change
* move all core hip files to separate subdirectory
* .
* .
* recurse glob core directory
* .
* correct include
* .
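One way to see why hip_gpu_id needed its own field number: on the protobuf wire, a field is identified only by its number, encoded in a key of `(field_number << 3) | wire_type`. A reader that does not know a number skips that field, but a reader that does know it decodes the bytes with its own meaning, so reusing an existing number for a new field would change how older readers interpret the data. The Python sketch below hand-encodes varint fields to show this. The field numbers are illustrative, not taken from caffe2.proto, and only non-negative values are supported.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("this sketch only encodes non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint (wire type 0) field: key, then value."""
    key = (field_number << 3) | 0
    return encode_varint(key) + encode_varint(value)
```

Encoding the value 150 as field 1 gives the bytes `08 96 01`. The same value under field 2 starts with the key byte `0x10` instead, so the field number is the only thing a parser has to tell the two apart.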

Commit:82b981e
Author:Bram Wasti
Committer:Soumith Chintala

Update from facebook 1ee4edd286a3 (#8040)
* Adding instance weight to batch distill loss. As title.
* add bfloat 16-31. Added bfloat 16-31 and their respective unit tests.
* [CUDA9] Upgrade - fbcode. CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching the fbcode TARGETS file (adding nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan"). This diff can only be committed if: 1. CUDA 9 rpm is rolled out fleet-wide (TBD); 2. NVIDIA driver 390.40 is rolled out fleet-wide (done); 3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done); 4. Make sure all dependents are built (done); 5. Test all C2 operators, PyTorch (see test plan).
* Share intermediate int32 buffer across Conv ops. Adding a known type.
* [C2 fix] infer function for ensure_cpu_output_op. This adds the missing device function for ensure_cpu_output_op.
* [int8] Add blob serializer/deserializer for Int8TensorCPU. To export to logfiledb.
* [nomnigraph] Add try catch block to optimization passes in predictor. This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE. CAFFE_ENFORCE uses a stack trace fetcher, which is currently a global static variable. If CAFFE_ENFORCE is used at static initialization time, this is a SIOF. Recently CAFFE_ENFORCE was added into init functions registration, so we started to see this. A Meyers singleton is going to provide safety here. If the stack trace fetcher was not registered yet, it will just use a dummy one.
* NUMA support in SparseNN CPU benchmark. Adding support for NUMA in SparseNN CPU benchmark.
* [mobile-roofline] Add logging needed for roofline model. This should be all that's needed.
* Let the operators use the same input if they are not chained; otherwise, we have to change the input data dims
* fix null-pointer-use UBSAN errors in reshape_op.h
* revert previous fix on input blob name. As title.
* Adding flag to let MineHardNegative automatically extract single value from dict. Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of a dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state

Commit:966c658
Author:bddppq
Committer:Orion Reblitz-Richardson

Revert "[Caffe2] Enabling AMD GPU Backend for Caffe2" (#7802) * Revert "[auto] Update onnx to 4898c9e - Added TensorDenotation and metadata_props for images (onnx/onnx#879) https://github.com/onnx/onnx/commit/4898c9e925b57ad4c58518515b98de66966ad3b1" This reverts commit 9c679dab5fe7cac27bb8c783fd143276e6046ef1. * Revert "Add BiasCHW fallback for GPU (#7738)" This reverts commit 14ad2e74f108d13ec98abb078f6aa7f01aae0aad. * Revert "[Caffe2] Enabling AMD GPU Backend for Caffe2 (#7566)" This reverts commit 2ebcf4bb37739733e76b754284cf8b2ffcba1c30.

Commit:2ebcf4b
Author:Peter Yeh
Committer:bddppq

[Caffe2] Enabling AMD GPU Backend for Caffe2 (#7566)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM" This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators" This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.

Commit:26ddefb
Author:Jinghui
Committer:Yinghai Lu

[feature request] [Caffe2] Enable MKLDNN support for inference (#6699)
* Add operators based on IDEEP interfaces. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Enable IDEEP as a caffe2 device. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Add test cases for IDEEP ops. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Add IDEEP as a caffe2 submodule. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Skip test cases if no IDEEP support. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Correct cmake options for IDEEP. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Add dependencies on ideep libraries. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Fix issues in IDEEP conv ops, etc. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Move ideep from caffe2/ideep to caffe2/contrib/ideep. Signed-off-by: Gu Jinghui <jinghui.gu@intel.com>
* Update IDEEP to fix cmake issue. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Fix cmake issue caused by USE_MKL option. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Correct comments in MKL cmake file. Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>

Commit:a73b3fd
Author:Bram Wasti
Committer:GitHub

[caffe2][opencl] Add OpenCL context (#6777)

Commit:6223bfd
Author:Orion Reblitz-Richardson
Committer:bddppq

Update from Facebook (#6692)
* [GanH][Easy]: Add assertion to adaptive weighting layer. 0 weight causes numeric instability and exploding ne
* [Easy] Add cast op before computing norm in diagnose options. As LpNorm only takes floats, we add a manual cast here.
* Introduce a new caching device allocator. `cudaMalloc` and `cudaFree` calls are slow, and become slower the more GPUs there are. Essentially, they grab a host-wide (not device-wide) lock because GPU memory is transparently shared across all GPUs. Normally, this isn't much of a concern since workloads allocate memory upfront and reuse it during later computation. However, under some computation models (specifically, memory-conserving approaches like checkpoint-and-recompute, see https://medium.com/@yaroslavvb/fitting-larger-networks-into-memory-583e3c758ff9) this assumption is no longer true. In these situations, `cudaMalloc` and `cudaFree` calls are common and frequent. Furthermore, in data parallel contexts, these calls happen at nearly the same time from all GPUs, worsening lock contention. A common solution to this problem is to add a custom allocator. In fact, NVIDIA provides one out of the box: CUB, which Caffe2 already supports. Unfortunately, the CUB allocator suffers from very high fragmentation. This is primarily because it is a "buddy" allocator which neither splits nor merges free cached blocks. Study https://github.com/NVlabs/cub/blob/1.8.0/cub/util_allocator.cuh#L357 if you want to convince yourself. This diff adapts a caching allocator from the Torch codebase (https://github.com/torch/cutorch/blob/master/lib/THC/THCCachingAllocator.cpp) which does splitting and merging and ends up working really well, at least for workloads like the checkpoint-and-recompute computation models noted above. I simplified the implementation a little bit and made it a bit more C++-like. I also removed a bunch of stream synchronization primitives for this diff. I plan to add them back in subsequent diffs.
* Report reader progress in fblearner workflows. Integrate with the fblearner progress reporting API and add support to report training progress from reader nodes. If the reader is constructed with batch limits, report based on finished batches vs. total batches. The finished batch count may exceed the total because we evaluate whether we should stop processing every time we dequeue a split. If the reader has no limit, report based on finished splits (Hive files) vs. total splits. This is fairly accurate.
* [GanH][Diagnose]: fix plotting. (1) ganh diagnose needs to set plot options; (2) the modifier's blob name is used for the metric field and needs to be fixed before generating the net.
* Automatic update of fbcode/onnx to 985af3f5a0f7e7d29bc0ee6b13047e7ead9c90c8
* Make CompositeReader stop as soon as one reader finishes. Previously, CompositeReader called all readers before stopping. This resulted in a flaky test, since the last batch may be read by different threads, resulting in dropped data.
* [dper] make sure loss is not nan. As desc.
* [rosetta2] [mobile-vision] Option to export NHWC order for RoIWarp/RoIAlign. Thanks for finding this, @stzpz and @wangyanghan. Looks like NHWC is more optimized. For OCR, though, it doesn't help yet since NHWC uses more memory bandwidth, but it will soon become important.
* Intra-op parallel FC operator
* [C2 Proto] extra info in device option. Passing extra information in device option. Design doc: https://fb.quip.com/yAiuAXkRXZGx
* Unregister MKL fallbacks for NCHW conversions
* Tracing for more executors. Modified Tracer to work with other executors and add more tracing.
* Remove ShiftActivationDevices()
* Check for blob entry iff it is present. When processing the placeholder ops, ignore it if the blob is not present in blob_to_device.
* Internalize use of eigen tensor. Move use of eigen tensor out of the header file so we don't get template partial specialization errors when building other libraries.
* feature importance for transformed features.
* Fix unused parameter warnings. The changes in this diff comment out unused parameters. This will allow us to enable -Wunused-parameter as an error. #accept2ship
* add opencv dependencies to caffe2. The video input op requires additional opencv packages. This adds them to cmake so that it can build.
* Add clip_by_value option in gradient clipping. When the value is bigger than max or smaller than min, clip it.
* std::round compat
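The splitting and merging that the caching allocator item credits for lower fragmentation can be modeled in a few lines of Python. This is a hypothetical first-fit free list over one pre-reserved region, not the THCCachingAllocator algorithm itself. It has no size rounding, no per-stream bookkeeping, and no real device memory; `CachingArena` and its methods are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    offset: int
    size: int
    free: bool = True


class CachingArena:
    """Toy caching allocator over one pre-reserved region (hypothetical).

    Oversized free blocks are split on allocation, and adjacent free blocks
    are merged on release. No real device memory or streams are involved.
    """

    def __init__(self, capacity: int) -> None:
        # One big block stands in for memory obtained by a single cudaMalloc.
        self.blocks: List[Block] = [Block(0, capacity)]

    def malloc(self, size: int) -> int:
        for i, blk in enumerate(self.blocks):  # first fit, in address order
            if blk.free and blk.size >= size:
                if blk.size > size:
                    # Split: the unused tail stays cached as a smaller free block.
                    self.blocks.insert(i + 1, Block(blk.offset + size, blk.size - size))
                    blk.size = size
                blk.free = False
                return blk.offset
        raise MemoryError(f"no cached block can hold {size} bytes")

    def release(self, offset: int) -> None:
        matches = [i for i, b in enumerate(self.blocks)
                   if b.offset == offset and not b.free]
        if not matches:
            raise ValueError(f"no allocated block at offset {offset}")
        i = matches[0]
        self.blocks[i].free = True
        # Merge with the following block, then with the preceding one.
        if i + 1 < len(self.blocks) and self.blocks[i + 1].free:
            self.blocks[i].size += self.blocks.pop(i + 1).size
        if i > 0 and self.blocks[i - 1].free:
            self.blocks[i - 1].size += self.blocks.pop(i).size

    def largest_free(self) -> int:
        return max((b.size for b in self.blocks if b.free), default=0)
```

After two adjacent 30-byte blocks are released, the merge step leaves a single 60-byte free block, so a later 50-byte request is served from the cache. An allocator that never merged free blocks would have to obtain fresh memory for it.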

Commit:9e71de3
Author:Dmytro Dzhulgakov
Committer:Dmytro Dzhulgakov

[core] Graph-level NUMA awareness in Caffe2 Adding NUMA awareness through numa_node_id in DeviceOption. Blobs of operators with numa_node_id are allocated on the corresponding memory banks, and the operators run on CPU pools with NUMA affinity set.
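A minimal Python sketch of the dispatch side of this design. It assumes one worker pool per NUMA node plus a shared default pool for operators without a `numa_node_id`. `NumaPools` and `current_worker` are hypothetical. Real NUMA affinity (pinning threads and binding memory to a node) needs OS-level support and is not modeled here; the thread-name prefix only makes the routing decision observable.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


class NumaPools:
    """Hypothetical router: one executor per NUMA node plus a default one.

    Thread pinning and NUMA-local allocation are NOT done here.
    """

    def __init__(self, num_nodes: int, threads_per_node: int = 2) -> None:
        self.pools: Dict[int, ThreadPoolExecutor] = {
            node: ThreadPoolExecutor(threads_per_node, thread_name_prefix=f"numa{node}")
            for node in range(num_nodes)
        }
        self.default = ThreadPoolExecutor(threads_per_node, thread_name_prefix="default")

    def run(self, op: Callable[[], Any], numa_node_id: Optional[int] = None) -> Any:
        # Operators without a NUMA node go to the shared default pool.
        pool = self.default if numa_node_id is None else self.pools[numa_node_id]
        return pool.submit(op).result()

    def shutdown(self) -> None:
        for pool in [*self.pools.values(), self.default]:
            pool.shutdown(wait=True)


def current_worker() -> str:
    # Name of the thread running the caller; handy for checking the routing.
    return threading.current_thread().name
```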

Commit:4e5df5c
Author:Bram Wasti
Committer:Bram Wasti

added debug info to OperatorDef

Commit:4847f8c
Author:Yan Shang
Committer:Facebook Github Bot

Remove unused field in tensor proto Summary: This new field is not needed anymore, so this diff removes it Reviewed By: kennyhorror Differential Revision: D6316744 fbshipit-source-id: f8afc1c42a0592fd03c7939f8e6f78afc8510ec9

Commit:fc8532c
Author:Yangqing Jia
Committer:Facebook Github Bot

Allow serialization of custom types inside Tensor Summary: The use case is that sometimes we need a Tensor of a custom type instead of POD or string. This diff allows one to delegate to BlobSerializerBase to further serialize the contents inside the Tensor. Design choices: (1) Each element is serialized as a BlobProto string and stored in the repeated string field. (2) UNDEFINED is used as the enum value for the tensor data type, and the exact type string is stored in the additional field. (3) BlobSerializer is called on each item to obtain the serialized string. (4) This requires the custom type to have a copy constructor; otherwise it is simply not possible to copy over the deserialized content without an explicit type. See blob_test.cc for an example. Reviewed By: sunnieshang Differential Revision: D6300196 fbshipit-source-id: 18bf94a22a07337e0fa83d3f1004b3651e38cf27
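A minimal Python sketch of design choices (1) to (3), with a dictionary standing in for TensorProto and JSON standing in for BlobProto strings. The keys `data_type`, `type_name`, and `string_data` and the `REGISTRY` lookup are hypothetical stand-ins for the real proto fields and the BlobSerializer registry. The C++ copy-constructor requirement in (4) has no direct Python counterpart; here the registered deserializer simply constructs each element.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical registry: exact type name -> (serialize, deserialize).
# It stands in for looking up a BlobSerializer for each element's type.
Codec = Tuple[Callable[[Any], str], Callable[[str], Any]]
REGISTRY: Dict[str, Codec] = {}


def register(type_name: str, serialize: Callable[[Any], str],
             deserialize: Callable[[str], Any]) -> None:
    REGISTRY[type_name] = (serialize, deserialize)


def serialize_tensor(items: List[Any], type_name: str) -> Dict[str, Any]:
    serialize, _ = REGISTRY[type_name]
    return {
        "data_type": "UNDEFINED",  # not one of the built-in element types
        "type_name": type_name,    # the exact type, kept in an extra field
        "string_data": [serialize(x) for x in items],  # one string per element
    }


def deserialize_tensor(proto: Dict[str, Any]) -> List[Any]:
    if proto["data_type"] != "UNDEFINED":
        raise ValueError("not a custom-type tensor")
    _, deserialize = REGISTRY[proto["type_name"]]
    return [deserialize(s) for s in proto["string_data"]]


@dataclass
class Point:  # example custom element type
    x: int
    y: int


register("Point",
         lambda p: json.dumps([p.x, p.y]),
         lambda s: Point(*json.loads(s)))
```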

Commit:d2e94d0
Author:Bram Wasti
Committer:Facebook Github Bot

change device enums to be contiguous Summary: quick change Reviewed By: ajtulloch Differential Revision: D5976025 fbshipit-source-id: a5a1538a380edb7c3b0af76e74c2ccee09ecb928

Commit:49396c6
Author:Bram Wasti
Committer:Facebook Github Bot

add openglv2 to experimental Summary: only changes needing review are in proto_utils.cc and caffe2.proto Reviewed By: jerryzh168 Differential Revision: D5956743 fbshipit-source-id: e03fffaf5bc8413f2320c20a89a421f1a69b2870

Commit:68f3584
Author:Alisson Gusatti Azzolini
Committer:Facebook Github Bot

Add node_name to DeviceOption Summary: Allow for generalizing net transforms. Reviewed By: Yangqing Differential Revision: D5812140 fbshipit-source-id: e3f30acad362ae1f0614ee218d331b525710b88e

Commit:e33dfe9
Author:Ilia Cherniavskii
Committer:Facebook Github Bot

Update proto definition Summary: Update Argument's definition to allow direct passing of NetDef Reviewed By: azzolini Differential Revision: D5681837 fbshipit-source-id: e6c618bff051f9bbc56075c796aeba0094fa97dd

Commit:65112f3
Author:Yangqing Jia
Committer:Facebook Github Bot

code cleanup: separate the several net implementations into separate files. Summary: TSIA. Reviewed By: harouwu Differential Revision: D5670906 fbshipit-source-id: 507e789978144341bf696fb20dc11f3c2d55493b

Commit:5d24a4e
Author:Yangqing Jia
Committer:Facebook Github Bot

Early design for a general Event abstraction across devices. Summary: There are ad-hoc efforts to avoid excessive device synchronizations, such as async_dag, singlethread_async, etc. This diff aims to provide an early design for a general Event class that achieves the following: (1) It is device agnostic, essentially using a vtable to do cross-device record, wait and synchronization. (2) New functions WaitEvent and Record are created in the Context class for interacting with Events. (3) The corresponding WaitEvent and Record functions are exposed in the OperatorBase class as well. An example use case is that, after potential future refactoring, one can achieve real async execution per operator by running op.WaitEvent(previous_event); op.RunAsync(); op.RecordEvent(this_op_event); and the next op can do next_op.WaitEvent(this_op_event); Right now, I changed the async_dag net implementation so that it uses the general event design. The old Event class is assimilated into the general Event class, and the old Stream class is now essentially taken over by the Context class itself. Reviewed By: harouwu Differential Revision: D5648463 fbshipit-source-id: 58bd84d06e4a9977b0b835110ddb2f18be3b7cbc
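The WaitEvent / RunAsync / RecordEvent pattern can be sketched on CPU threads in Python, with `threading.Event` standing in for a device event. `Event`, `Op`, and `demo` are hypothetical. They model only the ordering guarantee (an op does not start until the event it waits on has been recorded), not device streams or the vtable-based cross-device dispatch the commit describes.

```python
import threading
from typing import Callable, List, Optional


class Event:
    """Hypothetical device-agnostic event, backed here by threading.Event."""

    def __init__(self) -> None:
        self._flag = threading.Event()

    def record(self) -> None:
        self._flag.set()

    def wait(self, timeout: float = 5.0) -> None:
        if not self._flag.wait(timeout):
            raise TimeoutError("event was never recorded")


class Op:
    """Hypothetical operator: wait on a predecessor, run, then record."""

    def __init__(self, fn: Callable[[], None], wait_on: Optional[Event] = None) -> None:
        self.fn = fn
        self.wait_on = wait_on
        self.done = Event()

    def run_async(self) -> threading.Thread:
        def body() -> None:
            if self.wait_on is not None:
                self.wait_on.wait()  # op.WaitEvent(previous_event)
            self.fn()                # the op's own work
            self.done.record()       # op.RecordEvent(this_op_event)

        thread = threading.Thread(target=body)
        thread.start()
        return thread


def demo() -> List[str]:
    log: List[str] = []
    a = Op(lambda: log.append("a"))
    b = Op(lambda: log.append("b"), wait_on=a.done)
    threads = [b.run_async(), a.run_async()]  # start b first on purpose
    for t in threads:
        t.join()
    return log
```

`demo()` starts `b` before `a`, yet the log still ends up as `["a", "b"]`, because `b` blocks on `a.done` until `a` records it.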

Commit:14950a9
Author:Lei Chen
Committer:Facebook Github Bot

Support session in distributed realtime trainer Summary: Convert from PlanDef ProtoBuf into python Plan object by recursively creating Nets and ExecutionSteps. Also support running Plan object directly in Session. Reviewed By: azzolini Differential Revision: D5608393 fbshipit-source-id: c0ae3b6da743a759af6db3b614a5a3935fe0b34c

Commit:7df8598
Author:Dmitrii Podoprikhin
Committer:Facebook Github Bot

Added functionality that allows users to store huge blobs Summary: Added functionality that allows users to store huge blobs of any type, not only Tensors. The blob has to be divided into chunks in the same way as a Tensor blob. Reviewed By: kennyhorror Differential Revision: D5432762 fbshipit-source-id: c171faacd99d209bfae6f9707ebde7c4e23ba3b9
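A rough Python sketch of the general idea of splitting a large serialized blob into offset-tagged chunks and reassembling it. The function names and the (offset, bytes) chunk format are hypothetical; the actual Caffe2 chunking works on BlobProto/TensorProto messages and may differ in detail.

```python
from typing import Iterator, List, Tuple


def split_into_chunks(payload: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs that cover the payload in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(payload), chunk_size):
        yield offset, payload[offset:offset + chunk_size]


def reassemble(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Rebuild the payload; chunks may arrive in any order."""
    parts = []
    expected = 0
    for offset, data in sorted(chunks, key=lambda chunk: chunk[0]):
        # A gap means a chunk was lost or never written.
        if offset != expected:
            raise ValueError(f"missing data at offset {expected}")
        parts.append(data)
        expected += len(data)
    return b"".join(parts)
```

Tagging each chunk with its offset lets chunks be stored as separate records and loaded in any order, which is the property that makes very large blobs practical to serialize.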

Commit:7d48274
Author:Alisson Gusatti Azzolini
Committer:Facebook Github Bot

Allow tasks/execution_steps to be cloned at runtime Summary: Advantages of cloning the tasks/execution_steps at runtime: - Less complexity on the python side: no need to clone nets and add prefixes to blob names - Faster start-up: we had cases of complex plans that took up to 30min to be created. - Better isolation: each task cloned at runtime has its own child workspace, preventing false sharing of blobs. - Opens up possibility for dynamic scheduling: Number of threads per task can be increased on the fly, at runtime. Reviewed By: dzhulgakov Differential Revision: D5100730 fbshipit-source-id: 71b83193b135da4e6eaf2536d8fc266528e1fdcc

Commit:83e6a0b
Author:Alexander Sidorov
Committer:Facebook Github Bot

Revert uuid change to OperatorDef protobuf Summary: a few issues: 1. Randomization hurts memoization. 2. Even if we make it non-random, we can get key collisions when loading it back. 3. RNNs use prototxt for the step net, and apparently it's not forward compatible like normal protobuf is. I am thinking of a better, less invasive solution now. Reviewed By: jamesr66a Differential Revision: D5272118 fbshipit-source-id: ab577fad04fbfc632e1fceffa923377a0d3da1be

Commit:2ec294a
Author:haracejacob
Committer:Facebook Github Bot

Fix a few typos and grammar errors in comments Summary: Fix a few typos and grammar errors in comments by using language-check, a Python library. The spell_checker source code is here: https://github.com/17-1-SKKU-OSS/011A/blob/master/spell_checker/spell_checker.py and here is the text file that indicates what should be fixed: https://github.com/17-1-SKKU-OSS/011A/tree/master/spell_checker/fix/caffe2 Closes https://github.com/caffe2/caffe2/pull/719 Differential Revision: D5165118 Pulled By: aaronmarkham fbshipit-source-id: 7fb8ef7a99d03cd5fd2f9ebdb01b9865e90fc37b

Commit:eebda50
Author:Alexander Sidorov
Committer:Facebook Github Bot

Operator python traceback Summary: This is going to show a Python Caffe2 user where a failed operator was created. The motivation for not putting this information directly in the protobuf is to avoid making it too verbose and to keep the ability to read a net's protobufs after a simple print() call. Reviewed By: jamesr66a Differential Revision: D5226047 fbshipit-source-id: 7edfe850e05a2ec209577142aa3368664a57a108

Commit:3b0069a
Author:Mohammad Hossain
Committer:Yangqing Jia

Expose operator execution statistics to the Python frontend. Summary: To expose operator execution statistics in Python, the profiling measurements collected in the ProfDAGNet class are leveraged. In the current implementation, a new operator is defined that outputs the statistics data in a protobuf message. In the frontend, OperatorStatsContainer works as a wrapper to print ProfDAGNet statistics. Differential Revision: D4923009 fbshipit-source-id: 18a6d76a405ef277a3fca7a312609051cf943207

Commit:39fa092
Author:Fei Sun
Committer:Facebook Github Bot

Constant strings are generated from Protobuf instead of Thrift Summary: To make the predictor open source, move the constants that are generated from Thrift to Protobuf. Reviewed By: salexspb Differential Revision: D4656884 fbshipit-source-id: d4dbb3416e8396185e0981fcd9a090fbb054a18a

Commit:834142b
Author:Fei Sun
Committer:Facebook Github Bot

Change the predictor to use Protobuf Reviewed By: salexspb Differential Revision: D4644798 fbshipit-source-id: 0cf96dfc9061f87978a57d2fedcfe4a0bb012405

Commit:41a3ec2
Author:Bram Wasti
Committer:Facebook Github Bot

QTensor serialization/deserialization Summary: Added protobuf style serialization/deserialization w/o chunking for qtensors Reviewed By: salexspb Differential Revision: D4622677 fbshipit-source-id: 1f845ad773a61b7ae2c362ec31d8de04e4217f68

Commit:8dff5a8
Author:Fei Sun
Committer:Facebook Github Bot

Change the type of content in BlobProto from string to bytes Summary: We are converting MetaNetDef from Thrift to protobuf. The protobuf uses binary encoding. Since bytes is a superset of string, change the field to bytes so that no warning is generated when compiling caffe2. Reviewed By: Yangqing Differential Revision: D4635581 fbshipit-source-id: 916b799e1fb9466658e1dd198bfb5c6928f22488

Commit:0293790
Author:Aapo Kyrola
Committer:Facebook Github Bot

add inference for gradient ops + a couple of missing shape inference functions + fix to scalars Summary: A bit too much stuff in one diff, sorry: 1. Add inference for gradient types by using the fact that x_grad is the gradient of x and must have the same shape. Relying on string matching is somewhat awkward, but in addition I rely on the operator actually being a gradient op. 2. dzhulgakov was right: scalar shape is () and not (1). Sorry, my earlier claim was #fakenews. 3. Added inference functions for MakeTwoClass, MomentumSGDUpdate and cross-entropy ops. Reviewed By: dzhulgakov Differential Revision: D4569758 fbshipit-source-id: 0db13f33819777fdddefe21d4b1ebf906fcaf98c
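A minimal Python sketch of the name-based gradient shape rule from point 1, assuming gradient blobs follow an "x" / "x_grad" naming convention. infer_gradient_shapes is a hypothetical helper, not the Caffe2 schema API, and it does not model the additional check that the operator is actually a gradient op. The example also uses () as the shape of a scalar, per point 2.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

# Assumed naming convention: the gradient of blob "x" is named "x_grad".
GRAD_SUFFIX = "_grad"


def infer_gradient_shapes(outputs: List[str],
                          known: Dict[str, Shape]) -> Dict[str, Shape]:
    """Give each "<x>_grad" output the already-inferred shape of "<x>"."""
    inferred: Dict[str, Shape] = {}
    for name in outputs:
        if name.endswith(GRAD_SUFFIX):
            base = name[: -len(GRAD_SUFFIX)]
            if base in known:
                inferred[name] = known[base]
    return inferred


# A scalar has the empty shape (), not (1,).
shapes = {"fc_w": (10, 5), "loss": ()}
print(infer_gradient_shapes(["fc_w_grad", "loss_grad", "unrelated"], shapes))
# {'fc_w_grad': (10, 5), 'loss_grad': ()}
```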

Commit:8fa156d
Author:Alisson Gusatti Azzolini
Committer:Facebook Github Bot

Improve "reporter net" design Summary: Previously we had several limitations for a reporter net: - it needed to be a net, not an execution step - only one was allowed per execution step, with a single interval. Now, "reporter nets" become reporter steps, and multiple of them can be specified with different timeouts. Reviewed By: dzhulgakov Differential Revision: D4583686 fbshipit-source-id: ad7266e16f96e7829fd24dcc1f165f39e9db573d

Commit:dcefc74
Author:Aapo Kyrola
Committer:Facebook Github Bot

Shape and Type Inference Part1 Summary: This is a rather large diff, sorry about that. It includes basic shape and type inference functionality, based on YQ's Schema scaffolding. I added some helper functions to make it easier to write simple translations. Bigger refactoring was needed for ConvPoolBase so that we could use the shape inference already there in the schema. I annotated enough operators to be able to infer the forward pass of shapes for a basic convnet, and added a test for that. I intend to bootcamp some annotations and annotate enough to handle ResNets fully. Need to think about gradients, and whether they could be annotated in an easier way. Only shapes are exposed to Python for now; types will follow later. Also, the inference is not yet called anywhere except the unit test. I am also not sure everything is in the best location in the code, but it shouldn't be hard to move stuff around. Reviewed By: dzhulgakov Differential Revision: D4436818 fbshipit-source-id: eebee5937ccc9ac09c245465302388a1fae6933c
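A sketch of the kind of forward shape arithmetic such inference performs for a basic NCHW convnet, assuming square kernels, symmetric padding, no dilation and floor rounding (legacy pooling modes that round up are not modeled). conv_shape and pool_shape are hypothetical helpers, not the Caffe2 schema functions.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]  # (N, C, H, W)


def _spatial_out(size: int, kernel: int, stride: int, pad: int) -> int:
    # Floor rounding; assumes symmetric padding and no dilation.
    return (size + 2 * pad - kernel) // stride + 1


def conv_shape(x: Shape, num_filters: int, kernel: int,
               stride: int = 1, pad: int = 0) -> Shape:
    # Convolution replaces the channel count with the number of filters.
    n, _c, h, w = x
    return (n, num_filters,
            _spatial_out(h, kernel, stride, pad),
            _spatial_out(w, kernel, stride, pad))


def pool_shape(x: Shape, kernel: int, stride: int, pad: int = 0) -> Shape:
    # Pooling keeps the channel count and only shrinks H and W.
    n, c, h, w = x
    return (n, c,
            _spatial_out(h, kernel, stride, pad),
            _spatial_out(w, kernel, stride, pad))


# Forward pass of shapes for a small LeNet-style net on a 28x28 input.
x = (1, 1, 28, 28)
x = conv_shape(x, num_filters=20, kernel=5)   # (1, 20, 24, 24)
x = pool_shape(x, kernel=2, stride=2)         # (1, 20, 12, 12)
x = conv_shape(x, num_filters=50, kernel=5)   # (1, 50, 8, 8)
x = pool_shape(x, kernel=2, stride=2)         # (1, 50, 4, 4)
print(x)
```

Shapes are propagated operator by operator from the input, which is why annotating each operator in the basic convnet is enough to infer the whole forward pass.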

Commit:2790043
Author:Yangqing Jia
Committer:Bram Wasti

Add the MKLDNN type to the tensor type strings and added proper docs. Summary: TSIA Reviewed By: dzhulgakov Differential Revision: D4217541 fbshipit-source-id: f68d1aba9c20af0fb0aed2cc1b2099961f6fa7a4

Commit:5893989
Author:Yangqing Jia

fbsync at f5a877

Commit:238ceab
Author:Yangqing Jia
Committer:Yangqing Jia

fbsync. TODO: check if build files need update.

Commit:c15e45c
Author:Yangqing Jia

chunky sync again

Commit:3c98934
Author:Yangqing Jia

caffe translator with added back legacy pooling support

Commit:bcea409
Author:Yangqing Jia
Committer:Yangqing Jia

sync

Commit:f01f206
Author:Yangqing Jia

bring up caffe.proto to master

Commit:09bed67
Author:Yangqing Jia

add untracked files

Commit:6463eeb
Author:Yangqing Jia

chunky sync - build scripts to be written

Commit:559053d
Author:Yangqing Jia

chunky sync

Commit:ecd9507
Author:Yangqing Jia

minor changes

Commit:fa59b90
Author:Yangqing Jia

misc updates

Commit:6bfb300
Author:Yangqing Jia

deprecate legacy pooling

Commit:0546578
Author:Yangqing Jia

optionally use protobuf lite

Commit:98c5b86
Author:Yangqing Jia

A few changes: (1) cudnn for conv (2) cublas: after going through the work I feel it's better to use HOST pointer mode, so changed it. (3) storage order: although GoogLeNet and MultiBox use NHWC, it seems better to still use NCHW as the default to be consistent with Caffe and cuDNN; moved to NCHW as default.
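The storage order decides how a 4-D (n, c, h, w) index maps to a flat memory offset. A minimal sketch of the two layouts mentioned above (the function names are illustrative only):

```python
def offset_nchw(n: int, c: int, h: int, w: int, C: int, H: int, W: int) -> int:
    # W varies fastest, then H, then C, then N: each channel is a full H*W plane.
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n: int, c: int, h: int, w: int, C: int, H: int, W: int) -> int:
    # C varies fastest: all channels of one pixel are contiguous.
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 4, 5
print(offset_nchw(0, 1, 0, 0, C, H, W))  # 20: skips one whole 4x5 plane
print(offset_nhwc(0, 1, 0, 0, C, H, W))  # 1: next channel is the next element
```

Both functions enumerate every offset exactly once; they differ only in which dimension is contiguous, which is what makes per-channel planes (NCHW) or per-pixel channel vectors (NHWC) cheap to access.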

Commit:648d1b1
Author:Yangqing Jia

A consolidation of a couple of random weekend projects. (1) various bugfixes. (2) Tensor is now a class independent of its data type. This makes type-independent operators easier to write. (3) code conventions change a bit: dtype -> T, Tensor<*Context> -> Tensor* alias. (4) ParallelNet -> DAGNet to be more consistent with what it does. (5) Caffe's own flags library instead of gflags. (6) Caffe's own logging library instead of glog, but glog can be chosen with the compile-time definition -DCAFFE2_USE_GOOGLE_GLOG. As a result, glog macros like CHECK, DCHECK now have the prefix CAFFE_, and LOG(*) now becomes CAFFE_LOG_*. (7) an optional protobuf inclusion, which can be chosen with USE_SYSTEM_PROTOBUF in build_env.py.

Commit:47c70a4
Author:Yangqing Jia
Committer:Yangqing Jia

(1) minidb bugfix (2) blob serialization comments (3) cudnn: putting it under a separate device name so we can explicitly choose cudnn instead of having CUDA device prioritizing it. (4) note that mint is not available with ipython due to zeromq conflict (5) db_throughput utility (6) added gprofiler

Commit:036229c
Author:Yangqing Jia

[style] Finishing name changes for the rest of the fields in the protobuf.

Commit:c107740
Author:Yangqing Jia
Committer:Yangqing Jia

[style] Massive name change from plural to singular. This one changes operator’s “inputs” and “outputs” to “input” and “output”, and "args" to "arg".

Commit:2abd5e2
Author:Yangqing Jia

GoogLeNet adaptation - added yet another legacy padding support.

Commit:2ed1077
Author:Yangqing Jia

A clean init for Caffe2, removing my earlier hacky commits.