Proto commits in UofT-EcoSystem/rlscope

The Protocol Buffers files changed in these 34 commits:

Commit:2916ed6
Author:James

Delete old unused files

The documentation is generated from this commit.

Commit:728c1a2
Author:James

- Better usability: add paths automatically at container start, better rlscope_help documentation
- Naming updates: iml -> rlscope

Commit:0f6a209
Author:James

In the midst of renaming 'iml' -> 'rls'

Commit:573ee04
Author:James

In the midst of renaming 'iml' -> 'rls'

Commit:f67a96c
Author:James

In the midst of renaming 'iml' -> 'rls'

Commit:605d593
Author:James

In the midst of renaming 'iml' -> 'rls'

Commit:60103c8
Author:James

Start using "common_util" and "range_sampling" libraries; delete duplicate files.
cmake is confusing as hell... lessons learned:
- don't try to use "aliases" for library targets, since find_package only appears to run once.
- store *_LIBRARY and *_INCLUDE_DIRS in cached variables so they are defined on subsequent builds / cmake configures.

Commit:5cb8cca
Author:James

Update cmake / setup.sh build for sanity; docker builds and installs everything into local.docker (host uses local.host). Import range_profiling and common_util library code "as-is" from my custom CUPTI-Samples repo; need to integrate/use it within the RLScope project.

Commit:11e7052
Author:James Gleeson

Trying to get minigo from mlperf_training repo running again (branch=iml). Realizing that in order to trace and analyze multiple generations, we need to support the same "training phase" and "process name" being launched multiple times. To do that, we need some notion of "phase_ident" and "process_ident" that is unique (even if phase_name and process_name are identical). Otherwise it becomes difficult to know how to "group" events within phases/processes at analysis time.
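The grouping problem described above can be sketched as follows. This is only an illustration under stated assumptions: the counter-based identifier scheme, the record layout, and the helper names launch and group_events are hypothetical and are not taken from the RLScope code.

```python
from collections import defaultdict
from itertools import count

# Hypothetical counter; the commit does not say how identifiers are assigned.
_next_ident = count(1)

def launch(process_name, phase_name):
    """Return a record for one launch, with identifiers unique per launch."""
    ident = next(_next_ident)
    return {
        "process_name": process_name,
        "phase_name": phase_name,
        "process_ident": f"{process_name}#{ident}",
        "phase_ident": f"{phase_name}#{ident}",
    }

def group_events(events, key):
    """Group (launch_record, event) pairs by one field of the launch record."""
    groups = defaultdict(list)
    for launch_record, event in events:
        groups[launch_record[key]].append(event)
    return dict(groups)

# Two generations reuse the same process and phase names (names are made up).
gen0 = launch("selfplay_worker", "selfplay")
gen1 = launch("selfplay_worker", "selfplay")
events = [(gen0, "e0"), (gen1, "e1"), (gen0, "e2")]

by_name = group_events(events, "process_name")    # generations are merged
by_ident = group_events(events, "process_ident")  # generations stay separate
```

Grouping by name collapses both generations into one bucket, while grouping by the per-launch identifier keeps them apart, which is the property the analysis needs.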

Commit:a766ade
Author:James Gleeson

Record the number of overhead events so we can subtract them later on. Was attempting to implement subtraction "later on" by modifying the venn_js files directly, but realized that this is a "hacky"/error-prone way to implement handling of profiling overhead.
Next steps:
- "inject" profiling overhead, and add a new CPU category CATEGORY_PROFILING_OVERHEAD.
- Need to think about how to handle this in our plots; ideally we will want to "subtract it entirely"... won't we end up with the same problem?
  A: No; we can handle it during the overlap computation, by doing:
    overlap[overhead, CPU, GPU] -> overlap[GPU]
    overlap[overhead, CPU] -> None
- Need to parallelize overlap computation and reduce memory usage by performing it in "chunks" that we grab from the SQL database. We can have a pool of worker threads that run a chunk at a time.
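The two overlap rules in this commit message could be applied roughly as below. This is a sketch, not the RLScope implementation: representing an overlap as a dict keyed by a frozenset of category names is an assumption, the string "overhead" stands in for CATEGORY_PROFILING_OVERHEAD, and dropping a region that contains only overhead is an extra assumption the commit does not spell out.

```python
OVERHEAD = "overhead"  # stand-in for CATEGORY_PROFILING_OVERHEAD (assumed name)

def subtract_overhead(overlap):
    """Remove profiling overhead from an overlap breakdown.

    overlap maps frozenset(categories) -> total time in usec.
    Rules from the commit message:
      overlap[overhead, CPU, GPU] -> overlap[GPU]
      overlap[overhead, CPU]      -> None (dropped)
    """
    result = {}
    for categories, usec in overlap.items():
        if OVERHEAD in categories:
            # CPU time under overhead is profiler time, so drop both labels.
            remaining = categories - {OVERHEAD, "CPU"}
            if not remaining:
                continue  # nothing but profiler time; assumed to be dropped
            categories = frozenset(remaining)
        result[categories] = result.get(categories, 0) + usec
    return result

# Illustrative input only; the durations are made up.
overlap = {
    frozenset({"CPU"}): 10,
    frozenset({"CPU", "GPU"}): 5,
    frozenset({"overhead", "CPU", "GPU"}): 3,
    frozenset({"overhead", "CPU"}): 2,
    frozenset({"GPU"}): 4,
}
corrected = subtract_overhead(overlap)
```

Because the correction happens on the overlap keys rather than on the plot files, regions where the GPU was busy during profiler work are credited to the GPU alone instead of being lost.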

Commit:87b6837
Author:James Gleeson

Handle python profiling overhead. Looks like there's still "inconsistent bias" when handling CUPTI overhead; need to investigate whether this is due to CUDA API-calls "becoming slower" over time.

Commit:adf7dab
Author:James Gleeson

Trace GPU kernel start/end timestamps.
$ iml-prof --cuda-activities python train.py
Trace CUDA API start/end timestamps.
$ iml-prof --cuda-api-events python train.py

Commit:2049e38
Author:James Gleeson

Import some code for recording GPU start/end timestamps and dumping it async; doesn't compile, needs refactoring first.

Commit:99e51a6
Author:James

Revert "In the midst of porting over remaining TensorFlow GPU event and" This reverts commit 0bd126b03f87f4291952711f3576943e060e21c4.

Commit:0bd126b
Author:James Gleeson

In the midst of porting over remaining TensorFlow GPU event and CPU-side-event tracing code. NOTE: does not compile.

Commit:18ca4fd
Author:James Gleeson

Found that using LD_PRELOAD to intercept cudaLaunchKernel adds less overhead than using libcupti to intercept it; hard to say what CUDA is doing beneath the covers to add overhead (perhaps correlation ID, or extra memory allocations for tracking things).
Used LD_PRELOAD interception of cudaLaunchKernel to collect # of cudaLaunchKernel calls and total usec spent in cudaLaunchKernel. Rough numbers appear to show consistent time spent in cudaLaunchKernel across measured runs (8.5 usec), which suggests we will be able to get meaningful "average overhead" measurements.
$ iml-prof --cuda-api-profile python train.py
Next steps:
- fuzz CUDA API for other API calls we need to trace
- Port tensorflow protobuf stat-gathering to my tracer
- make a graph of CUDA API times with/without instrumentation (TensorFlow tracing, libcupti) on.

Commit:3ab7584
Author:James Gleeson

Trying out minimal PC sampling calls by printing out PC sampling records.
$ iml-prof --pc-sampling python train.py
Need to switch development to mel-17 (Quadro P4000) which has PC sampling support. Added some boiler-plate code for recording sampling state once PC sampling is running; haven't tested it at all, and still need to hook it up to the PC sampling cupti callbacks.

Commit:5a0adcf
Author:James Gleeson

Enable CUDA API profiling by doing:
$ iml-prof --cuda-api-calls python train.py
Need to start dumping traces to protobuf files.

Commit:3e508a9
Author:James Gleeson

Get wrapper library compiling. Added backward-cpp stack-trace library to help debug failed runtime checks. Need to debug api stat collection and printing.

Commit:7290468
Author:James Gleeson

- Account for and subtract pyprof overhead. If we run with pyprof enabled and subtract pyprof overhead, our estimate is 2.5% less than the uninstrumented training time. Still need to check if this 2.5% is within standard error of measurement, but this is promising.
- Next: record libcupti profiling overhead numbers, and see if it accounts for the 40% in the PPT slides.
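One way to picture the subtraction is the sketch below. It assumes the overhead is modeled as (number of overhead events) x (average overhead per event), which the commit does not confirm; the function names and every number in the example are hypothetical and are not measurements from RLScope.

```python
def corrected_time_sec(instrumented_sec, num_overhead_events, avg_overhead_usec):
    """Estimate uninstrumented time by subtracting modeled profiling overhead.

    Assumes overhead = event count * average per-event overhead.
    """
    overhead_sec = num_overhead_events * avg_overhead_usec / 1e6
    return instrumented_sec - overhead_sec

def relative_error(estimate_sec, uninstrumented_sec):
    """Signed error of the estimate relative to the uninstrumented run."""
    return (estimate_sec - uninstrumented_sec) / uninstrumented_sec

# Made-up inputs for illustration only.
estimate = corrected_time_sec(
    instrumented_sec=120.0,
    num_overhead_events=1_000_000,
    avg_overhead_usec=20.0,
)
error = relative_error(estimate, uninstrumented_sec=102.0)
```

A negative relative error means the corrected estimate undershoots the uninstrumented run, as with the 2.5% figure reported in the commit; whether that gap is meaningful depends on the run-to-run measurement error.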

Commit:b9f5a9b
Author:James Gleeson

- Collect profiling overhead plot for all the OpenAI workloads; it's high across the board (~ 200%).
- Next steps: dig into what part of IML is causing the profiling overhead by incrementally disabling features.

Commit:70eec3b
Author:James Gleeson

Was in the process of getting train_minigo.sh working (not quite there yet; the early agent is struggling to play games during loop_train_eval's since it keeps resigning during all the games it plays). Context switch: trying to debug issues Sri encounters on long training runs.

Commit:a7ff270
Author:James Gleeson

- Working on collecting CPU/GPU memory utilization

Commit:e65dfe5
Author:James Gleeson

- In progress: working on adding multi-machine support, but still need to finish the iml-analyze part, since that part generally assumed a single machine.

Commit:232b273
Author:James Gleeson

Add support for running on AMD ROCm GPUs. Dockerized all the ROCm stuff as well. Am a bit foggy where I left off on this though...

Commit:1ddf729
Author:James Gleeson

Create a .whl package using setup.py. Still need to do this for iml-drill.
import iml_profiler
# All APIs for profiling are accessed @ iml_profiler.*
# e.g.
with iml_profiler.prof.operation('func'):
    ...

Commit:185f340
Author:James Gleeson

Create basic unit-test: Atari pong. Still need to make one for minigo.

Commit:76f3352
Author:James Gleeson

- Create script for sampling CPU/GPU utilization.
- Working on figuring out how to use matplotlib to generate "heat-scale" from utilization data. Figured out how to generate heat-scale squares; however it looks hard to precisely generate pixel-to-second accuracy with matplotlib, so we may want to use html/css instead (which isn't actually that hard I don't think).

Commit:3011521
Author:James Gleeson

Debugging stuff / adding utilization metrics

Commit:6fca21e
Author:James Gleeson

In the midst of adding phases.

Commit:8346706
Author:James Gleeson

Store tracing results into an SQLite database. This is in preparation for supporting tracing multiple processes. Need to re-write some plot generation code to read from SQLite files instead of raw protobuf files, and separate data on a per-operation basis based on overlap with "Operation" events (e.g. q_forward).
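A rough sketch of storing events in SQLite and selecting those that overlap an "Operation" event is shown below. The Event table, its columns, the process name "trainer", and the helper events_within_operation are all hypothetical; the commit does not describe the actual schema. Only the operation name q_forward comes from the commit message.

```python
import sqlite3

# In-memory database with an assumed single-table layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Event (
        event_id      INTEGER PRIMARY KEY,
        process_name  TEXT NOT NULL,
        category      TEXT NOT NULL,
        event_name    TEXT NOT NULL,
        start_time_us INTEGER NOT NULL,
        end_time_us   INTEGER NOT NULL
    )
""")

# Made-up events: one operation, plus CPU/GPU events inside and outside it.
rows = [
    ("trainer", "Operation", "q_forward", 0, 100),
    ("trainer", "CPU", "python", 10, 40),
    ("trainer", "GPU", "kernel", 30, 80),
    ("trainer", "GPU", "kernel", 150, 170),
]
conn.executemany(
    "INSERT INTO Event (process_name, category, event_name,"
    " start_time_us, end_time_us) VALUES (?, ?, ?, ?, ?)",
    rows,
)

def events_within_operation(conn, operation_name):
    """Return non-operation events whose time range overlaps the operation."""
    return conn.execute(
        """
        SELECT e.category, e.event_name, e.start_time_us, e.end_time_us
        FROM Event AS e
        JOIN Event AS op
          ON op.category = 'Operation'
         AND op.event_name = ?
         AND e.process_name = op.process_name
         AND e.start_time_us < op.end_time_us
         AND e.end_time_us > op.start_time_us
        WHERE e.category != 'Operation'
        ORDER BY e.start_time_us
        """,
        (operation_name,),
    ).fetchall()

inside = events_within_operation(conn, "q_forward")
```

Keeping process_name on every row is what lets the same query work once several processes write into one database.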

Commit:daf92bb
Author:James Gleeson

Working on benchmarking minigo; need to support tracing multiple concurrent python scripts. Need to remove the step-centric idempotent-centric code.

Commit:d658c79
Author:James Gleeson

Add pyprof events to timeline. Clearly they are off by ~ 1 second from the GPU/CUDA events. Still not sure why... going to look into tensorflow code and whether it does anything to the raw timestamps (e.g. adjusting them) before outputting them for tfprof.

Commit:fdbd0a4
Author:James Gleeson

Create overlap plot that factors in overlap between CPU/GPU execution.
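The idea of an overlap breakdown can be illustrated with a small sweep over event boundaries. This is a generic sketch rather than the code behind the plot: the (category, start, end) tuple format and the function name overlap_breakdown are assumptions.

```python
def overlap_breakdown(events):
    """Split time into regions by which categories are running.

    events: iterable of (category, start, end) with start <= end.
    Returns {frozenset(categories): total duration}.
    """
    points = []
    for category, start, end in events:
        points.append((start, 1, category))   # category becomes active
        points.append((end, -1, category))    # category becomes inactive
    points.sort(key=lambda p: p[0])

    active = {}      # category -> number of currently open events
    breakdown = {}
    prev_time = None
    for time, delta, category in points:
        if prev_time is not None and time > prev_time:
            running = frozenset(c for c, n in active.items() if n > 0)
            if running:
                breakdown[running] = breakdown.get(running, 0) + (time - prev_time)
        active[category] = active.get(category, 0) + delta
        prev_time = time
    return breakdown

# Made-up events: CPU runs 0-10, GPU runs 5-15.
breakdown = overlap_breakdown([("CPU", 0, 10), ("GPU", 5, 15)])
```

For this input the result separates CPU-only, GPU-only, and CPU+GPU time, which is the split an overlap plot needs instead of two independent totals.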