The Protocol Buffers files changed in these 34 commits:
| Commit: | 2916ed6 | |
|---|---|---|
| Author: | James | |
Delete old unused files
The documentation is generated from this commit.
| Commit: | 728c1a2 | |
|---|---|---|
| Author: | James | |
- Better usability: add paths automatically at container start, better rlscope_help documentation
- Naming updates: iml -> rlscope
| Commit: | 0f6a209 | |
|---|---|---|
| Author: | James | |
In the midst of renaming 'iml' -> 'rls'
| Commit: | 573ee04 | |
|---|---|---|
| Author: | James | |
In the midst of renaming 'iml' -> 'rls'
| Commit: | f67a96c | |
|---|---|---|
| Author: | James | |
In the midst of renaming 'iml' -> 'rls'
| Commit: | 605d593 | |
|---|---|---|
| Author: | James | |
In the midst of renaming 'iml' -> 'rls'
| Commit: | 60103c8 | |
|---|---|---|
| Author: | James | |
Start using "common_util" and "range_sampling" libraries; delete duplicate files. cmake is confusing as hell... lessons learned:
- don't try to use "aliases" for library targets since find_package only appears to run once.
- store *_LIBRARY and *_INCLUDE_DIRS in cached variables so they are defined on subsequent builds / cmake configures.
| Commit: | 5cb8cca | |
|---|---|---|
| Author: | James | |
Update cmake / setup.sh build for sanity; docker builds and installs everything into local.docker (host uses local.host). Import range_profiling and common_util library code "as-is" from my custom CUPTI-Samples repo; need to integrate/use it within RLScope project.
| Commit: | 11e7052 | |
|---|---|---|
| Author: | James Gleeson | |
Trying to get minigo from mlperf_training repo running again (branch=iml). Realizing that in order to trace and analyze multiple generations, we need to support the same "training phase" and "process name" being launched multiple times. To do that, we need some notion of "phase_ident" and "process_ident" that is unique (even if phase_name and process_name are identical). Otherwise it becomes difficult to know how to "group" events within phases/processes at analysis time.
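
The idea of unique identifiers for repeated launches can be illustrated with a minimal sketch. Only the names `process_ident` and `process_name` come from the commit message; the `IdentAllocator` class, the `selfplay_worker` process name, and the event dictionaries are hypothetical and are not taken from the project's code.

```python
import itertools
from collections import defaultdict


class IdentAllocator:
    """Hands out a unique integer ident per launch, even when names repeat.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._next = itertools.count()
        self.names = {}  # ident -> human-readable name

    def new_ident(self, name):
        ident = next(self._next)
        self.names[ident] = name
        return ident


process_idents = IdentAllocator()

# Two generations launch a process with the same (hypothetical) name.
gen0 = process_idents.new_ident("selfplay_worker")
gen1 = process_idents.new_ident("selfplay_worker")

events = [
    {"process_ident": gen0, "name": "q_forward"},
    {"process_ident": gen1, "name": "q_forward"},
    {"process_ident": gen1, "name": "q_backward"},
]

# Grouping by ident keeps the two launches apart; grouping by name would merge them.
by_process = defaultdict(list)
for event in events:
    by_process[event["process_ident"]].append(event["name"])
```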
| Commit: | a766ade | |
|---|---|---|
| Author: | James Gleeson | |
Record the number of overhead events so we can subtract them later on. Was attempting to implement subtraction "later on" by modifying the venn_js files directly, but realized that this is a "hacky"/error-prone way to implement handling profiling overhead.

Next steps:
- "inject" profiling overhead, and add a new CPU category CATEGORY_PROFILING_OVERHEAD.
- Need to think about how to handle this in our plots; ideally we will want to "subtract it entirely"... won't we end up with the same problem? A: No; we can handle it during the overlap computation, by doing:
  - overlap[overhead, CPU, GPU] -> overlap[GPU]
  - overlap[overhead, CPU] -> None
- Need to parallelize overlap computation and reduce memory usage by performing it in "chunks" that we grab from the SQL database. We can have a pool of worker threads that run a chunk at a time.
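
The two remapping rules in this commit message can be sketched as a small function over an already computed overlap table. This is an illustration only, not the project's implementation: the function name `subtract_overhead` and the representation of overlap regions as frozensets mapped to microseconds are assumptions, and the sketch generalizes the two stated rules by dropping both the overhead and CPU categories from any region that contains overhead.

```python
from collections import defaultdict

OVERHEAD = "CATEGORY_PROFILING_OVERHEAD"


def subtract_overhead(overlap):
    """Remap overlap regions so profiling overhead hides the CPU time it overlaps.

    `overlap` maps a frozenset of category names to a duration in usec
    (an assumed representation for this sketch).
    """
    result = defaultdict(float)
    for categories, usec in overlap.items():
        if OVERHEAD in categories:
            # overlap[overhead, CPU, GPU] -> overlap[GPU]
            # overlap[overhead, CPU]      -> None (dropped entirely)
            categories = categories - {OVERHEAD, "CPU"}
        if categories:
            result[categories] += usec
    return dict(result)


overlap = {
    frozenset({OVERHEAD, "CPU", "GPU"}): 10.0,
    frozenset({OVERHEAD, "CPU"}): 5.0,
    frozenset({"CPU", "GPU"}): 3.0,
    frozenset({"GPU"}): 2.0,
}
adjusted = subtract_overhead(overlap)
```

Doing the remapping on the overlap table keeps the raw events untouched, so the same trace can be plotted with or without the correction.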
| Commit: | 87b6837 | |
|---|---|---|
| Author: | James Gleeson | |
Handle python profiling overhead. Looks like there's still "inconsistent bias" when handling CUPTI overhead; need to investigate whether this is due to CUDA API-calls "becoming slower" over time.
| Commit: | adf7dab | |
|---|---|---|
| Author: | James Gleeson | |
Trace GPU kernel start/end timestamps.

    $ iml-prof --cuda-activities python train.py

Trace CUDA API start/end timestamps.

    $ iml-prof --cuda-api-events python train.py
| Commit: | 2049e38 | |
|---|---|---|
| Author: | James Gleeson | |
Import some code for recording GPU start/end timestamps and dumping it async; doesn't compile, needs refactoring first.
| Commit: | 99e51a6 | |
|---|---|---|
| Author: | James | |
Revert "In the midst of porting over remaining TensorFlow GPU event and"

This reverts commit 0bd126b03f87f4291952711f3576943e060e21c4.
| Commit: | 0bd126b | |
|---|---|---|
| Author: | James Gleeson | |
In the midst of porting over remaining TensorFlow GPU event and CPU-side-event tracing code. NOTE: does not compile.
| Commit: | 18ca4fd | |
|---|---|---|
| Author: | James Gleeson | |
Found that using LD_PRELOAD to intercept cudaLaunchKernel adds less overhead than using libcupti to intercept it; hard to say what CUDA is doing beneath the covers to add overhead (perhaps correlation ID, or extra memory allocations for tracking things). Used LD_PRELOAD interception of cudaLaunchKernel to collect # of cudaLaunchKernel calls and total usec spent in cudaLaunchKernel. Rough numbers appear to show consistent time spent in cudaLaunchKernel across measured runs (8.5 usec), which suggests we will be able to get meaningful "average overhead" measurements.

    $ iml-prof --cuda-api-profile python train.py

Next steps:
- fuzz CUDA API for other API calls we need to trace
- Port tensorflow protobuf stat-gathering to my tracer
- make a graph of CUDA API times with/without instrumentation (TensorFlow tracing, libcupti) on.
| Commit: | 3ab7584 | |
|---|---|---|
| Author: | James Gleeson | |
Trying out minimal PC sampling calls by printing out PC sampling records.

    $ iml-prof --pc-sampling python train.py

Need to switch development to mel-17 (Quadro P4000) which has PC sampling support. Added some boiler-plate code for recording sampling state once PC sampling is running; haven't tested it at all, and still need to hook it up to the PC sampling cupti callbacks.
| Commit: | 5a0adcf | |
|---|---|---|
| Author: | James Gleeson | |
Enable CUDA API profiling by doing:

    $ iml-prof --cuda-api-calls python train.py

Need to start dumping traces to protobuf files.
| Commit: | 3e508a9 | |
|---|---|---|
| Author: | James Gleeson | |
Get wrapper library compiling. Added backward-cpp stack-trace library to help debug failed runtime checks. Need to debug api stat collection and printing.
| Commit: | 7290468 | |
|---|---|---|
| Author: | James Gleeson | |
- Account for and subtract pyprof overhead. If we run with pyprof enabled, and subtract pyprof overhead, our estimate is 2.5% less than the uninstrumented training time. Still need to check if this 2.5% is within standard error of measurement, but this is promising.
- Next: record libcupti profiling overhead numbers, and see if it accounts for the 40% in the PPT slides.
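
The overhead subtraction described here amounts to simple arithmetic, sketched below. The function names and every number in the example are made up for illustration; they were chosen only so that the result lands near the 2.5% figure quoted in the commit message, and they are not measurements from the project.

```python
def corrected_time_us(instrumented_us, num_events, overhead_per_event_us):
    """Estimate uninstrumented time by removing an average per-event overhead."""
    return instrumented_us - num_events * overhead_per_event_us


def percent_error(estimate_us, uninstrumented_us):
    """Signed error of the estimate relative to a real uninstrumented run."""
    return 100.0 * (estimate_us - uninstrumented_us) / uninstrumented_us


# Hypothetical inputs (NOT project data): total run times in usec, the number
# of profiler events recorded, and the average overhead of one event.
uninstrumented_us = 100_000_000
instrumented_us = 130_000_000
estimate_us = corrected_time_us(
    instrumented_us, num_events=1_000_000, overhead_per_event_us=32.5
)
error_pct = percent_error(estimate_us, uninstrumented_us)
print(f"estimate is off by {error_pct:+.1f}% from the uninstrumented run")
```

A negative error means the correction removed slightly more time than the profiler actually added, which is the situation the commit message describes.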
| Commit: | b9f5a9b | |
|---|---|---|
| Author: | James Gleeson | |
- Collect profiling overhead plot for all the OpenAI workloads; it's high across the board (~ 200%).
- Next steps: dig into what part of IML is causing the profiling overhead by incrementally disabling features.
| Commit: | 70eec3b | |
|---|---|---|
| Author: | James Gleeson | |
Was in the process of getting train_minigo.sh working (not quite there yet; the early agent is struggling to play games during loop_train_eval since it keeps resigning during all the games it plays). Context switch: trying to debug issues Sri encounters on long training runs.
| Commit: | a7ff270 | |
|---|---|---|
| Author: | James Gleeson | |
- Working on collecting CPU/GPU memory utilization
| Commit: | e65dfe5 | |
|---|---|---|
| Author: | James Gleeson | |
- In progress: working on adding multi-machine support, but still need to finish the iml-analyze part, since that part generally assumed a single machine.
| Commit: | 232b273 | |
|---|---|---|
| Author: | James Gleeson | |
Add support for running on AMD ROCm GPUs. Dockerized all the ROCm stuff as well. Am a bit foggy where I left off on this though...
| Commit: | 1ddf729 | |
|---|---|---|
| Author: | James Gleeson | |
Create a .whl package using setup.py. Still need to do this for iml-drill.

    import iml_profiler
    # All APIs for profiling are accessed @ iml_profiler.*
    # e.g.
    with iml_profiler.prof.operation('func'):
        ...
| Commit: | 185f340 | |
|---|---|---|
| Author: | James Gleeson | |
Create basic unit-test: Atari pong. Still need to make one for minigo.
| Commit: | 76f3352 | |
|---|---|---|
| Author: | James Gleeson | |
- Create script for sampling CPU/GPU utilization.
- Working on figuring out how to use matplotlib to generate "heat-scale" from utilization data. Figured out how to generate heat-scale squares; however it looks hard to precisely generate pixel-to-second accuracy with matplotlib, so we may want to use html/css instead (which isn't actually that hard I don't think).
| Commit: | 3011521 | |
|---|---|---|
| Author: | James Gleeson | |
Debugging stuff / adding utilization metrics
| Commit: | 6fca21e | |
|---|---|---|
| Author: | James Gleeson | |
In the midst of adding phases.
| Commit: | 8346706 | |
|---|---|---|
| Author: | James Gleeson | |
Store tracing results into an SQLite database. This is in preparation for supporting tracing multiple processes. Need to re-write some plot generation code to read from SQLite files instead of raw protobuf files, and separate data on a per-operation basis based on overlap with "Operation" events (e.g. q_forward).
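
Separating events per operation by their overlap with "Operation" events can be sketched with Python's built-in `sqlite3` module. The `Event` table, its columns, and all rows except the `q_forward` operation name are hypothetical; the project's actual schema is not shown in this commit log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per traced event, timestamps in usec.
conn.execute(
    """CREATE TABLE Event (
           event_name TEXT,
           category   TEXT,
           start_us   INTEGER,
           end_us     INTEGER)"""
)
conn.executemany(
    "INSERT INTO Event VALUES (?, ?, ?, ?)",
    [
        ("q_forward", "Operation", 0, 100),
        ("cudaLaunchKernel", "CUDA API", 10, 20),
        ("kernel", "GPU", 15, 60),
        ("other_kernel", "GPU", 150, 170),  # outside q_forward
    ],
)


def events_in_operation(conn, op_name):
    """Return names of non-Operation events whose time range overlaps `op_name`."""
    rows = conn.execute(
        """SELECT e.event_name
             FROM Event AS e
             JOIN Event AS op
               ON op.category = 'Operation'
              AND op.event_name = ?
              AND e.category != 'Operation'
              AND e.start_us < op.end_us
              AND op.start_us < e.end_us
            ORDER BY e.start_us""",
        (op_name,),
    ).fetchall()
    return [name for (name,) in rows]


q_forward_events = events_in_operation(conn, "q_forward")
```

Letting the database do the interval join means plot code only needs the operation name, rather than loading every raw trace file.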
| Commit: | daf92bb | |
|---|---|---|
| Author: | James Gleeson | |
Working on benchmarking minigo; need to support tracing multiple concurrent python scripts. Need to remove the step-centric idempotent-centric code.
| Commit: | d658c79 | |
|---|---|---|
| Author: | James Gleeson | |
Add pyprof events to timeline. Clearly they are off by ~ 1 second from the GPU/CUDA events. Still not sure why... going to look into tensorflow code and whether it does anything to the raw timestamps (e.g. adjusting them) before outputting them for tfprof.
| Commit: | fdbd0a4 | |
|---|---|---|
| Author: | James Gleeson | |
Create overlap plot that factors in overlap between CPU/GPU execution.
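
One common way to compute the data behind such an overlap plot is a sweep over event start/end points, sketched below. This is an assumed approach for illustration, not necessarily what the project does; the function name `compute_overlap` and the `(category, start, end)` tuple format are hypothetical.

```python
from collections import defaultdict


def compute_overlap(events):
    """Total time spent in each combination of concurrently active categories.

    `events` is a list of (category, start, end) tuples; the result maps a
    frozenset of categories to the summed duration where exactly those
    categories were active.
    """
    points = []
    for category, start, end in events:
        points.append((start, +1, category))
        points.append((end, -1, category))
    points.sort(key=lambda p: p[0])

    active = defaultdict(int)   # category -> number of currently open events
    overlap = defaultdict(int)  # frozenset(categories) -> total duration
    prev_time = None
    for time, delta, category in points:
        if prev_time is not None and time > prev_time:
            current = frozenset(c for c, n in active.items() if n > 0)
            if current:
                overlap[current] += time - prev_time
        active[category] += delta
        prev_time = time
    return dict(overlap)


result = compute_overlap([("CPU", 0, 10), ("GPU", 5, 15)])
```

In this example the CPU runs alone for the first 5 time units, both run together for the next 5, and the GPU runs alone for the last 5.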