Proto commits in NVIDIA/cheminformatics

The Protocol Buffers files changed in these 32 commits:

2022-06-06

Commit:	55f7aa8
Author:	Rajesh Ilango	2022-06-06 08:36:20 -0700
Committer:	GitHub	2022-06-06 08:36:20 -0700

Busy indicator and eliminate duplicate records... (#153)
Intermediate changes before changes in the plan. Changes include:
- VR_ISSUE - Busy indicator for long-running actions
- VR_ISSUE - Missing busy indicator during molecule generation
- VR_ISSUE - Remove duplicate SMILES from generated molecules

Commit:	53c489c
Author:	Rajesh Ilango	2022-06-06 08:36:20 -0700
Committer:	GitHub	2022-06-06 08:36:20 -0700

2022-05-21

Commit:	be83bb1
Author:	Rajesh Ilango	2022-05-21 01:55:34 -0700

Intermediate changes

This commit does not contain any .proto files.

Commit:	24b7a8a
Author:	Rajesh Ilango	2022-05-21 01:55:34 -0700

Intermediate changes

This commit does not contain any .proto files.

2022-03-03

Commit:	234bcf9
Author:	Rajesh Ilango	2022-02-28 19:58:02 -0800
Committer:	Rajesh Ilango	2022-03-02 16:58:55 -0800

Initial changes to separate CDDD model from the main application

Commit:	0f9d48d
Author:	Rajesh Ilango	2022-02-28 19:58:02 -0800
Committer:	Rajesh Ilango	2022-03-02 16:58:55 -0800

Initial changes to separate CDDD model from the main application

2022-01-04

Commit:	efca500
Author:	Michelle Gill	2022-01-04 16:27:21 -0500
Committer:	GitHub	2022-01-04 16:27:21 -0500

Merge benchmark speed improvements and plotting code (#119)
* Benchmark component refactoring... Promote cuchem.benchmark submodules as a separate module. With RAPIDS installed in the NEMO inferencing/training container, the changes made for CPU are not tested. Other changes include:
  - Use current user inside containers... With this change the problem with creation of folders with root user and group can be avoided.
  - Additional verbs in launch scripts... Two new verbs are added:
    - launch.sh - 'config': to create default .env file
    - setup/launch - download_model: a new function to allow downloading the model during startup.
  - Code cleanup...
    - Remove unused configuration parameter
    - Move launch.py to __main__.py
    - Add separate directory for log in content path
* Fix issues with bugs introduced during refactoring benchmark
* Update hyperparams
* Reduce splits, note about RF issues
* Cleanup, reverting test values and comments for future changes
* Fix typos
* Add --gpus options on docker command
* ExCAPE data failure on first execution... Due to a difference in the column names between first-time execution and subsequent executions.
* Temporarily move Ranking computation to CPU. The CUDF version of the ranking computation works for up to 1000 molecules, crashes on up to 3000, and freezes above that.
* Bugfix CDDD and main application to address ripple effects from refactoring.
* Physchem datasets updated
* Add index column name
* Bioactivity data added -- fingerprints too large
* Dataset class updates
* Metric code updates
* Run script updates
* Bug fix
* Benchmarking bug
* Bug fix for CDDD after code refactor.
* Change to dev mode to start tests and benchmarks in background
* Additional logs and changes to launch script
* Fix the type conversion issue in NN correlation.
* Plotting
* Additional logging
* More CSV friendly timestamp
* Bugfix in bioactivity
* Create CDDD dataset also
* Remove plotting for now
* Update replace SQL
* Cleanup
* Fix path
* Add test
* Cache embedding from DB into memory.
* BUG FIX: ensure index not included in fingerprints features
* BUG FIX: set data input size appropriately for dataset
* BUG FIX: fix inconsistent indexes
* Ensure CDDD class is importable
* BUG FIX: ensure valid sampled molecules are canonicalized
* Accidentally committed files
* Move UI clustering benchmark
* Move benchmark data prep scripts
* Training data added to SQL after each file
* Move yaml file to config dir
* Minor bugfixes
* Add logging to train debug generation
* Revert one change from merge
* Move config file
* Improve error checking for CSV data
* Sampling metrics validated
* Changes to return best predictions for plotting
* Bug fix: bioactivity fingerprints & dataset loading
* Fix ML model params
* Fix config path in __main__.py
* Update benchmark metrics yaml
* Add seaborn to container
* Changes to accept an inference wrapper outside of codebase.
* Further optimization of benchmark code...
  - Separate out SMILES generation code. The idea is not to generate SMILES during benchmark testing. Almost always, we are generating the SMILES prior to running the benchmark code. Therefore, separating out this locally generated SMILES db will make it easier to automate these steps and keep the code clean.
* Add methods to fetch from local db containing generated molecules.
* Notebook to find the intersection dataset between CDDD training data and benchmarking dataset.
* Include the dataset size for context.
* Remove CDDD train from ZINC15 test split
* Update to list the common SMILES
* Changes to create cache with additional details to reduce the time required to run benchmark runs.
* Normalize data
* Plotting completed
* Nearest neighbor fix
* Configurable normalization of embeddings
* Add the code to compute Validity, Uniqueness and Novelty ratio
* Change to include SMILES from all datasets.
* Changes to improve the queries to allow multiple parameters
* Change to introduce parallel query for generating sample
* Fix bugs
* Cleanup
* Minor bugs
* Bug fix and adjustments for parallelization
* Minor bug
* Changes made for embedding to work
* Fix a mistake in commit.
* Bugfix: Input_size for bioactivity was not used as intended.
* Changes to remove individual databases and start using a common db for all metrics.
* Changes to improve runtime perf.
* Fix review comments.
* BUG FIX: ensure valid sampled molecules are canonicalized
* Move UI clustering benchmark
Co-authored-by: Rajesh K Ilango <rilango@gmail.com>
Co-authored-by: Rajesh Ilango <rilango@nvidia.com>

Commit:	faa3ef6
Author:	Michelle Gill	2022-01-04 16:27:21 -0500
Committer:	GitHub	2022-01-04 16:27:21 -0500

2021-11-09

Commit:	456148a
Author:	Rajesh K Ilango	2021-11-09 12:15:43 -0800

Benchmark component refactoring...
Promote cuchem.benchmark submodules as a separate module. With RAPIDS installed in the NEMO inferencing/training container, the changes made for CPU are not tested. Other changes include:
- Use current user inside containers... With this change the problem with creation of folders with root user and group can be avoided.
- Additional verbs in launch scripts... Two new verbs are added:
  - launch.sh - 'config': to create default .env file
  - setup/launch - download_model: a new function to allow downloading the model during startup.
- Code cleanup...
  - Remove unused configuration parameter
  - Move launch.py to __main__.py
  - Add separate directory for log in content path

Commit:	9b2e1ee
Author:	Rajesh K Ilango	2021-11-09 12:15:43 -0800

2021-09-07

Commit:	0d524e4
Author:	Rajesh Ilango	2021-08-27 10:49:12 -0700
Committer:	Rajesh Ilango	2021-09-07 14:39:23 -0700

Changes to improve runtime performance of benchmark tests.
With this change, intermediate results are stored in a SQLite database. All results from generative models are stored in the SQLite database, which reduces the total number of requests to the MegaMolBART gRPC service by three quarters. Additionally, the training dataset is loaded into the SQLite database and used while computing the Novelty metric. With this change, a request to check Novelty takes 2 ms. Other changes include:
- Upgrade to RAPIDS 2021.06
- Clean Dockerfile to remove all workarounds
- Remove the need for conda cuchem env inside the container.
- Ability to select docker image to build using launch script
- Add smile2embedding and embedding2smile to cddd
- Use hydra for benchmark configuration

The documentation is generated from this commit.

2021-08-30

Commit:	a3cae7e
Author:	Rajesh Ilango	2021-08-27 10:49:12 -0700
Committer:	Rajesh Ilango	2021-08-30 10:06:05 -0700

Commit:	29e8193
Author:	Rajesh Ilango	2021-08-27 10:49:12 -0700
Committer:	Rajesh Ilango	2021-08-30 10:06:05 -0700

2021-08-27

Commit:	b25346e
Author:	Rajesh Ilango	2021-08-27 10:49:12 -0700

2021-08-02

Commit:	be79eb4
Author:	Rajesh Ilango	2021-07-28 01:32:04 -0700
Committer:	Rajesh Ilango	2021-08-02 13:43:11 -0700

Introduce inverse transform.

Commit:	cd4acb1
Author:	Rajesh Ilango	2021-07-28 01:32:04 -0700
Committer:	Rajesh Ilango	2021-08-02 13:43:11 -0700

Introduce inverse transform.

2021-07-13

Commit:	bb24fdb
Author:	Rajesh Ilango	2021-07-13 07:51:25 -0700
Committer:	Rajesh Ilango	2021-07-13 14:16:58 -0700

Bugfix: Molecular properties for generated molecules are always NaN...
This was mainly due to changes in a newer version of RDKit. The module structure has changed. Also, there was no error message captured in the log file when this happened. Other changes include:
- Refactor launch.sh
- Move common code between launch.sh and NGC resource launch to a separate file
- Change local properties file name to .env
- Add NGC documentation files
- Bugfix: Exit if REGISTRY_ACCESS_TOKEN is not set. (Review comment)
- Changes to messages and changes to fail early in launch script.

Commit:	cb589cb
Author:	Rajesh Ilango	2021-07-13 07:51:25 -0700
Committer:	Rajesh Ilango	2021-07-13 14:16:58 -0700

2021-06-22

Commit:	4bffd47
Author:	Michelle Gill	2021-06-22 17:18:32 -0400
Committer:	GitHub	2021-06-22 17:18:32 -0400

Trie and novelty metric (#42)
* Multiprocessing script
* Working processing script
* Update chembldata class
* Novelty metric working
* Metric changes

Commit:	e7abab1
Author:	Michelle Gill	2021-06-22 17:18:32 -0400
Committer:	GitHub	2021-06-22 17:18:32 -0400

Trie and novelty metric (#42)
* Multiprocessing script
* Working processing script
* Update chembldata class
* Novelty metric working
* Metric changes

2021-06-11

Commit:	60dca89
Author:	Michelle Gill	2021-06-11 10:14:55 -0500
Committer:	GitHub	2021-06-11 11:14:55 -0400

WIP: Benchmark framework for MegaMolBART metrics (#40)
* Change to add another method to retrieve embedding.
* working metric benchmark
* metric updates
* drop dupes
* Refine run file
* Benchmark for 10000
* Checkpoint for 350000
* Finished NN 610000 metrics
* Prune data
* Backup data file for unique
* new molecule excluded
* Finished 350000 benchmark
* Another molecule
* Metrics plot
* Another molecule
* Add molecule
* Data for 50000
* Validity 610000
* Update megamolbart radius
* more loaders and result
* Loader and progress
* Add unique results
* cleanup
* Unique 350000
* Bkup files
* Unique 50000
* Unique 610000
* one validity 610000
* Validity
* One val point for 10k
* Final validity point
* Metrics finished
* Update .dockerignore
* Update generativesampler.proto
* Update megamolbart.py
* Update loaders.py
* Update prepare_ChEMBL_approved_drugs_data.py
* Update service.py
* Update fingerprint.py
* Update model.py
* Uses data helper class
* Update plot.py
Co-authored-by: Rajesh Ilango <rilango@nvidia.com>

Commit:	890b367
Author:	Michelle Gill	2021-06-11 10:14:55 -0500
Committer:	GitHub	2021-06-11 11:14:55 -0400

2021-04-30

Commit:	57c5ab8
Author:	RahulBaboota	2021-04-30 10:59:04 -0700

GRPC Service

Commit:	9b0110c
Author:	RahulBaboota	2021-04-30 10:59:04 -0700

GRPC Service

2021-04-29

Commit:	e02bd7f
Author:	Rajesh Ilango	2021-04-29 15:00:17 -0700
Committer:	GitHub	2021-04-29 18:00:17 -0400

Initial refactor for supporting more than one container to work with the application. (#30)
* Initial refactor for supporting more than one container to work with the application.
* Changes to make unit tests and fix for failing tests
* Additional changes for exposing gRPC service.
* Changes to pull values from config file into docker-compose
* Move grpc code into megamolbart.
* Changes to parameterize container name in docker-compose.yml
Co-authored-by: Rajesh Ilango <rilango@nvidia.com>

Commit:	fc42ff8
Author:	Rajesh Ilango	2021-04-29 15:00:17 -0700
Committer:	GitHub	2021-04-29 18:00:17 -0400

2021-04-27

Commit:	e5c141f
Author:	Rajesh Ilango	2021-04-26 21:15:12 -0700

Initial refactor for supporting more than one container to work with the application.

Commit:	b6cc766
Author:	Rajesh Ilango	2021-04-26 21:15:12 -0700

Initial refactor for supporting more than one container to work with the application.

2021-03-31

Commit:	f7486dc
Author:	Rajesh Ilango	2021-03-31 12:39:24 -0700
Committer:	GitHub	2021-03-31 15:39:24 -0400

Ability to sample molecules in the vicinity of a given molecule in latent space (#13)
* Add the ability to sample molecules in the vicinity of a given molecule in the latent space. This change also includes:
  - UI changes to sample molecules from the visualization tool
  - gRPC service for Sampling service
  Minor refactor to rename files, variables and classes.
* Bugfixes and code for perf testing
* Performance testing code and some bugfixes
* Bugfix: Add back unintentional edit
* Adding a new script to generate molecules and store them in a CSV file.
* Performance Change: Use Dask to parallelize SMILES generation
* Fix the default radius for adding jitter.
Co-authored-by: Rajesh Ilango <rilango@nvidia.com>

Commit:	cc9f391
Author:	Rajesh Ilango	2021-03-31 12:39:24 -0700
Committer:	GitHub	2021-03-31 15:39:24 -0400

2021-03-01

Commit:	a9c0237
Author:	Rajesh Ilango	2021-03-01 14:10:00 -0800
Committer:	GitHub	2021-03-01 17:10:00 -0500

New Feature: Initial version of generative model inference... (#4)
* New Feature: Initial version of generative model inference... UI changes to allow users to select two ChEMBL IDs as input to the model to generate new embeddings in the latent space between the selected molecules. Other changes include:
  - Some cleanup in chemvisualize.py
  - Introduce molecular decorators for computing additional properties
  - A RESTful web service to expose the interpolator
  - A gRPC service to expose the interpolator
* Bugfix: LogP value is missing for generated molecules
* Bugfix: Minor UI issue with row border near mol structure
* Bugfix: Review comments...
  - Interpolation Bug: Database search using ID is case sensitive. Code is changed to always convert ID to uppercase.
  - ChEMBL ID Column in Table: ChEMBL ID is removed from the table. User can still add it from the dropdown. At the moment ChEMBL is used for search. All this search is done on clustered data which is retrieved from ChEMBL DB. There is no use for ChEMBL in the interpolation part of the code.
  - Removed Extraneous Boxes from Generate Panel
* Bugfix: Remove hardcoded GPUs
* Review Feedback: Color code molecular property based on rules... Added the ability to mark a 'level' for data. At the moment the allowed levels are 'info', 'warning' and 'error'. These levels can be used to map data to a style.
* Placeholder code to add QED. The other change required is to rename the decorator class.
Co-authored-by: Rajesh Ilango <rilango@nvidia.com>

Commit:	543d35e
Author:	Rajesh Ilango	2021-03-01 14:10:00 -0800
Committer:	GitHub	2021-03-01 17:10:00 -0500