These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown):
Commit: | 93578fc | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add check_sla support for pod restart Summary: Currently, pod restart does not respect the SLA of a job, since it is a user-initiated action. However, if an external script uses restart to do background work, the script would not want to break the SLA. To support this use case, add a check_sla flag. Reviewers: O521 Project Peloton: Add reviewers, sachins Reviewed By: O521 Project Peloton: Add reviewers, sachins Differential Revision: https://code.uberinternal.com/D3368191
The documentation is generated from this commit.
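The rule that the `check_sla` flag introduces can be sketched as follows. This is a hypothetical illustration, not the Peloton handler. `restartAllowed` and its parameters are assumptions, and the job SLA is reduced to a single maximum-unavailable-instances budget (a `maximum_unavailable_instances` SLA field appears in a job spec further down this page).

```go
package main

import "fmt"

// restartAllowed is a hypothetical helper that decides whether one more
// instance of a job may be restarted now.
//   - checkSLA mirrors the check_sla flag. When false, the restart is treated
//     as a user-initiated action and the SLA is not consulted.
//   - unavailable is the number of instances currently unavailable.
//   - maxUnavailable models the job's maximum_unavailable_instances budget.
func restartAllowed(checkSLA bool, unavailable, maxUnavailable uint32) bool {
	if !checkSLA {
		return true
	}
	// A restart takes one more instance down, so it must fit in the budget.
	return unavailable+1 <= maxUnavailable
}

func main() {
	fmt.Println(restartAllowed(false, 3, 1)) // user-initiated: SLA ignored
	fmt.Println(restartAllowed(true, 0, 1))  // fits the budget
	fmt.Println(restartAllowed(true, 1, 1))  // would break the SLA
}
```

A background script would pass `checkSLA = true` and simply retry later when the call is refused.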
Commit: | bde8e00 | |
---|---|---|
Author: | Anant Vyas | |
Committer: | Anant Vyas |
[resmgr] Add force flag to UpdateResourcePoolRequest Summary: This diff adds an optional `force` bool to the `UpdateResourcePoolRequest` which will skip the validation. There are certain situations where the resource pool can end up in a state where no updates can be performed because the underlying resources have diverged from the resource pool tree representation. To make the tree consistent again we need to skip validation. The operator should be very careful when performing operations by skipping the validation. This diff will be followed by a cli change to use the new flag. Test Plan: Refactored a bunch of tests in `handler_test` to table driven tests and removed some unused code/variables. Reviewers: O521 Project Peloton: Add reviewers, apoorvaj, sachins Reviewed By: O521 Project Peloton: Add reviewers, sachins Subscribers: kevinxu Maniphest Tasks: T3846023 Differential Revision: https://code.uberinternal.com/D3353113
Commit: | cf6d089 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Support revocable filtering for GetHostsByQuery Summary: GetHostsByQuery only returns non-revocable resources; this adds a flag so that we know how much resource is available for revocable placement. Test Plan:
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton hostmgr hosts -r
Hostname| CPU| GPU| MEM| Disk| State| Task Hold| Task Running
peloton-mesos-agent0| 2.80| 0.00| 2040.00 MB| 19960.00 MB| ready| |8d2caa4a-d361-4130-a7a5-47abfc85920a-2-1 0f3c9a61-b3a2-4f86-a920-9086a8eb1306-2-1 0f3c9a61-b3a2-4f86-a920-9086a8eb1306-1-1 113278e1-172f-4af5-b0ae-7600763d81cd-1-1
peloton-mesos-agent1| 2.90| 0.00| 2040.00 MB| 19960.00 MB| ready| |0f3c9a61-b3a2-4f86-a920-9086a8eb1306-0-1 113278e1-172f-4af5-b0ae-7600763d81cd-0-1 8d2caa4a-d361-4130-a7a5-47abfc85920a-0-1 8d2caa4a-d361-4130-a7a5-47abfc85920a-1-1
peloton-mesos-agent2| 3.00| 0.00| 2046.00 MB| 19990.00 MB| ready| |113278e1-172f-4af5-b0ae-7600763d81cd-2-1
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton hostmgr hosts
Hostname| CPU| GPU| MEM| Disk| State| Task Hold| Task Running
peloton-mesos-agent0| 3.90| 0.00| 2040.00 MB| 19960.00 MB| ready| |0f3c9a61-b3a2-4f86-a920-9086a8eb1306-1-1 113278e1-172f-4af5-b0ae-7600763d81cd-1-1 8d2caa4a-d361-4130-a7a5-47abfc85920a-2-1 0f3c9a61-b3a2-4f86-a920-9086a8eb1306-2-1
peloton-mesos-agent1| 3.80| 0.00| 2040.00 MB| 19960.00 MB| ready| |8d2caa4a-d361-4130-a7a5-47abfc85920a-0-1 8d2caa4a-d361-4130-a7a5-47abfc85920a-1-1 0f3c9a61-b3a2-4f86-a920-9086a8eb1306-0-1 113278e1-172f-4af5-b0ae-7600763d81cd-0-1
peloton-mesos-agent2| 4.00| 0.00| 2046.00 MB| 19990.00 MB| ready| |113278e1-172f-4af5-b0ae-7600763d81cd-2-1
```
Reviewers: O521 Project Peloton: Add reviewers, sachins, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, sachins Differential Revision: https://code.uberinternal.com/D3356103
Commit: | 14e5172 | |
---|---|---|
Author: | Mayank Bansal | |
Committer: | Mayank Bansal |
Integrating host mover with scorer and exposing the move host api Summary: This diff covers the following items: - Adding the MoveHosts API in host mover, which will be used by the host pool manager - Integrating with the resource manager scorer for getting the hosts - Integrating with the host pool manager for updating the desired pool - Integrating with the host goal state engine. Once the MoveHosts API is called from the host pool manager, this will be integrated with all the steps. Test Plan: tests included Reviewers: sachins, O521 Project Peloton: Add reviewers Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: apoorvaj, amitbose, varung, adityacb, jenkins Differential Revision: https://code.uberinternal.com/D3319351
Commit: | bb4797e | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
[Host Maintenance](Part 1) Abstract out jobmgr::preemptor to poll from resmgr and hostmgr Summary: This is the first of many diffs for Peloton Host Maintenance 2.0. This diff does the following: 1. Abstracts out the preemptor code in jobmgr to poll from both resmgr (for tasks to preempt) and hostmgr (for tasks to kill for host maintenance). 2. Adds a private hostmgr API (without implementation) to get tasks on hosts in DRAINING state. Test Plan: Unit tests Reviewers: zhixin, O521 Project Peloton: Add reviewers, adityacb Reviewed By: O521 Project Peloton: Add reviewers, adityacb Subscribers: adityacb, jenkins Differential Revision: https://code.uberinternal.com/D3288001
Commit: | 24f2d4a | |
---|---|---|
Author: | Amit Bose | |
Committer: | Amit Bose |
Publish host-pool capacity Summary: Adding an internal host-mgr API to give the resource capacity of all host-pools (similar to cluster capacity). For each host-pool, the capacity gets cached and it is refreshed periodically with the reconciliation cycle, or when a host moves from one pool to another. Test Plan: Unit-tests Reviewers: O521 Project Peloton: Add reviewers, yunpeng Reviewed By: yunpeng Subscribers: jenkins Maniphest Tasks: T3574041 Differential Revision: https://code.uberinternal.com/D3263663
Commit: | 11776b3 | |
---|---|---|
Author: | bin zhang | |
Committer: | bin zhang |
set ports in LaunchPod Summary: For P2K, jobmgr will set ports allocated in the launch request, and hostmgr will then save the ports in the first Container to persist the allocated ports. Reviewers: O521 Project Peloton: Add reviewers, zhixin Reviewed By: O521 Project Peloton: Add reviewers, zhixin Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3263705
Commit: | 30204e6 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Preempt tasks on mesos-task-id Summary: Earlier, resmgr exposed peloton-task-id of the tasks to be preempted. Jobmgr killed these tasks. The issue with this is jobmgr might kill tasks incorrectly if jobmgr processes the KILLED event faster than resmgr and the preemption cycle runs in resmgr before it processes the KILLED event. In such a case, jobmgr might incorrectly kill the newer run of the task (preemption was meant for older run). This diff addresses the above issue by preempting tasks on mesos-task-id. If the mesos-task-id of the task-to-be-preempted is different from the task in jobmgr then the kill is skipped. Test Plan: Unit tests Reviewers: avyas, zhixin, O521 Project Peloton: Add reviewers Reviewed By: avyas, zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3279417
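The ID comparison described above can be sketched in Go. `preemptionCandidate` and `shouldKill` are hypothetical names, not Peloton code, and the Mesos task IDs are illustrative strings that loosely follow the `<job>-<instance>-<run>` shape visible in the CLI output earlier on this page.

```go
package main

import "fmt"

// preemptionCandidate is a hypothetical view of what resmgr hands to jobmgr:
// the Peloton task plus the Mesos task ID of the run to be preempted.
type preemptionCandidate struct {
	PelotonTaskID string
	MesosTaskID   string
}

// shouldKill reports whether jobmgr should act on a preemption request.
// currentMesosTaskID is the Mesos task ID jobmgr currently holds for the task.
// If the IDs differ, the task has already been relaunched and the request
// targets an older run, so the kill is skipped.
func shouldKill(c preemptionCandidate, currentMesosTaskID string) bool {
	return c.MesosTaskID == currentMesosTaskID
}

func main() {
	c := preemptionCandidate{PelotonTaskID: "job1-0", MesosTaskID: "job1-0-1"}
	fmt.Println(shouldKill(c, "job1-0-1")) // same run: kill it
	fmt.Println(shouldKill(c, "job1-0-2")) // newer run: skip the kill
}
```

Keying on the run-specific ID rather than the Peloton task ID is what makes a stale preemption decision harmless.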
Commit: | 440be71 | |
---|---|---|
Author: | Kevin Xu | |
Committer: | Kevin Xu |
Watch api for jobs handler implementation Summary: Implements the watch API for jobs in the watch processor and handler. Reviewers: apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T3787145 Differential Revision: https://code.uberinternal.com/D3257365
Commit: | a0b2002 | |
---|---|---|
Author: | Gautam Kulkarni | |
Committer: | Gautam Kulkarni |
Enhanced the QoS API GetServicesImpacted Summary: This change adds resources consumed by a service instance. This information will be used by Peloton to decide which sets of services can be evicted to satisfy the "resources needed" constraint on a given host. Reviewers: shyiko, amitbose, O521 Project Peloton: Add reviewers Reviewed By: shyiko, amitbose, O521 Project Peloton: Add reviewers Subscribers: mabansal, jenkins Differential Revision: https://code.uberinternal.com/D3258581
Commit: | bb351d7 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add network specification to the apachemesos pod specification. Summary: Allows a customer to connect to a network other than the host network in Mesos. This is used by the federation layer for minicluster-based testing. Reviewers: kevinxu, zhixin, O521 Project Peloton: Add reviewers, amitbose Reviewed By: kevinxu, O521 Project Peloton: Add reviewers, amitbose Subscribers: adityacb, yiran, amitbose, jenkins Differential Revision: https://code.uberinternal.com/D3234937
Commit: | e62d6f5 | |
---|---|---|
Author: | Mayank Bansal | |
Committer: | Mayank Bansal |
Adding Host pool information into host info table for Goal State for host mover Summary: We need to add a goal state engine for host mover for two reasons: 1. We need to handle recovery. 2. We also need to handle failure scenarios consistently. This diff addresses the DB part of storing host pool info in the host info table, adding two fields to that table: 1. current_pool 2. desired_pool. We need these two fields to maintain the goal state engine for host pools in host mover. Test Plan: - Unit tests added - tested with mini cluster Reviewers: varung, sachins, rcharles, amitbose, O521 Project Peloton: Add reviewers Reviewed By: varung, sachins, O521 Project Peloton: Add reviewers Subscribers: apoorvaj, yunpeng, jenkins Differential Revision: https://code.uberinternal.com/D3245931
Commit: | 2767767 | |
---|---|---|
Author: | Aditya Bhave | |
Committer: | Aditya Bhave |
Fix files in mesos protobuf to their correct version from upstream 1.7.x branch Summary: While building the Java client, I got this exception:
```
[ERROR] /Users/adityacb/gocode/src/code.uber.internal/infra/peloton-client/java/client/../../protobuf/mesos/v1/resource_provider/resource_provider.proto [0:0]: mesos/v1/executor/executor.proto:56:22: "mesos.v1.maintenance.Window.machine_ids" is already defined in file "mesos/v1/maintenance/maintenance.proto".
```
When I debugged further, I found that the mesos proto files in the peloton repo were wrong: executor.proto had been overwritten with maintenance.proto. Replacing them with the appropriate versions from the Apache Mesos repo for the 1.7.x branch. Reviewers: O521 Project Peloton: Add reviewers, sachins Reviewed By: O521 Project Peloton: Add reviewers, sachins Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3248419
Commit: | 61c260b | |
---|---|---|
Author: | Aditya Bhave | |
Committer: | Aditya Bhave |
Add event id to the v1alpha event stream for dedupe Summary: Add EventID to the v1alpha proto for the event stream, so that we can use it to dedupe the event stream. For Mesos, this will be the UUID from the Mesos TaskStatusUpdate. For K8S, this will be the ResourceVersion from the pod status update. Next, add southbound dedupe for the Mesos plugin. After that, add northbound dedupe in the v1alpha event stream. Reviewers: O521 Project Peloton: Add reviewers, evelynl, zhixin Reviewed By: evelynl Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3241513
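This commit only adds the ID; the dedupe logic itself is left to later diffs. One possible approach is to remember a bounded window of recent event IDs and drop repeats. The sketch below is a hypothetical illustration under that assumption: the `deduper` type, its capacity, and FIFO eviction are not taken from Peloton.

```go
package main

import "fmt"

// deduper drops events whose ID was already seen. It remembers at most
// capacity IDs (capacity <= 0 means unbounded) and evicts the oldest first,
// so memory stays bounded on a long-running stream.
type deduper struct {
	capacity int
	seen     map[string]struct{}
	order    []string // insertion order, oldest first
}

func newDeduper(capacity int) *deduper {
	return &deduper{capacity: capacity, seen: make(map[string]struct{})}
}

// accept returns true the first time an event ID is observed, and false for
// a duplicate that is still inside the remembered window.
func (d *deduper) accept(eventID string) bool {
	if _, ok := d.seen[eventID]; ok {
		return false
	}
	if d.capacity > 0 && len(d.order) == d.capacity {
		oldest := d.order[0]
		d.order = d.order[1:]
		delete(d.seen, oldest)
	}
	d.seen[eventID] = struct{}{}
	d.order = append(d.order, eventID)
	return true
}

func main() {
	d := newDeduper(1024)
	// IDs stand in for Mesos status-update UUIDs or K8s ResourceVersions.
	for _, id := range []string{"uuid-1", "uuid-2", "uuid-1"} {
		fmt.Println(id, d.accept(id))
	}
}
```

The window size trades memory for how late a redelivered event can arrive and still be caught.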
Commit: | 696fcbc | |
---|---|---|
Author: | Evelyn Liu | |
Committer: | Evelyn Liu |
Add KillAndHoldPods handler in hostmgtsvc Summary: Part 2 of in-place upgrade, following the in-place upgrade design in hostpool and hostsvc. Test Plan: TAGS=k8s make stateless-integ-test Reviewers: #p2k, O521 Project Peloton: Add reviewers, adityacb Reviewed By: #p2k, O521 Project Peloton: Add reviewers, adityacb Subscribers: adityacb, jenkins Differential Revision: https://code.uberinternal.com/D3237341
Commit: | 30fa24a | |
---|---|---|
Author: | Mariia Fesenko | |
Committer: | Mariia Fesenko |
Add support for configuring ports using a template variable Reviewers: O521 Project Peloton: Add reviewers, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, apoorvaj Subscribers: apoorvaj, jenkins Differential Revision: https://code.uberinternal.com/D3170515
Commit: | ef79cac | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Add jobmgr private API to get instance availability information Summary: This diff adds a debug API (and CLI) to get instance availability information for a job. Test Plan: Unit tests
```
$ bin/peloton jobmgr instance-availability db63bf76-9dd1-4464-9281-ae38904f6aab --instances 0,2,3
{
  "instance_availability_map": {
    "0": "InstanceAvailability_UNAVAILABLE",
    "2": "InstanceAvailability_KILLED",
    "3": "InstanceAvailability_AVAILABLE"
  }
}
$ bin/peloton jobmgr instance-availability 786f663c-ce6b-4d6f-98d8-a2ccb9f32541
instance_availability_map:
  "0": InstanceAvailability_AVAILABLE
  "1": InstanceAvailability_AVAILABLE
  "2": InstanceAvailability_AVAILABLE
```
Reviewers: zhixin, apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3191905
Commit: | 693362e | |
---|---|---|
Author: | Aihua Xu | |
Committer: | Aihua Xu |
Host mover scorer: implement batch host scorer Summary: Implements the host scorer for host mover, which updates host metrics from task events and periodically generates a host list sorted by those metrics. Callers can retrieve the cached, sorted host list. Reviewers: O521 Project Peloton: Add reviewers, mabansal, pgolash, amitbose, apoorvaj, varung Reviewed By: O521 Project Peloton: Add reviewers, mabansal, amitbose, apoorvaj Subscribers: apoorvaj, amitbose, jenkins Maniphest Tasks: T3717507 Differential Revision: https://code.uberinternal.com/D3118027
Commit: | 0966c2e | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Upgrade minicluster mesos config to 1.7.1 Summary: Bump up the mesos config from 1.6.0 to 1.7.1 for minicluster Test Plan: Unit Test Integration Test Reviewers: sachins, O521 Project Peloton: Add reviewers Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3101533
Commit: | e481a09 | |
---|---|---|
Author: | Yunpeng Liu | |
Committer: | Yunpeng Liu |
Add host labels into host infos table Summary: Add host labels into host_infos table and corresponding protobuf messages. Reviewers: O521 Project Peloton: Add reviewers, amitbose Reviewed By: O521 Project Peloton: Add reviewers, amitbose Subscribers: varung, rcharles, apoorvaj, jenkins Differential Revision: https://code.uberinternal.com/D3134081
Commit: | 18ffb90 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Mark failed instances and instances killed for update as unavailable Summary: This diff is a follow up of 'Define SLAInfo and add SLA Awareness to Host Maintenance'(42822c15a8023f4717c7d708c1e3f94d56ceb93b) which marked instances killed for update as KILLED. This diff does the following 1. Marks failed instances and instances killed for update as unavailable. 2. In update goalstate actions, handle `PatchTasks` retry for task(s) not in cache. With this diff, the SLA Aware Pod Kill feature is complete. Test Plan: - Added unit tests - Added integration tests Reviewers: zhixin, apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3146873
Commit: | 95413a8 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add populate for host to task map with recovery Summary: The goal here is to build the host-to-task map during recovery. The map is populated on the following actions: - Recovery at host manager on leadership change. Only tasks in a non-terminal state, i.e. with a host assigned, are added. - Launch of a new task on a host. - Receipt of a Mesos task status update via the event stream. The host-to-task mapping is removed on a terminal event. Reviewers: apoorvaj, amitbose, adityacb, sachins, O521 Project Peloton: Add reviewers Reviewed By: amitbose, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3111209
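A minimal sketch of such a host-to-task index follows. The `hostTaskMap` type and its methods are hypothetical, not Peloton's implementation. Treating every non-terminal status update as an add is an assumption; the commit only states that tasks are added on recovery and launch, and removed on a terminal event.

```go
package main

import "fmt"

// hostTaskMap is a hypothetical in-memory index from hostname to the set of
// task IDs assigned to or running on that host.
type hostTaskMap struct {
	tasks map[string]map[string]struct{}
}

func newHostTaskMap() *hostTaskMap {
	return &hostTaskMap{tasks: make(map[string]map[string]struct{})}
}

// add records that taskID is on host. Used on recovery (non-terminal tasks
// only) and when a task is launched.
func (m *hostTaskMap) add(host, taskID string) {
	if m.tasks[host] == nil {
		m.tasks[host] = make(map[string]struct{})
	}
	m.tasks[host][taskID] = struct{}{}
}

// onStatusUpdate applies a task status update from the event stream. A
// terminal state removes the mapping; any other state (assumption) adds it.
func (m *hostTaskMap) onStatusUpdate(host, taskID string, terminal bool) {
	if !terminal {
		m.add(host, taskID)
		return
	}
	delete(m.tasks[host], taskID) // deleting from a nil map is a no-op
	if len(m.tasks[host]) == 0 {
		delete(m.tasks, host) // drop hosts that no longer have tasks
	}
}

// count returns the number of tasks currently mapped to host.
func (m *hostTaskMap) count(host string) int { return len(m.tasks[host]) }

func main() {
	m := newHostTaskMap()
	m.add("peloton-mesos-agent0", "task-a")                  // recovered task
	m.add("peloton-mesos-agent0", "task-b")                  // newly launched task
	m.onStatusUpdate("peloton-mesos-agent0", "task-a", true) // terminal event
	fmt.Println(m.count("peloton-mesos-agent0"))
}
```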
Commit: | c295b89 | |
---|---|---|
Author: | Amit Bose | |
Committer: | Amit Bose |
Add APIs for operating on host pools Summary: Define some simple APIs in host-manager for CRUD operations on host-pools. Also wired up the implementation for these APIs to HostPoolManager. Removed the private hostmgr API for changing host pool because the public API covers it as well. Test Plan: Unit-tests Reviewers: O521 Project Peloton: Add reviewers, yunpeng Reviewed By: yunpeng Subscribers: jenkins Maniphest Tasks: T3604677, T3638523 Differential Revision: https://code.uberinternal.com/D3144859
Commit: | 1587ba9 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add and populate resources to be given to the mesos executor Reviewers: kevinxu, O521 Project Peloton: Add reviewers Reviewed By: kevinxu, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3144647
Commit: | 29cb604 | |
---|---|---|
Author: | Amit Bose | |
Committer: | Amit Bose |
Implement host-pool change and publish event Summary: Add code to change the host-pool associated with a host. Also generate an event on the host-manager event stream when this happens. A new event type is defined for this because existing events are task-specific. The new event is defined only in v0 at this time; v1alpha will come in a subsequent diff. Note that because of this new event type, guards had to be added in a number of files such as common/statusupdate, jobmgr/task/event, resmgr/handler and hostmgr/handler. The code in these files was written with the assumption that events are for task status updates only. Test Plan: Unit-tests Reviewers: O521 Project Peloton: Add reviewers, yunpeng Reviewed By: yunpeng Subscribers: apoorvaj, jenkins Maniphest Tasks: T3554395 Differential Revision: https://code.uberinternal.com/D3133171
Commit: | 7590a45 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Implement lock on kill operation Summary: Add support for a lock on kill/executor shutdown. While locked, such operations return an error, so the goal state engine retries the action later. Test Plan: Tested with minicluster. The TaskStop action is blocked after the CLI issues the lock command. After unlock is issued, the goal state engine waits for its exponential backoff, and the job gets killed. Reviewers: O521 Project Peloton: Add reviewers, sachins, adityacb, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, adityacb Subscribers: jenkins Maniphest Tasks: T2139751 Differential Revision: https://code.uberinternal.com/D3120159
Commit: | 5716051 | |
---|---|---|
Author: | Sishi Long | |
Committer: | Sishi Long |
Load Aware placement Summary: - Create a CQos advisor client if the CQos address is set in the hostmgr section of the cluster config. For now, CQos discovery is through a static IP address and port; this may change in the future. - Default to the load-aware ranker if bin_packing (a hostmgr param in the cluster config) is set to "LOAD_AWARE". - Poll the CQos advisor to get host metrics, which contain the load information of each host. - Rank the host summary list by host load in ascending order using bucket sort, in O(n). - Refresh the host summary list asynchronously. Test Plan: Unit tests added Reviewers: amitbose, kulkarni, O521 Project Peloton: Add reviewers, varung, apoorvaj Reviewed By: amitbose, O521 Project Peloton: Add reviewers, varung Subscribers: varung, apoorvaj, jenkins Maniphest Tasks: T3564301, T3565421 Differential Revision: https://code.uberinternal.com/D3074469
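The O(n) ranking step can be illustrated with a bucket sort. This is a minimal sketch, not the Peloton ranker. It assumes load is a utilization percentage in [0, 100] and uses one bucket per whole percent; `hostLoad` and `rankByLoad` are hypothetical names.

```go
package main

import "fmt"

// hostLoad is a hypothetical pairing of a hostname with its load, assumed to
// be a utilization percentage in [0, 100].
type hostLoad struct {
	Host string
	Load float64
}

// rankByLoad orders hosts by ascending load using a bucket sort with one
// bucket per whole percent. Both the pass over the hosts and the pass over
// the fixed 101 buckets are linear. Hosts whose loads fall into the same
// whole-percent bucket keep their input order and are not sorted further.
func rankByLoad(hosts []hostLoad) []hostLoad {
	const numBuckets = 101
	buckets := make([][]hostLoad, numBuckets)
	for _, h := range hosts {
		i := int(h.Load) // truncate to a whole percent
		if i < 0 {
			i = 0
		} else if i >= numBuckets {
			i = numBuckets - 1
		}
		buckets[i] = append(buckets[i], h)
	}
	ranked := make([]hostLoad, 0, len(hosts))
	for _, b := range buckets {
		ranked = append(ranked, b...)
	}
	return ranked
}

func main() {
	// Illustrative loads only.
	hosts := []hostLoad{{"host-a", 72.5}, {"host-b", 12.0}, {"host-c", 40.1}}
	for _, h := range rankByLoad(hosts) {
		fmt.Println(h.Host, h.Load)
	}
}
```

The fixed bucket count is what keeps this linear. The price is that ordering within a one-percent band is only approximate, which is usually acceptable for spreading load.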
Commit: | a1f7f9c | |
---|---|---|
Author: | Antoine Pourchet | |
Committer: | Antoine Pourchet |
HostMgr: GetHostCache Summary: GetHostCache will allow us to dump the contents of the hostcache in hostmanager to let us debug what its state is in memory. Test Plan: Unit tests pass. Reviewers: adityacb, O521 Project Peloton: Add reviewers Reviewed By: adityacb, O521 Project Peloton: Add reviewers Subscribers: varung, jenkins Differential Revision: https://code.uberinternal.com/D3096897
Commit: | 175ff17 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add support for Read/Write API lock Summary: Provide a mechanism to lock the Job Manager read and write APIs. Add `LabelManager` to add labels to APIs. Peloton uses a config file to add read and write labels to each API, and `APILockInboundMiddleware` uses the label to decide whether an API is a read or a write API. Test Plan:
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin lock Read Write
lock on Read succeeded
lock on Write succeeded
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless create /DefaultResPool ~/Desktop/testSpec.yaml 0
peloton: error: code:internal message:All write APIs are locked down
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get 11111
peloton: error: code:internal message:All read APIs are locked down
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin unlock Write
unlock on Write succeeded
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless create /DefaultResPool ~/Desktop/testSpec.yaml 0
Job 83fd50df-9feb-4172-873d-8ad1f7372b64 created.
Entity Version: 1-1-1
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get 83fd50df-9feb-4172-873d-8ad1f7372b64
peloton: error: code:internal message:All read APIs are locked down
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin unlock Read
unlock on Read succeeded
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get 83fd50df-9feb-4172-873d-8ad1f7372b64
job_info:
  job_id:
    value: 83fd50df-9feb-4172-873d-8ad1f7372b64
  spec:
    default_spec:
      containers:
      - command:
          shell: true
          value: while :; do echo running 1; sleep 1; done
        image: ""
        name: ""
        resource:
          cpu_limit: 0.1
          disk_limit_mb: 10
          fd_limit: 0
          gpu_limit: 0
          mem_limit_mb: 2
      controller: false
      kill_grace_period_seconds: 0
      revocable: false
    description: A dummy test spec for peloton
    instance_count: 30
    labels:
    - key: testKey0
      value: testVal0
    - key: testKey1
      value: testVal1
    - key: testKey2
      value: testVal2
    ldap_groups:
    - team6
    - otto
    name: TestSpec
    owner: testUser
    owning_team: testTeam
    respool_id:
      value: cb74cb4f-a068-46b9-ae70-d8aca50488fc
    revision:
      created_at: "1564166186466543400"
      updated_at: "1564166186466543400"
      updated_by: ""
      version: "1"
    sla:
      maximum_unavailable_instances: 0
      preemptible: false
      priority: 0
      revocable: false
  status:
    creation_time: "2019-07-26T18:36:26.4693266Z"
    desired_state: JOB_STATE_RUNNING
    pod_stats:
      POD_STATE_DELETED: 0
      POD_STATE_FAILED: 0
      POD_STATE_INITIALIZED: 0
      POD_STATE_INVALID: 0
      POD_STATE_KILLED: 0
      POD_STATE_KILLING: 0
      POD_STATE_LAUNCHED: 0
      POD_STATE_LAUNCHING: 0
      POD_STATE_LOST: 0
      POD_STATE_PENDING: 0
      POD_STATE_PLACED: 0
      POD_STATE_PLACING: 0
      POD_STATE_PREEMPTING: 0
      POD_STATE_READY: 0
      POD_STATE_RESERVED: 0
      POD_STATE_RUNNING: 30
      POD_STATE_STARTING: 0
      POD_STATE_SUCCEEDED: 0
    pod_stats_by_configuration_version:
      1-0-0:
        state_stats:
          POD_STATE_RUNNING: 30
    revision:
      created_at: "1564166186469326600"
      updated_at: "1564166201939393200"
      updated_by: ""
      version: "8"
    state: JOB_STATE_RUNNING
    version:
      value: 1-1-1
    workflow_status:
      completion_time: "2019-07-26T18:36:41.9623491Z"
      creation_time: "2019-07-26T18:36:26.478Z"
      num_instances_completed: 30
      num_instances_failed: 0
      num_instances_remaining: 0
      prev_state: WORKFLOW_STATE_ROLLING_FORWARD
      prev_version:
        value: 0-1-1
      state: WORKFLOW_STATE_SUCCEEDED
      type: WORKFLOW_TYPE_UPDATE
      update_time: "2019-07-26T18:36:41.962Z"
      version:
        value: 1-1-1
workflow_info:
  events:
  - state: WORKFLOW_STATE_SUCCEEDED
    timestamp: "2019-07-26T18:36:41Z"
    type: WORKFLOW_TYPE_UPDATE
  - state: WORKFLOW_STATE_ROLLING_FORWARD
    timestamp: "2019-07-26T18:36:26Z"
    type: WORKFLOW_TYPE_UPDATE
  - state: WORKFLOW_STATE_INITIALIZED
    timestamp: "2019-07-26T18:36:26Z"
    type: WORKFLOW_TYPE_UPDATE
  - state: WORKFLOW_STATE_INITIALIZED
    timestamp: "2019-07-26T18:36:26Z"
    type: WORKFLOW_TYPE_UPDATE
  instances_added:
  - from: 0
    to: 29
  opaque_data:
    data: ""
  status:
    completion_time: "2019-07-26T18:36:41.9623491Z"
    creation_time: "2019-07-26T18:36:26.478Z"
    num_instances_completed: 30
    num_instances_failed: 0
    num_instances_remaining: 0
    prev_state: WORKFLOW_STATE_ROLLING_FORWARD
    prev_version:
      value: 0-1-1
    state: WORKFLOW_STATE_SUCCEEDED
    type: WORKFLOW_TYPE_UPDATE
    update_time: "2019-07-26T18:36:41.962Z"
    version:
      value: 1-1-1
  update_spec:
    batch_size: 0
    in_place: false
    max_instance_retries: 0
    max_tolerable_instance_failures: 0
    rollback_on_failure: false
    start_paused: false
    start_pods: false
```
Reviewers: apoorvaj, sachins, O521 Project Peloton: Add reviewers, adityacb Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2139751 Differential Revision: https://code.uberinternal.com/D3093967
Commit: | 1382aa2 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add translation to new v1alpha API Summary: Add translation between the newly added components in the v1alpha API and the v0 API. This translation should be used only for Mesos, when the executor data is provided in the API. Reviewers: O521 Project Peloton: Add reviewers, zhixin, adityacb Reviewed By: O521 Project Peloton: Add reviewers, zhixin Subscribers: pourchet, jenkins Differential Revision: https://code.uberinternal.com/D2925301
Commit: | d4507f0 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add pause/resume goal state engine support Summary: Add an admin service which allows locking and unlocking different Peloton components. This diff adds the first use case of the admin service, which enables lockdown of the goal state engine. Test Plan:
Case 1: enqueued into goal state engine after lock
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin lock GoalStateEngine
lock on GoalStateEngine succeeded
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton respool create /DefaultResPool ./example/default_respool.yaml
Resource Pool dddb7513-0962-488c-8f0b-1500d0d7f215 created at /DefaultResPool
# job is not created due to the lock
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless create /DefaultResPool ~/Desktop/testSpec.yaml 0
Job 9e019abb-cacf-4d9e-bbda-11f34dcc95df created.
Entity Version: 1-1-1
# job gets created after the lock is removed
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin unlock GoalStateEngine
unlock on GoalStateEngine succeeded
```
Case 2: enqueued into goal state engine before lock
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless create /DefaultResPool ~/Desktop/testSpec.yaml 0; sleep 1; ./bin/peloton admin lock GoalStateEngine
Job 9be429a7-a79c-44f7-916d-f6b21c6b48b8 created.
Entity Version: 1-1-1
lock on GoalStateEngine succeeded
# job is partially created
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton admin unlock GoalStateEngine
unlock on GoalStateEngine succeeded
# job continues to be created
```
Reviewers: apoorvaj, sachins, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2139751 Differential Revision: https://code.uberinternal.com/D3066873
Commit: | 680ac77 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Upgrade minicluster mesos config to 1.7.1 Summary: Bump up the mesos config from 1.6.0 to 1.7.1 for minicluster Test Plan: Unit Test Integration Test Reviewers: sachins, O521 Project Peloton: Add reviewers Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3101533
Commit: | f2cf4d9 | |
---|---|---|
Author: | Arpit Goyal | |
Committer: | Arpit Goyal |
Added support for listening to hostSummary proto object Summary: 1. Added a new hostSummary object proto. 2. Made watchevent/processor.go agnostic to the event type; it takes the event object as an interface. 3. Introduced the concept of a topic, which is the **event object type** the client wants to listen to. Currently the eventstream.Event and hostmgr.HostSummary object types are supported. 4. Added a notifyEvent function in summary.go, which propagates changes to clients listening for the host summary. 5. Added a watchProcessor parameter to all the dependent struct objects. 6. Updated the test cases to include these changes. Test Plan: Tested locally using minicluster. Case 1: Request watch client for hostSummary topic {F140761043} Case 2: Request watch client for mesos task update topic {F140761087} Reviewers: O521 Project Peloton: Add reviewers, varung, adityacb Reviewed By: O521 Project Peloton: Add reviewers, varung Subscribers: amitbose, yiran, adityacb, jenkins Tags: #peloton Maniphest Tasks: T3115863 Differential Revision: https://code.uberinternal.com/D2901731
Commit: | 40707a9 | |
---|---|---|
Author: | Gautam Kulkarni | |
Committer: | Gautam Kulkarni |
Added cQoS API protobuf definitions Summary: The cQoS APIs are added as a third-party dependency in Peloton. Reviewers: amitbose, sishi, shyiko, rmalpani, O521 Project Peloton: Add reviewers, apoorvaj Reviewed By: amitbose, sishi, shyiko, rmalpani, O521 Project Peloton: Add reviewers, apoorvaj Subscribers: avyas, apoorvaj, jenkins Differential Revision: https://code.uberinternal.com/D3059027
Commit: | 52d0fd0 | |
---|---|---|
Author: | Yunpeng Liu | |
Committer: | Yunpeng Liu |
Add new rpc call ChangeHostPool to private protobuf Summary: - Add new rpc call ChangeHostPool to private protobuf. Reviewers: O521 Project Peloton: Add reviewers, amitbose Reviewed By: O521 Project Peloton: Add reviewers, amitbose Subscribers: apoorvaj, adityacb, pgolash, jenkins Tags: #peloton Differential Revision: https://code.uberinternal.com/D3054555
Commit: | 080928d | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Revert "Revert "Define SLAInfo and add SLA Awareness to Host Maintenance"" Summary: This reverts commit 85702022fdcc513500d6c28c04d600c031be2d0c. Reviewers: zhixin, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3060759
Commit: | 6f3aeec | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Revert "Revert "Define SLAInfo and add SLA Awareness to Host Maintenance"" Summary: This reverts commit 85702022fdcc513500d6c28c04d600c031be2d0c. Reviewers: zhixin, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3060759
Commit: | f000a28 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Revert "Define SLAInfo and add SLA Awareness to Host Maintenance" Summary: This reverts commit 42822c15a8023f4717c7d708c1e3f94d56ceb93b. The above commit causes some performance degradation. Reverting it until the issue is root-caused and/or fixed. Reviewers: sishi, O521 Project Peloton: Add reviewers Reviewed By: sishi, O521 Project Peloton: Add reviewers Subscribers: jenkins Tags: #peloton Differential Revision: https://code.uberinternal.com/D3055239
Commit: | 8570202 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Revert "Define SLAInfo and add SLA Awareness to Host Maintenance" Summary: This reverts commit 42822c15a8023f4717c7d708c1e3f94d56ceb93b. The above commit causes some performance degradation. Reverting it until the issue is root-caused and/or fixed. Reviewers: sishi, O521 Project Peloton: Add reviewers Reviewed By: sishi, O521 Project Peloton: Add reviewers Subscribers: jenkins Tags: #peloton Differential Revision: https://code.uberinternal.com/D3055239
Commit: | 42822c1 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Define SLAInfo and add SLA Awareness to Host Maintenance Summary: This diff does the following - Define SLA info in job cache. - Add watch in task runtime to update job SLA whenever task runtime changes. - Add SLA Awareness to Host Maintenance. Will address the following in coming diffs - Add SLA awareness to updates - Integration tests for SLA Aware Host Maintenance. Test Plan: - Unit tests - Tested on minicluster Reviewers: apoorvaj, zhixin, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: avyas, jenkins Tags: #peloton Maniphest Tasks: T2057243 Differential Revision: https://code.uberinternal.com/D2961007
Commit: | b02cef7 | |
---|---|---|
Author: | Charles Raimbert | |
Committer: | Charles Raimbert |
ORM: Add HostInfo object representing a host in maintenance Summary: All host maintenance information is currently stored directly by Mesos Master. Peloton needs to store its own host maintenance information since in the future not all the hosts in Mesos maintenance will also be in Peloton maintenance. This depends on ORM `GetAll()` capability fetching all rows of a table. Reviewers: O521 Project Peloton: Add reviewers, amitbose, adityacb Reviewed By: O521 Project Peloton: Add reviewers, amitbose, adityacb Subscribers: adityacb, yuweit, amitbose, jenkins Differential Revision: https://code.uberinternal.com/D2760047
Commit: | 8174b5a | |
---|---|---|
Author: | Antoine Pourchet | |
Committer: | Antoine Pourchet |
Added the PortRange protobuf for use in placement engine Summary: The HostLease needs to have a member field to tell the placement engine how many ports are available in this lease, and which ports they are. The PortRange proto follows the mesos definition for its own protos. Reviewers: adityacb, O521 Project Peloton: Add reviewers Reviewed By: adityacb, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D3005991
Commit: | cc1fb89 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Drop deprecated fields in resmgr.Placement proto message Summary: This diff drops the below deprecated fields from `resmgr.Placement` - `tasks` - `offerIds` It also removes any references in the code to the above fields. Test Plan: - Adapted unit tests Reviewers: avyas, varung, O521 Project Peloton: Add reviewers Reviewed By: avyas, O521 Project Peloton: Add reviewers Subscribers: apoorvaj, jenkins Maniphest Tasks: T3247507 Differential Revision: https://code.uberinternal.com/D3004177
Commit: | 02c6956 | |
---|---|---|
Author: | Charles Raimbert | |
Committer: | Charles Raimbert |
Host Maintenance APIs: refactor write API calls to a single host instead of a list Summary: The Host Maintenance public & private APIs performing a write action (public StartMaintenance/CompleteMaintenance, private MarkHostsDrained) currently request a write on a list of hostnames. Dealing with a single hostname per request is preferable, for better error handling and simpler handlers. This is preparatory work before adding persistence in Cassandra. Note: A subsequent diff will add YARPC errors for all the Host Maintenance public & private APIs. Test Plan: adapted unit tests Reviewers: O521 Project Peloton: Add reviewers, sachins Reviewed By: O521 Project Peloton: Add reviewers, sachins Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2782675
Commit: | 62e9f9d | |
---|---|---|
Author: | Antoine Pourchet | |
Committer: | Antoine Pourchet |
Host Manager: Added TerminateLeases protobuf definition Summary: Added the protobuf definition for the HostManager `TerminateLeases` internal RPC call. Ideally, `TerminateLeases` would only take in a list of `LeaseID`. However, the `HostCache` internals only contain an index of the hosts on their hostnames, so we need to take in both `LeaseID` and `hostname` for each lease we want to terminate. At this time, we would like to avoid maintaining a new index of the hosts on their `LeaseID`, because that would lead to more complex code in the `HostCache` and more tricky synchronization that is prone to errors. Reviewers: adityacb, yiran, O521 Project Peloton: Add reviewers Reviewed By: adityacb, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2996359
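This commit describes a `HostCache` that is indexed only by hostname. Under that constraint, the Go sketch below models a lease termination that takes both identifiers. It is a minimal, hypothetical model, not Peloton's code: `hostCache`, `hostEntry`, `leaseToTerminate` and `terminateLease` are illustrative names. The hostname gives a direct lookup into the existing index. The `LeaseID` is then used only to check that the caller is terminating the lease the host currently holds.

```go
package main

import (
	"errors"
	"fmt"
)

// hostEntry is a hypothetical, simplified host cache entry.
type hostEntry struct {
	leaseID string // empty when the host is not leased
}

// hostCache is indexed by hostname only, mirroring the HostCache
// constraint described in the commit message.
type hostCache struct {
	hosts map[string]*hostEntry
}

// leaseToTerminate carries both identifiers, as TerminateLeases does.
type leaseToTerminate struct {
	LeaseID  string
	Hostname string
}

var errLeaseMismatch = errors.New("lease ID does not match the host's current lease")

// terminateLease looks the host up by hostname (the only index) and uses
// the LeaseID to reject stale or mistaken termination requests.
func (c *hostCache) terminateLease(l leaseToTerminate) error {
	h, ok := c.hosts[l.Hostname]
	if !ok {
		return fmt.Errorf("unknown host %q", l.Hostname)
	}
	if h.leaseID != l.LeaseID {
		return errLeaseMismatch
	}
	h.leaseID = "" // host becomes available for a new lease
	return nil
}

func main() {
	c := &hostCache{hosts: map[string]*hostEntry{"agent0": {leaseID: "lease-1"}}}
	fmt.Println(c.terminateLease(leaseToTerminate{LeaseID: "lease-2", Hostname: "agent0"}))
	fmt.Println(c.terminateLease(leaseToTerminate{LeaseID: "lease-1", Hostname: "agent0"}))
}
```

Passing both identifiers avoids a second `LeaseID`-keyed index. The cost is that callers must carry the hostname alongside each lease.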
Commit: | ec8f090 | |
---|---|---|
Author: | Charles Raimbert | |
Committer: | Charles Raimbert |
API: Request on a single host instead of list of hosts for HostMaintenance public APIs Summary: Standalone API protobuf change to deprecate the use of field `hostnames` as a list in favor of a single `hostname`. This is a prerequisite step for updating the peloton-client (its protobufs & generated bindings). Reviewers: O521 Project Peloton: Add reviewers, sachins Reviewed By: O521 Project Peloton: Add reviewers, sachins Subscribers: sachins, jenkins Differential Revision: https://code.uberinternal.com/D2968613
Commit: | bbd38e7 | |
---|---|---|
Author: | Yiran Wang | |
Committer: | Yiran Wang |
[P2K merge] Hostmgr & EventStream v1alpha private API Summary: Merging existing changes from P2K branch back to master. Hostmgr and Event stream API changes, including: 1. AcquireHosts/LaunchPod/KillPod API 2. New Pod Event that would be sent via the hostmgr event stream Reviewers: apoorvaj, adityacb, varung, mabansal, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: binz, min, sachins, jenkins Differential Revision: https://code.uberinternal.com/D2947063
Commit: | 62e1203 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Add API and metrics for orphan RM tasks Summary: Orphan RM tasks are tasks which are holding up resources but are no longer tracked by resource manager tracker. - Add API to dump orphan rm tasks - Add metrics to track number of orphan rm tasks Test Plan: Unit tests ``` $ bin/peloton resmgr tasks orphan TaskID| Hostname| CPU| GPU| MemoryMB| DiskMB| FD| f5ebd551-24ec-4958-9337-ebc0280aef51-2-1| peloton-mesos-agent1| 0.1| 0| 2| 10| 0| 16371521-0e95-468e-821e-bcd34f6b8a7d-0-1| peloton-mesos-agent0| 0.1| 0| 2| 10| 0| 16371521-0e95-468e-821e-bcd34f6b8a7d-2-1| peloton-mesos-agent1| 0.1| 0| 2| 10| 0| 16371521-0e95-468e-821e-bcd34f6b8a7d-1-1| peloton-mesos-agent0| 0.1| 0| 2| 10| 0| f5ebd551-24ec-4958-9337-ebc0280aef51-1-1| peloton-mesos-agent0| 0.1| 0| 2| 10| 0| f5ebd551-24ec-4958-9337-ebc0280aef51-0-1| peloton-mesos-agent0| 0.1| 0| 2| 10| 0| Total| | 0.6| 0| 12| 60| 0| ``` ``` $ bin/peloton resmgr tasks active TaskID| State| Hostname| Reason|Last Update Time f5ebd551-24ec-4958-9337-ebc0280aef51-2-3| RUNNING| peloton-mesos-agent1| task running|2019-06-13 17:33:03.9885956 +0000 UTC m=+79.255003801 16371521-0e95-468e-821e-bcd34f6b8a7d-0-3| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:32:51.4471924 +0000 UTC m=+66.713602001 16371521-0e95-468e-821e-bcd34f6b8a7d-1-3| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:32:51.4466702 +0000 UTC m=+66.713080001 16371521-0e95-468e-821e-bcd34f6b8a7d-2-3| RUNNING| peloton-mesos-agent1| task running|2019-06-13 17:32:51.4470689 +0000 UTC m=+66.713477701 f5ebd551-24ec-4958-9337-ebc0280aef51-1-3| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:33:03.9881695 +0000 UTC m=+79.254576901 f5ebd551-24ec-4958-9337-ebc0280aef51-0-3| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:33:03.9877578 +0000 UTC m=+79.254166001 f5ebd551-24ec-4958-9337-ebc0280aef51-2-1| RUNNING| peloton-mesos-agent1| task running|2019-06-13 17:32:58.2368606 +0000 UTC m=+73.503269001 16371521-0e95-468e-821e-bcd34f6b8a7d-0-1| RUNNING| 
peloton-mesos-agent0| task running|2019-06-13 17:32:45.5889834 +0000 UTC m=+60.855391501 16371521-0e95-468e-821e-bcd34f6b8a7d-2-1| RUNNING| peloton-mesos-agent1| task running|2019-06-13 17:32:45.5887879 +0000 UTC m=+60.855196301 16371521-0e95-468e-821e-bcd34f6b8a7d-1-1| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:32:45.5891792 +0000 UTC m=+60.855587301 f5ebd551-24ec-4958-9337-ebc0280aef51-1-1| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:32:58.1330845 +0000 UTC m=+73.399491701 f5ebd551-24ec-4958-9337-ebc0280aef51-0-1| RUNNING| peloton-mesos-agent0| task running|2019-06-13 17:32:58.1277643 +0000 UTC m=+73.394172901 ``` Reviewers: varung, O521 Project Peloton: Add reviewers Reviewed By: varung, O521 Project Peloton: Add reviewers Subscribers: jenkins Tags: #peloton Maniphest Tasks: T2869643 Differential Revision: https://code.uberinternal.com/D2842031
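The orphan-task definition in this commit covers tasks that still hold resources but are no longer in the resource manager tracker. Finding them amounts to a set difference plus a resource sum, similar to what the `Total` row of the `resmgr tasks orphan` output reports. The Go sketch below uses assumed, simplified types. `resources`, `orphanTasks` and the two input maps are hypothetical and are not Peloton's tracker API.

```go
package main

import (
	"fmt"
	"sort"
)

// resources is a hypothetical, simplified resource vector.
type resources struct {
	CPU   float64
	MemMB float64
}

// orphanTasks returns the Mesos task IDs that hold resources but are not
// tracked, plus the total resources those orphans hold.
func orphanTasks(held map[string]resources, tracked map[string]bool) ([]string, resources) {
	var ids []string
	var total resources
	for id, r := range held {
		if tracked[id] {
			continue // still tracked, so not an orphan
		}
		ids = append(ids, id)
		total.CPU += r.CPU
		total.MemMB += r.MemMB
	}
	sort.Strings(ids) // deterministic order for display
	return ids, total
}

func main() {
	held := map[string]resources{
		"job1-0-1": {CPU: 0.5, MemMB: 2},
		"job1-0-2": {CPU: 0.5, MemMB: 2},
		"job2-1-1": {CPU: 1, MemMB: 8},
	}
	tracked := map[string]bool{"job1-0-2": true}
	ids, total := orphanTasks(held, tracked)
	fmt.Println(ids, total)
}
```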
Commit: | 8cce6d6 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add endpoint to debug in-place update Test Plan: ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton hostmgr hosts Hostname| CPU| GPU| MEM| Disk| State| Task Hold peloton-mesos-agent0| 3.90| 0.00| 2046.00 MB| 19990.00 MB| held|280a51e5-2a3a-4e0a-b200-15b3f943fde5-1 peloton-mesos-agent1| 4.00| 0.00| 2048.00 MB| 20000.00 MB| ready| peloton-mesos-agent2| 3.90| 0.00| 2046.00 MB| 19990.00 MB| ready| (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton hostmgr hosts Hostname| CPU| GPU| MEM| Disk| State| Task Hold peloton-mesos-agent0| 0.00| 0.00| 0.00 MB| 0.00 MB| ready| peloton-mesos-agent1| 4.00| 0.00| 2048.00 MB| 20000.00 MB| ready| peloton-mesos-agent2| 3.90| 0.00| 2046.00 MB| 19990.00 MB| ready| (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton hostmgr hosts Hostname| CPU| GPU| MEM| Disk| State| Task Hold peloton-mesos-agent0| 0.00| 0.00| 0.00 MB| 0.00 MB| ready| peloton-mesos-agent1| 4.00| 0.00| 2048.00 MB| 20000.00 MB| ready| peloton-mesos-agent2| 4.00| 0.00| 2048.00 MB| 20000.00 MB| placing|280a51e5-2a3a-4e0a-b200-15b3f943fde5-2 ``` Reviewers: O521 Project Peloton: Add reviewers, varung, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, varung Subscribers: jenkins Maniphest Tasks: T2466573 Differential Revision: https://code.uberinternal.com/D2909731
Commit: | abda8c9 | |
---|---|---|
Author: | Kevin Xu | |
Committer: | Kevin Xu |
Change watch stream buffer overflow error code to Aborted Summary: Change watch stream buffer overflow error code to Aborted Reviewers: O521 Project Peloton: Add reviewers, apoorvaj, jasn Reviewed By: O521 Project Peloton: Add reviewers, apoorvaj Subscribers: jenkins, apoorvaj Differential Revision: https://code.uberinternal.com/D2889077
Commit: | 3de2f67 | |
---|---|---|
Author: | Aihua Xu | |
Committer: | Aihua Xu |
Add integration test for host reservation feature Summary: Add a basic integration test for the host reservation feature to verify that the task enters the reserved state, runs on the reserved hosts, and eventually completes. The logic that detects whether all placement cycles are exhausted is also updated, so the task enters the reserved state after PlacementRetryCycle * PlacementAttemptsPerCycle attempts instead of the original PlacementRetryCycle * PlacementAttemptsPerCycle + 1. Test Plan: Integration test. Reviewers: O521 Project Peloton: Add reviewers, mabansal, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, mabansal, apoorvaj Subscribers: avyas, apoorvaj, varung, jenkins Tags: #peloton_batch Maniphest Tasks: T2920459 Differential Revision: https://code.uberinternal.com/D2846095
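This commit updates the reservation condition: a task is reserved after `PlacementRetryCycle * PlacementAttemptsPerCycle` attempts rather than one more. Assuming a counter of failed placement attempts, the condition reduces to a single comparison. The struct and method below are hypothetical; only the two config names come from the commit message.

```go
package main

import "fmt"

// placementRetryPolicy is a hypothetical model of the two config values
// named in the commit message.
type placementRetryPolicy struct {
	PlacementRetryCycle       int // how many placement cycles a task gets
	PlacementAttemptsPerCycle int // how many attempts make up one cycle
}

// readyForHostReservation reports whether a task that has failed
// `attempts` placement attempts has exhausted every cycle.
func (p placementRetryPolicy) readyForHostReservation(attempts int) bool {
	return attempts >= p.PlacementRetryCycle*p.PlacementAttemptsPerCycle
}

func main() {
	p := placementRetryPolicy{PlacementRetryCycle: 3, PlacementAttemptsPerCycle: 2}
	for attempts := 5; attempts <= 6; attempts++ {
		fmt.Println(attempts, p.readyForHostReservation(attempts))
	}
}
```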
Commit: | 0eb04c1 | |
---|---|---|
Author: | Amit Bose | |
Committer: | Amit Bose |
Placement hints for batch jobs Summary: In the batch placement plugin, we try to pack as many tasks on a host as possible, which can have adverse effects on task performance due to interference (for example, if all tasks require high disk I/O). Such jobs can work better if the tasks are spread out across many hosts. This change adds the notion of a placement hint to batch jobs. Two strategies are supported: * Pack: the current behavior * Spread: distribute tasks over many hosts The spread strategy doesn't work very well when defragmentation is enabled for bin-packing, so defragmentation is explicitly disabled for such jobs. For jobs that use a spread strategy, we calculate a "spread quotient", defined as the ratio of the number of tasks to the number of unique hosts. The average of spread quotients is tracked to measure the effectiveness of spreading; values closer to 1 are desirable. Test Plan: On minicluster, created jobs with default (pack) and spread placement hints. Verified that tasks were placed as expected in both cases. Enabled defragmentation, and verified that jobs with spread hint are placed as expected. Reviewers: O521 Project Peloton: Add reviewers, mabansal, avyas, varung Reviewed By: O521 Project Peloton: Add reviewers, mabansal Subscribers: apoorvaj, jenkins Maniphest Tasks: T2938775 Differential Revision: https://code.uberinternal.com/D2847817
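The spread quotient defined in this commit is the number of tasks divided by the number of unique hosts. The Go sketch below computes it and its average. It is illustrative, not the placement engine's code. The function names, the input shape (one hostname per placed task) and skipping jobs with no placed tasks when averaging are all assumptions.

```go
package main

import "fmt"

// spreadQuotient returns tasks / unique hosts for one job's placements.
// A value of 1.0 means every task landed on a different host.
func spreadQuotient(taskHosts []string) float64 {
	if len(taskHosts) == 0 {
		return 0
	}
	unique := make(map[string]struct{}, len(taskHosts))
	for _, h := range taskHosts {
		unique[h] = struct{}{}
	}
	return float64(len(taskHosts)) / float64(len(unique))
}

// averageSpreadQuotient averages the quotient over several jobs,
// skipping jobs that have no placed tasks.
func averageSpreadQuotient(jobs [][]string) float64 {
	var sum float64
	n := 0
	for _, hosts := range jobs {
		if len(hosts) == 0 {
			continue
		}
		sum += spreadQuotient(hosts)
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	spread := []string{"agent0", "agent1", "agent2"}
	packed := []string{"agent0", "agent0", "agent1", "agent1"}
	fmt.Println(spreadQuotient(spread), spreadQuotient(packed))
	fmt.Println(averageSpreadQuotient([][]string{spread, packed}))
}
```

A fully spread job scores 1.0, while packing four tasks onto two hosts scores 2.0. That is why values closer to 1 indicate effective spreading.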
Commit: | c7eaf10 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Implement limit in GetPod API request to fetch only a subset of the previous runs Summary: Remove current_only, which was used to fetch only the latest run, from the API; aurora bridge instead uses a limit of 1 to fetch the latest run. Test Plan: Tested using minicluster. Reviewers: zhixin, kevinxu, O521 Project Peloton: Add reviewers Reviewed By: kevinxu, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2817133 Differential Revision: https://code.uberinternal.com/D2854195
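The Go sketch below shows one way a `limit` could return only the most recent runs, where a limit of 1 yields just the latest. Several details are assumptions for illustration, not the GetPod API's documented semantics: the `podRun` type, the oldest-to-newest input order, the `uint32` type and treating 0 as "no limit".

```go
package main

import "fmt"

// podRun is a hypothetical record of one run of a pod.
type podRun struct {
	RunID int
	State string
}

// latestRuns returns up to limit runs, newest first, from runs ordered
// oldest to newest. Treating limit == 0 as "no limit" is an assumption.
func latestRuns(runs []podRun, limit uint32) []podRun {
	n := len(runs)
	if limit != 0 && int(limit) < n {
		n = int(limit)
	}
	out := make([]podRun, 0, n)
	for i := len(runs) - 1; i >= 0 && len(out) < n; i-- {
		out = append(out, runs[i])
	}
	return out
}

func main() {
	history := []podRun{{RunID: 1, State: "POD_STATE_FAILED"}, {RunID: 2, State: "POD_STATE_FAILED"}, {RunID: 3, State: "POD_STATE_RUNNING"}}
	fmt.Println(latestRuns(history, 1))
	fmt.Println(len(latestRuns(history, 0)))
}
```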
Commit: | b9e7b89 | |
---|---|---|
Author: | Arpit Goyal | |
Committer: | Arpit Goyal |
Add watch service to publish mesos task status update events Summary: This functionality supports listening to Mesos task update events using the host manager client. It is required to publish status updates as watch topics, because directly dumping the event stream results in a gRPC data size error. Test Plan: 1. Ran stateless job service.yaml locally and was able to listen to Mesos task update events {F118621325} Tested after incorporating changes mentioned in the review {F119852325} Reviewers: O521 Project Peloton: Add reviewers, varung Reviewed By: O521 Project Peloton: Add reviewers, varung Subscribers: jenkins Maniphest Tasks: T2977595 Differential Revision: https://code.uberinternal.com/D2829347
Commit: | ec09700 | |
---|---|---|
Author: | Aihua Xu | |
Committer: | Aihua Xu |
Implement the logic for host reservation in resource manager to mark tasks ready for host reservation and send tasks to placement. Summary: Implements retry logic in the resource manager so that each task is retried for a number of attempts per cycle, over a number of cycles. After that, the task is ready for host reservation. Such a task is placed in the ready queue and given a larger timeout. In the placement engine, tasks polled from the resource manager are placed into reserveQueue if they are ready for host reservation. Reserver start()/stop() is also added to the hostmgr engine. A configuration option, enable_host_reservation, in the resource manager enables or disables the host reservation feature. Test Plan: Tested locally to make sure the tasks get reserved and launched properly. Reviewers: mabansal, O521 Project Peloton: Add reviewers, zhixin, amitbose Reviewed By: mabansal, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2687653 Differential Revision: https://code.uberinternal.com/D2657913
Commit: | 70cd156 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add private QueryJobCache API Summary: Add a handler that queries the job manager's internal cache for job info. Test Plan: ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton jobmgr query-job-cache -l testKey0=testVal01 -n TestSpec1 result: - job_id: value: d1310a88-e7f1-4a5e-bdeb-be98fd8d6730 name: TestSpec1 - job_id: value: d5ccc804-0747-4e67-b0a9-7b649b35191d name: TestSpec1 ``` Reviewers: apoorvaj, kevinxu, O521 Project Peloton: Add reviewers Reviewed By: kevinxu, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2909325 Differential Revision: https://code.uberinternal.com/D2789659
Commit: | 003da74 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add a CLI command to fetch list of throttled tasks in the system Summary: Add a private API to allow fetching list of throttled tasks, and add a CLI which can invoke this API. Test Plan: Tested on minicluster Reviewers: zhixin, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2810865
Commit: | 877e49a | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Revert "Bump up mesos agent to 1.6.2" Summary: This reverts commit 2389da461f3c0f2c8bf851126e19f8f6bd2a7098. The bump was done to fix the stuck Mesos master issue seen in the stateless performance test `perf_test_stateless_job_update`. Since the bump didn't fix the issue, this reverts the commit. Reviewers: amitbose, O521 Project Peloton: Add reviewers Reviewed By: amitbose, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2818991
Commit: | f0b7dbf | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add private API for job manager Summary: Add a private API for the job manager and move fetching the cache and refreshing a job to it. All APIs that fetch job manager state and should not be exposed to external customers belong in the private API. Removing these APIs from the public API remains TBD until the new CLI version has been rolled out. Test Plan: Tested using minicluster. Reviewers: zhixin, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2803991
Commit: | 62e309f | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add workflow completion time Test Plan: ``` workflow_info: events: - state: WORKFLOW_STATE_SUCCEEDED timestamp: 2019-04-30T18:54:35Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-30T18:54:34Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-30T18:54:34Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-30T18:54:33Z type: WORKFLOW_TYPE_UPDATE instances_added: - from: 0 to: 0 opaque_data: data: "" status: completion_time: 2019-04-30T18:54:35Z creation_time: 2019-04-30T18:54:33.925Z num_instances_completed: 1 num_instances_failed: 0 num_instances_remaining: 0 prev_state: WORKFLOW_STATE_ROLLING_FORWARD prev_version: value: 0-1-1 state: WORKFLOW_STATE_SUCCEEDED type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-30T18:54:35.609Z version: value: 1-1-1 update_spec: batch_size: 0 in_place: false max_instance_retries: 0 max_tolerable_instance_failures: 0 rollback_on_failure: false start_paused: false start_pods: false ``` Reviewers: apoorvaj, sachins, O521 Project Peloton: Add reviewers Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: adityacb, jenkins Maniphest Tasks: T2500721 Differential Revision: https://code.uberinternal.com/D2738467
Commit: | 2389da4 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Bump up mesos agent to 1.6.2 Reviewers: sachins, O521 Project Peloton: Add reviewers Reviewed By: sachins, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2775523
Commit: | 972baf5 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Add `hostname` field to `GetActiveTasksResponse.TaskEntry` Summary: This diff makes the following changes 1. Add hostname field to `GetActiveTasksResponse.TaskEntry` 2. GetActiveTasksResponse contains all tasks for which resources are held in resource manager. In other words, the response includes tasks in tracker plus orphan tasks 3. The `TaskID` field in `GetActiveTasksResponse.TaskEntry` is `MesosTaskID` instead of `PelotonTaskID` Test Plan: Unit tests Reviewers: avyas, O521 Project Peloton: Add reviewers Reviewed By: avyas, O521 Project Peloton: Add reviewers Subscribers: jenkins Tags: #peloton Differential Revision: https://code.uberinternal.com/D2763689
Commit: | 36542fa | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Cannot update with start_pods set when job is being killed and vice versa Test Plan: stop a job while an update with start_pods set is in progress: ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless replace f7ac6663-28c0-4a86-a680-6c70bcd3d031 ~/Desktop/testSpec.yaml 1 /DefaultResPool 1-1-1 --start-pods New EntityVersion: 2-1-2 (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless stop f7ac6663-28c0-4a86-a680-6c70bcd3d031 2-1-2 peloton: error: code:aborted message:job has active update with start_pods set, cannot stop the job now // wait for the update to finish (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless stop f7ac6663-28c0-4a86-a680-6c70bcd3d031 2-1-2 Job stopped. New EntityVersion: 2-2-2 ``` Reviewers: O521 Project Peloton: Add reviewers, sachins, apoorvaj Reviewed By: O521 Project Peloton: Add reviewers, sachins Subscribers: kevinxu, varung, jenkins Maniphest Tasks: T2687825 Differential Revision: https://code.uberinternal.com/D2734057
Commit: | f4136e0 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Change key of host-to-tasks map to mesos-task-id from peloton-task-id Summary: When jobmgr calls EnqueueGangs for a new Mesos task (prev runID + 1) before the Mesos event for the previous run is processed by resmgr, the rmtask in the tracker is overwritten but the hosts:tasks map is not updated. This leaves incorrect information in the host-to-tasks map, which affects all subsequent placement decisions. Keying the host-to-tasks map by mesos-task-id (instead of peloton-task-id) decouples the map from the RMTask, preventing the race between the "EnqueueGangs call" and the "Mesos event processing by resmgr". This diff makes the following changes. 1. Modify the placements (host:tasks) map in tracker to track tasks (RMTasks) by mesos-task-id instead of peloton-task-id 2. Keep track of the orphan RMTasks in tracker. 3. Deprecate `Placement.Tasks` (peloton-tasks) in favor of `Placement.TaskIDs` (containing both peloton and mesos taskIDs) Test Plan: - Unit tests - Enable auto-rollback aurorabridge integration test Reviewers: varung, O521 Project Peloton: Add reviewers, avyas Reviewed By: varung, O521 Project Peloton: Add reviewers, avyas Subscribers: zhixin, apoorvaj, kevinxu, jenkins Tags: #peloton Maniphest Tasks: T2667303 Differential Revision: https://code.uberinternal.com/D2685969
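This fix relies on every run of a Peloton task having its own Mesos task ID, so a late event for one run can only remove that run's entry. The Go sketch below is a hypothetical model of a host-to-tasks map keyed that way. The `hostTasks` type and the `job-0-1` style IDs are illustrative, not Peloton's tracker code.

```go
package main

import "fmt"

// hostTasks maps hostname -> set of Mesos task IDs placed on that host.
type hostTasks map[string]map[string]struct{}

// add records that a specific run (Mesos task ID) is placed on host.
func (m hostTasks) add(host, mesosTaskID string) {
	if m[host] == nil {
		m[host] = make(map[string]struct{})
	}
	m[host][mesosTaskID] = struct{}{}
}

// remove deletes one run's entry. Other runs of the same Peloton task
// have different Mesos task IDs, so they are untouched.
func (m hostTasks) remove(host, mesosTaskID string) {
	delete(m[host], mesosTaskID) // no-op if the host is unknown
	if len(m[host]) == 0 {
		delete(m, host)
	}
}

func main() {
	m := hostTasks{}
	m.add("agent0", "job-0-1") // run 1 of Peloton task job-0
	m.add("agent1", "job-0-2") // run 2 enqueued before run 1's event arrives
	m.remove("agent0", "job-0-1")
	fmt.Println(len(m), len(m["agent1"]))
}
```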
Commit: | cb576fb | |
---|---|---|
Author: | Kevin Xu | |
Committer: | Kevin Xu |
Add flag to only return current pod_info for GetPod API Summary: Add flag to only return current pod_info for GetPod API, and enable the flag in aurora bridge event publisher. Reviewers: zhixin, sachins, varung, apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2788205 Differential Revision: https://code.uberinternal.com/D2729881
Commit: | 17fd4ed | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Specify InvalidEntityVersionError to be part of API Summary: InvalidEntityVersionError and UnexpectedVersionError are special errors that require a specific action from users. Specifically, the error means there is a write-after-read conflict: multiple users are trying to operate on the job at the same time. Therefore, users who receive the error need to read the job again to get the latest job status. Users need the error code and error message to recognize this particular error, so the error is part of the API. Unfortunately, neither yarpc v1 nor proto supports errors as part of an API. This diff adds comments and tests to remind developers that any change to the errors in the file can be a breaking change. It also changes the error type of InvalidEntityVersionError, because InvalidEntityVersionError and UnexpectedVersionError are both caused by the same kind of concurrency issue and should share the same error code. Reviewers: apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: kevinxu, jenkins Maniphest Tasks: T2342747 Differential Revision: https://code.uberinternal.com/D2699399
Commit: | 5356945 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Deprecate respool path in QuerySpec which is not used in JobQuery Reviewers: apoorvaj, sachins, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, sachins, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2710333
Commit: | f806db5 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Change GetWorkflowEventsRequest.limit and ListJobWorkflowsRequest.instance_events_limit to uint32 type Test Plan: Unit tests Reviewers: zhixin, apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: zhixin, O521 Project Peloton: Add reviewers Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2709541
Commit: | 158ef6b | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add PodConfigurationStateStats in job status Summary: PodConfigurationStateStats provides a way for users to get the number of pods running a certain version in a certain state. Test Plan: ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_PENDING: 20 POD_STATE_RUNNING: 10 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 20 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 10 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754438737243400" updated_by: "" version: "12" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 2 - 4 - 6 - 7 - 8 - 9 - 10 - 11 - 13 - 14 - 16 - 17 - 18 - 20 - 22 - 23 - 25 - 28 - 29 num_instances_completed: 10 num_instances_failed: 0 num_instances_remaining: 20 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:13:58.753Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 2 - 4 - 6 - 7 - 8 - 9 - 10 - 
11 - 13 - 14 - 16 - 17 - 18 - 20 - 22 - 23 - 25 - 28 - 29 num_instances_completed: 10 num_instances_failed: 0 num_instances_remaining: 20 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:13:58.753Z version: value: 2-1-2 update_spec: batch_size: 0 in_place: false max_instance_retries: 0 max_tolerable_instance_failures: 0 rollback_on_failure: false start_paused: false start_pods: false (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_PENDING: 10 POD_STATE_RUNNING: 20 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 10 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 20 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754445561984100" updated_by: "" version: "14" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 6 - 8 - 9 - 14 - 16 - 17 - 23 - 25 - 29 num_instances_completed: 20 num_instances_failed: 0 num_instances_remaining: 10 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:05.539Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: 
WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 6 - 8 - 9 - 14 - 16 - 17 - 23 - 25 - 29 num_instances_completed: 20 num_instances_failed: 0 num_instances_remaining: 10 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:05.539Z version: value: 2-1-2 update_spec: batch_size: 0 in_place: false max_instance_retries: 0 max_tolerable_instance_failures: 0 rollback_on_failure: false start_paused: false start_pods: false (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_RUNNING: 30 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 0 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 30 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754451684076200" updated_by: "" version: "16" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z num_instances_completed: 30 num_instances_failed: 0 num_instances_remaining: 0 prev_state: WORKFLOW_STATE_ROLLING_FORWARD prev_version: value: 1-1-2 state: WORKFLOW_STATE_SUCCEEDED type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:11.67Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_SUCCEEDED timestamp: 2019-04-08T20:14:11Z type: WORKFLOW_TYPE_UPDATE - 
state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z num_instances_completed: 30 num_instances_failed: 0 num_instances_remaining: 0 prev_state: WORKFLOW_STATE_ROLLING_FORWARD prev_version: value: 1-1-2 state: WORKFLOW_STATE_SUCCEEDED type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:11.67Z ``` Reviewers: apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: kevinxu, sachins, jenkins Maniphest Tasks: T2723669 Differential Revision: https://code.uberinternal.com/D2698701
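The `pod_configuration_state_stats` field in the output above groups pod counts first by configuration version (for example `2-0-0`) and then by pod state. The Go sketch below shows that aggregation. It uses a hypothetical `podInfo` input type and plain string states rather than the generated protobuf types.

```go
package main

import "fmt"

// podInfo is a hypothetical, simplified view of one pod.
type podInfo struct {
	Version string // configuration version, e.g. "2-0-0"
	State   string // e.g. "POD_STATE_RUNNING"
}

// podConfigurationStateStats counts pods per version, then per state.
func podConfigurationStateStats(pods []podInfo) map[string]map[string]uint32 {
	stats := make(map[string]map[string]uint32)
	for _, p := range pods {
		if stats[p.Version] == nil {
			stats[p.Version] = make(map[string]uint32)
		}
		stats[p.Version][p.State]++
	}
	return stats
}

func main() {
	pods := []podInfo{
		{Version: "2-0-0", State: "POD_STATE_RUNNING"},
		{Version: "2-0-0", State: "POD_STATE_RUNNING"},
		{Version: "2-0-0", State: "POD_STATE_PENDING"},
	}
	fmt.Println(podConfigurationStateStats(pods))
}
```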
Commit: | bc3507e | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add PodConfigurationStateStats in job status Summary: PodConfigurationStateStats provides a way for users to get the number of pods running a certain version in a certain state. Test Plan: ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_PENDING: 20 POD_STATE_RUNNING: 10 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 20 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 10 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754438737243400" updated_by: "" version: "12" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 2 - 4 - 6 - 7 - 8 - 9 - 10 - 11 - 13 - 14 - 16 - 17 - 18 - 20 - 22 - 23 - 25 - 28 - 29 num_instances_completed: 10 num_instances_failed: 0 num_instances_remaining: 20 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:13:58.753Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 2 - 4 - 6 - 7 - 8 - 9 - 10 - 
11 - 13 - 14 - 16 - 17 - 18 - 20 - 22 - 23 - 25 - 28 - 29 num_instances_completed: 10 num_instances_failed: 0 num_instances_remaining: 20 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:13:58.753Z version: value: 2-1-2 update_spec: batch_size: 0 in_place: false max_instance_retries: 0 max_tolerable_instance_failures: 0 rollback_on_failure: false start_paused: false start_pods: false (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_PENDING: 10 POD_STATE_RUNNING: 20 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 10 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 20 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754445561984100" updated_by: "" version: "14" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 6 - 8 - 9 - 14 - 16 - 17 - 23 - 25 - 29 num_instances_completed: 20 num_instances_failed: 0 num_instances_remaining: 10 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:05.539Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: 
WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z instances_current: - 0 - 6 - 8 - 9 - 14 - 16 - 17 - 23 - 25 - 29 num_instances_completed: 20 num_instances_failed: 0 num_instances_remaining: 10 prev_state: WORKFLOW_STATE_INITIALIZED prev_version: value: 1-1-2 state: WORKFLOW_STATE_ROLLING_FORWARD type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:05.539Z version: value: 2-1-2 update_spec: batch_size: 0 in_place: false max_instance_retries: 0 max_tolerable_instance_failures: 0 rollback_on_failure: false start_paused: false start_pods: false (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get fab51751-6131-4ee5-b333-f0094d12a30a job_info: job_id: value: fab51751-6131-4ee5-b333-f0094d12a30a status: creation_time: 2019-04-08T20:13:30.8899375Z desired_state: JOB_STATE_RUNNING pod_configuration_state_stats: 2-0-0: stateStats: POD_STATE_RUNNING: 30 pod_stats: POD_STATE_DELETED: 0 POD_STATE_FAILED: 0 POD_STATE_INITIALIZED: 0 POD_STATE_INVALID: 0 POD_STATE_KILLED: 0 POD_STATE_KILLING: 0 POD_STATE_LAUNCHED: 0 POD_STATE_LAUNCHING: 0 POD_STATE_LOST: 0 POD_STATE_PENDING: 0 POD_STATE_PLACED: 0 POD_STATE_PLACING: 0 POD_STATE_PREEMPTING: 0 POD_STATE_READY: 0 POD_STATE_RUNNING: 30 POD_STATE_STARTING: 0 POD_STATE_SUCCEEDED: 0 revision: created_at: "1554754410889937500" updated_at: "1554754451684076200" updated_by: "" version: "16" state: JOB_STATE_RUNNING version: value: 2-1-2 workflow_status: creation_time: 2019-04-08T20:13:54.963Z num_instances_completed: 30 num_instances_failed: 0 num_instances_remaining: 0 prev_state: WORKFLOW_STATE_ROLLING_FORWARD prev_version: value: 1-1-2 state: WORKFLOW_STATE_SUCCEEDED type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:11.67Z version: value: 2-1-2 workflow_info: events: - state: WORKFLOW_STATE_SUCCEEDED timestamp: 2019-04-08T20:14:11Z type: WORKFLOW_TYPE_UPDATE - 
state: WORKFLOW_STATE_ROLLING_FORWARD timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:55Z type: WORKFLOW_TYPE_UPDATE - state: WORKFLOW_STATE_INITIALIZED timestamp: 2019-04-08T20:13:54Z type: WORKFLOW_TYPE_UPDATE instances_updated: - from: 0 to: 29 opaque_data: data: "" status: creation_time: 2019-04-08T20:13:54.963Z num_instances_completed: 30 num_instances_failed: 0 num_instances_remaining: 0 prev_state: WORKFLOW_STATE_ROLLING_FORWARD prev_version: value: 1-1-2 state: WORKFLOW_STATE_SUCCEEDED type: WORKFLOW_TYPE_UPDATE update_time: 2019-04-08T20:14:11.67Z ``` Reviewers: apoorvaj, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: kevinxu, sachins, jenkins Maniphest Tasks: T2723669 Differential Revision: https://code.uberinternal.com/D2698701
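A minimal Go sketch of the kind of aggregation PodConfigurationStateStats reports, assuming each pod can be reduced to a configuration version string and a state name. The `Pod` type and `configurationStateStats` function are hypothetical illustrations, not Peloton APIs; the real field is part of the job status, as the test plan output shows.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified pod record: the configuration
// version it runs and its current state.
type Pod struct {
	Version string
	State   string
}

// configurationStateStats counts pods per (version, state) pair, in the
// spirit of pod_configuration_state_stats in the job status.
func configurationStateStats(pods []Pod) map[string]map[string]uint32 {
	stats := make(map[string]map[string]uint32)
	for _, p := range pods {
		byState, ok := stats[p.Version]
		if !ok {
			byState = make(map[string]uint32)
			stats[p.Version] = byState
		}
		byState[p.State]++
	}
	return stats
}

func main() {
	pods := []Pod{
		{Version: "2-0-0", State: "POD_STATE_RUNNING"},
		{Version: "2-0-0", State: "POD_STATE_PENDING"},
		{Version: "1-0-0", State: "POD_STATE_RUNNING"},
	}
	for version, byState := range configurationStateStats(pods) {
		fmt.Println(version, byState)
	}
}
```

Unlike the flat pod_stats counters, grouping by version lets a caller see how far a rolling update has progressed, for example how many pods on the new version are already running.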
Commit: | 504e104 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Deprecate Mesos references in the v1alpha API Summary: 1. Add native definitions to Peloton API to replace the mesos references. 2. Add corresponding kubelet runtime specific configurations. Reviewers: min, amitbose, O521 Project Peloton: Add reviewers Reviewed By: min, amitbose, O521 Project Peloton: Add reviewers Subscribers: jenkins, pourchet, yiran Differential Revision: https://code.uberinternal.com/D2688253
Commit: | 36e13b9 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Add option to specify limit while requesting workflow events Test Plan: Unit tests Reviewers: zhixin, kevinxu, O521 Project Peloton: Add reviewers Reviewed By: kevinxu, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2738273 Differential Revision: https://code.uberinternal.com/D2698249
Commit: | b517542 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add switch to disable kill tasks requests from host manager to Mesos master Summary: Add a flag to disable kill tasks requests from the host manager to the Mesos master. The CLI command is one-way: it can set the flag but not unset it. To unset the flag, a host manager restart is required; the restart will also trigger reconciliation. Reviewers: apoorvaj, amitbose, #peloton Reviewed By: apoorvaj, #peloton Subscribers: jenkins, zhixin Maniphest Tasks: T2685313 Differential Revision: https://code.uberinternal.com/D2665295
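The one-way behavior in this commit can be pictured as a flag that can only be set while the process runs. The sketch below assumes an in-process atomic flag; `killSwitch` and `killTasks` are hypothetical names, and the host manager's real flag handling and Mesos call are not shown.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// killSwitch is a hypothetical one-way switch. Once disabled, kill
// requests are dropped until the process restarts.
type killSwitch struct {
	disabled int32
}

// Disable stops forwarding kill requests. There is deliberately no
// Enable method: resetting the switch requires a restart.
func (k *killSwitch) Disable() { atomic.StoreInt32(&k.disabled, 1) }

// ShouldForward reports whether kill requests may be sent to the master.
func (k *killSwitch) ShouldForward() bool { return atomic.LoadInt32(&k.disabled) == 0 }

// killTasks returns the task IDs that would be forwarded to the Mesos
// master; the actual RPC is out of scope for this sketch.
func killTasks(k *killSwitch, taskIDs []string) []string {
	if !k.ShouldForward() {
		return nil // kill requests are disabled; drop them
	}
	return taskIDs
}

func main() {
	k := &killSwitch{}
	fmt.Println(killTasks(k, []string{"task-1"}))
	k.Disable()
	fmt.Println(killTasks(k, []string{"task-1"}))
}
```

Leaving out an `Enable` method keeps the sketch consistent with the restart requirement: the only way back to the default state is a fresh process, which also gets the reconciliation that a restart triggers.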
Commit: | 53a86fc | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add termination status for update and restart Reviewers: apoorvaj, amitbose, #peloton Reviewed By: apoorvaj, #peloton Subscribers: jenkins Maniphest Tasks: T2421567 Differential Revision: https://code.uberinternal.com/D2663089
Commit: | b45286f | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add SLASpec to JobSummary Summary: The SLA spec is one of the crucial pieces of information in the job summary response, and it is heavily used by aurorabridge. Adding the SLA spec to the job summary avoids calling jobInfo, which is a much heavier call than job summary. The changeset includes: - Add the SLA spec to the job_index table; it does not need to be indexed. - Update the job summary response to include the SLA spec, and update the read/write path to include this information. - Change aurorabridge to use job summary rather than job info. Reviewers: apoorvaj, kevinxu, #peloton Reviewed By: kevinxu, #peloton Subscribers: adityacb, jenkins Maniphest Tasks: T2677531 Differential Revision: https://code.uberinternal.com/D2649799
Commit: | e254b20 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add support for filtering in the pod watch API Summary: Add support for the following filters in the pod watch API 1. Filter based on job-id. 2. Filter based on pod names given a job-id is provided. 3. Filter based on pod labels. This will not filter based on job labels. Test Plan: Tested using minicluster Reviewers: zhixin, kevinxu, #peloton Reviewed By: kevinxu, #peloton Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2640567
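A small Go sketch of how the three filters in this commit could combine when deciding whether to deliver an event. `PodEvent`, `WatchFilter`, and `matches` are hypothetical; the sketch also assumes that every filter label must be present on the pod with an equal value, and that pod names are simply ignored when no job-id is given (the real service may reject such a request instead).

```go
package main

import "fmt"

// PodEvent is a hypothetical, simplified pod event.
type PodEvent struct {
	JobID   string
	PodName string
	Labels  map[string]string // pod labels, not job labels
}

// WatchFilter holds the three filters from this commit. Empty fields
// match every event.
type WatchFilter struct {
	JobID    string
	PodNames []string // only consulted when JobID is set
	Labels   map[string]string
}

// matches reports whether ev passes the filter. Assumption: every
// filter label must be present on the pod with the same value.
func (f WatchFilter) matches(ev PodEvent) bool {
	if f.JobID != "" {
		if ev.JobID != f.JobID {
			return false
		}
		if len(f.PodNames) > 0 && !contains(f.PodNames, ev.PodName) {
			return false
		}
	}
	for k, want := range f.Labels {
		if got, ok := ev.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// contains reports whether name appears in names.
func contains(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	ev := PodEvent{JobID: "job-1", PodName: "job-1-0", Labels: map[string]string{"team": "infra"}}
	f := WatchFilter{JobID: "job-1", PodNames: []string{"job-1-0"}}
	fmt.Println(f.matches(ev))
}
```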
Commit: | 0b9e2ce | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Start update affected pods if StartPods flag is set Test Plan: case1: stop all instances and update the whole job ``` (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless stop c2ddaafd-9c51-4919-baf5-29b538a4a0bc 1-1-1 Job stopped. New EntityVersion: 1-2-1 (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless replace c2ddaafd-9c51-4919-baf5-29b538a4a0bc ~/Desktop/testSpec.yaml 0 /DefaultResPool 1-2-1 --start-pods New EntityVersion: 2-2-2 (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton pod query c2ddaafd-9c51-4919-baf5-29b538a4a0bc Pod ID| Name| State| Container Name| Container State| Healthy| Start Time| Run Time| Host| Message| Reason| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-0-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-0| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:11Z| 00:00:29| peloton-mesos-agent1| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-1-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-1| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:17Z| 00:00:23| peloton-mesos-agent2| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-2-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-2| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:24Z| 00:00:16| peloton-mesos-agent0| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-3-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-3| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:24Z| 00:00:17| peloton-mesos-agent0| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-4-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-4| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:11Z| 00:00:29| peloton-mesos-agent0| | | ``` case 2: stop part of instances, and update partial job ``` // stop two instances (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton pod stop c2ddaafd-9c51-4919-baf5-29b538a4a0bc-0 {} (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton pod stop 
c2ddaafd-9c51-4919-baf5-29b538a4a0bc-1 {} (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless replace c2ddaafd-9c51-4919-baf5-29b538a4a0bc ~/Desktop/testSpec.yaml 0 /DefaultResPool 4-2-4 --start-pods // change instance config for instance 0, instance 1 not affected (env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton pod query c2ddaafd-9c51-4919-baf5-29b538a4a0bc Pod ID| Name| State| Container Name| Container State| Healthy| Start Time| Run Time| Host| Message| Reason| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-0-3| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-0| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:40:34Z| 00:00:11| peloton-mesos-agent2| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-1-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-1| POD_STATE_KILLED| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:17Z| 00:02:53| peloton-mesos-agent2| Command terminated with signal Terminated| | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-2-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-2| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:24Z| 00:11:22| peloton-mesos-agent0| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-3-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-3| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:24Z| 00:11:22| peloton-mesos-agent0| | | c2ddaafd-9c51-4919-baf5-29b538a4a0bc-4-2| c2ddaafd-9c51-4919-baf5-29b538a4a0bc-4| POD_STATE_RUNNING| | CONTAINER_STATE_INVALID| HEALTH_STATE_DISABLED| 2019-03-21T19:29:11Z| 00:11:35| peloton-mesos-agent0| | | ``` Reviewers: apoorvaj, kevinxu, #peloton Reviewed By: kevinxu, #peloton Subscribers: varung, jenkins Maniphest Tasks: T2664629 Differential Revision: https://code.uberinternal.com/D2636219
Commit: | 6d3fda5 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Store task labels in the cache Summary: Task labels in the cache will be used to filter the task events in the watch API. Whenever the configuration version in the task runtime changes, or tasks get replaced in the cache, update the task labels in the cache as well. This ensures that the labels stored in the cache always correspond to the configuration referenced by the task runtime in the cache. Reviewers: zhixin, #peloton Reviewed By: zhixin, #peloton Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2615673
Commit: | 9abc2e1 | |
---|---|---|
Author: | Kevin Xu | |
Committer: | Kevin Xu |
Implement Watch API for pod Summary: Implement the watch API for pod events, leaving out pod filter support for now; clients will receive all pod event changes. The watch API consists of several parts: 1. listener.go: Registers a listener with the job cache, listening for job/task changes and forwarding the events to the watch processor. 2. processor.go: Centralized place for processing and routing all the events to the corresponding clients. It also manages the creation and deletion of the client structs in memory. Each client has its own dedicated event send buffer, whose size is limited by the `buffer_size` config. While the processor pushes events to the buffer, the watch API handler concurrently pulls from the buffer and streams the responses back to the client. Currently, there are two configs which control the watch API behavior: 1. buffer_size: Maximum number of events that will be buffered in memory before the server throws an error and closes the connection. Note that this number does not count the events that are already buffered in the TCP connection. 2. max_client: Maximum number of concurrent clients the watch API is able to handle. If the number is reached, an error will be thrown immediately when Watch() is called. Test Plan: 1. Execute unit tests. 2. Spin up minicluster, start a pod watch client using the peloton cli `peloton watch pod`, create a new peloton job using the cli, and wait for pod events to come in; then open another terminal and use the peloton cli to cancel the watch stream `peloton watch cancel <watch-id>`. The original peloton watch command should terminate. Reviewers: varung, apoorvaj, zhixin Reviewed By: varung Subscribers: jenkins Differential Revision: https://code.uberinternal.com/D2594805
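A rough Go sketch of the processor behavior described in this commit: a per-client buffer bounded by `buffer_size`, an immediate error once `max_client` is reached, and cancellation by watch id. It assumes buffered channels as the per-client buffers; `watchProcessor`, `NewClient`, `Publish`, and `StopClient` are hypothetical names, not the code in processor.go. In this sketch a closed channel is the only signal to the stream handler, so it does not distinguish an overflow from a cancel.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTooManyClients = errors.New("max_client reached")

// watchProcessor is a hypothetical sketch that fans events out to
// per-client buffered channels.
type watchProcessor struct {
	mu         sync.Mutex
	bufferSize int // plays the role of buffer_size
	maxClient  int // plays the role of max_client
	nextID     int
	clients    map[string]chan string
}

func newWatchProcessor(bufferSize, maxClient int) *watchProcessor {
	return &watchProcessor{bufferSize: bufferSize, maxClient: maxClient, clients: make(map[string]chan string)}
}

// NewClient registers a client, failing immediately once maxClient is reached.
func (p *watchProcessor) NewClient() (string, <-chan string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.clients) >= p.maxClient {
		return "", nil, errTooManyClients
	}
	p.nextID++
	id := fmt.Sprintf("watch-%d", p.nextID)
	ch := make(chan string, p.bufferSize)
	p.clients[id] = ch
	return id, ch, nil
}

// Publish pushes an event to every client without blocking. A client whose
// buffer is full is dropped: its channel is closed and it is deregistered.
func (p *watchProcessor) Publish(event string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for id, ch := range p.clients {
		select {
		case ch <- event:
		default:
			close(ch)
			delete(p.clients, id)
		}
	}
}

// StopClient cancels a watch by id; unknown ids are ignored.
func (p *watchProcessor) StopClient(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ch, ok := p.clients[id]; ok {
		close(ch)
		delete(p.clients, id)
	}
}

func main() {
	p := newWatchProcessor(2, 1)
	id, events, err := p.NewClient()
	if err != nil {
		fmt.Println(err)
		return
	}
	p.Publish("pod-0 RUNNING")
	p.StopClient(id)
	// Draining the channel mimics the handler streaming events until close.
	for ev := range events {
		fmt.Println(id, "received", ev)
	}
}
```

Dropping a slow client instead of blocking in `Publish` keeps one stalled stream from holding up event delivery to every other watcher, which matches the commit's choice of failing the connection when the buffer fills.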
Commit: | 0dfecb5 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Remove hold on task when task is killed while not launched Summary: The hold on a host is released when the task is launched. However, if a task is killed before it is launched, the hold on the host is never released. The solution is to let ResMgr call hostmgr to release the hold when it stops tracking a task. Reviewers: apoorvaj, varung, avyas Reviewed By: varung Subscribers: jenkins Maniphest Tasks: T2520317 Differential Revision: https://code.uberinternal.com/D2585699
Commit: | f9de7b2 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Mayank Bansal |
Enable in-place update/restart feature Reviewers: apoorvaj, varung, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2548111 Differential Revision: https://code.uberinternal.com/D2570345
Commit: | 8e61193 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Mayank Bansal |
Put host on hold for in-place update Test Plan: ``` 10 runs of jobs with 30 tasks without batch size: non-in-place: 2019-02-01 14:58:52,543 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 71 2019-02-15 13:57:07,711 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 1 10 runs of jobs with 20 tasks without batch size: non-in-place: 2019-02-01 15:28:31,096 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 72 2019-02-15 14:11:34,756 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 1 ``` Reviewers: apoorvaj, varung, O521 Project Peloton: Add reviewers Reviewed By: varung, O521 Project Peloton: Add reviewers Subscribers: yuweit, jenkins Maniphest Tasks: T2495899, T2503699 Differential Revision: https://code.uberinternal.com/D2490689
Commit: | 6b83ed4 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Enable in-place update/restart feature Reviewers: apoorvaj, varung, O521 Project Peloton: Add reviewers Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers Subscribers: jenkins Maniphest Tasks: T2548111 Differential Revision: https://code.uberinternal.com/D2570345
Commit: | e4902e6 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Put host on hold for in-place update Test Plan: ``` 10 runs of jobs with 30 tasks without batch size: non-in-place: 2019-02-01 14:58:52,543 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 71 2019-02-15 13:57:07,711 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 1 10 runs of jobs with 20 tasks without batch size: non-in-place: 2019-02-01 15:28:31,096 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 72 2019-02-15 14:11:34,756 - tests.integration.stateless_job_test.test_update - INFO - total mismatch: 1 ``` Reviewers: apoorvaj, varung, O521 Project Peloton: Add reviewers Reviewed By: varung, O521 Project Peloton: Add reviewers Subscribers: yuweit, jenkins Maniphest Tasks: T2495899, T2503699 Differential Revision: https://code.uberinternal.com/D2490689
Commit: | 53786d2 | |
---|---|---|
Author: | Min Cai |
Initial commit to sync Peloton repo to github. This commit squashes the following changes: 2019-02-27 d33c0d29 Mayank Bansal Fixing production config read for docker 2019-02-27 bc50e385 Cody Gibb Change aurorabridge port from 8082 to 5396 2019-02-27 26e58269 Varun Gupta Delete job if in PENDING state 2019-02-13 21c5f4e9 Anant Vyas Change Timer to Histogram for SLA measurement 2019-02-26 834ae990 Kevin Xu Fixes regarding AuroraBridge job labels 2019-02-26 e346a95d Apoorva Jindal Add HTTP health check support to Peloton 2019-02-26 3da53fa7 Kevin Xu Remove link to phab in bridge integration test 2019-02-26 1f6a1dba Mayank Bansal Fixing documents 2019-02-25 34a0654d Mayank Bansal Adding slack channel to readme 2019-02-07 8c1b4b7e Anant Vyas [resource manager] API handler should be blocked until recovery is completed 2019-02-25 3397334e Min Cai Remove more uber internal links 2019-02-25 4309352b Kevin Xu Add integration tests for aurorabridge read path 2019-02-25 d1e18205 Anant Vyas Fix the `docker-push` script 2019-02-25 38ffe036 Tyra Ho Cassandra ORM Secret migration. 
2019-02-25 af8ac7ae Anant Vyas Remove internal docker registry references 2019-02-22 b18327fc Cody Gibb Fix version used for workflow actions in aurorabridge 2019-02-21 cf648c92 Mayank Bansal Removing example files and adding them into client repo 2019-02-22 c67da88b Kevin Xu Wait for deletion to finish for each aurora bridge integration test 2019-02-22 f5444829 Zhixin Wen Fix exponential backoff for failed task retry 2019-02-21 8db83f9a Sishi Long Add PodEvents object to ORM 2019-02-21 a4aa070e Mayank Bansal Removing arc config from the repo 2019-02-21 6ab8fd8f Mayank Bansal Removing uberinternal references from code 2019-02-20 e76e46ae Aditya Bhave Add JobConfigOps to ORM code 2019-02-20 e99ea715 Mayank Bansal Removing production configs from peloton repo 2019-02-19 205df2bf Sachin Sharma Add jobType field to task cache 2019-02-19 02b95c38 Varun Gupta Use yaml file for aurora job configs 2019-02-19 e9015b4d Aditya Bhave Fix regression in cached job Delete 2019-02-15 a932ca01 Kevin Xu Fix job name to job id mapping being created multiple times 2019-02-15 aabbb226 Varun Gupta Add instance workflow events with list job updates 2019-02-05 d11fabb1 Apoorva Jindal Add performance benchmark tests for stateless 2018-12-28 f2b906d7 Kevin Xu Change minicluster components to talk to each other via local container ip 2019-02-14 5d6dd58d Aditya Bhave Add support for job_name_to_id to ORM 2018-12-12 cc206b68 Apoorva Jindal Cleanup of v1alpha APIs 2019-02-14 0958bd82 Cody Gibb Write first aurorabridge integration test w/ utils 2019-02-07 279fe222 Aditya Bhave Delete stateless job ID from active jobs list on delete 2019-02-14 dfbc023f Evelyn Liu Update push registry in prime 2019-02-14 310a8df4 Adam Backer Populate reason field for deadline exceeded tasks 2019-02-13 2c6a4d66 Evelyn Liu Migrate peloton build process to uBuild 2019-02-11 a07a2304 Anant Vyas Remove container spec from controller integration test 2019-02-11 a5596564 Aditya Bhave Support GetAll ORM operation 
2019-02-12 4182abca Aditya Bhave Remove dual writes for task_config 2019-02-12 ad34132f Kevin Xu Add additional checks for "not-found" error in aurorabridge 2019-02-11 43e0793c Anant Vyas Fix peloton tutorial documents 2019-02-06 8ea6de55 Cody Gibb Re-implement GetJobUpdateDetails / GetJobUpdateSummaries w/ rollback handling 2019-02-09 ca4ef844 Min Cai Fix document links in README.md 2019-02-09 32fed88d Min Cai Restructure Peloton documentation 2019-02-06 24ac5c11 Sishi Long Add doc for Cli section 2019-02-07 882ecda4 Kevin Xu Implement thermos executor in aurora bridge update path. 2019-01-29 1c9ba9ae Anant Vyas Add more information about tasks which breach SLAs 2019-02-06 a43dbae3 Cody Gibb Handle new job in GetJobUpdateDiff 2019-01-30 18694fbe Amit Bose Use ORM for mutations to job_index table 2019-02-05 64f40463 Sishi Long Move flag into jobmgr config 2019-02-05 0444d106 Apoorva Jindal Maintain a map of entity version to task count for stateless jobs in job status 2019-02-05 509b3fe2 Sachin Sharma Add integration tests for starting/stopping job with an active update 2019-02-05 e7776351 Varun Gupta Fix host filter for revocable tasks at mimir placement engine 2019-02-05 798ab056 Sishi Long Force cache recalculation if jobRuntimeCalculationViaCache flag is on 2019-02-04 8d6f2ae0 Cody Gibb Support update metadata 2019-02-01 874331fe Zhixin Wen Placement Engine prioritizes host with desired host name 2019-02-01 6ef034e1 Cody Gibb Support new CreateJob API in startJobUpdate 2019-02-04 897372d9 Cody Gibb Only start rollback as paused if awaiting pulse 2019-02-03 95f95548 Amit Bose Provide the correct option to vcluster from perf-compare 2019-02-01 ce24244b Tyra Ho Report integer division fix. 2018-10-29 b29489b9 Amit Bose Remove Uber-specific details from vcluster 2019-02-01 905963d0 Varun Gupta Use opaque data to get job update action 2019-01-31 120825d1 Cody Gibb Fix rollback update spec 2019-02-01 7b2e5108 Kevin Xu Implement TODO fields in AuroraBridge. 
2019-02-01 59fbc9b2 Zhixin Wen Add create control flags in stateless job create API 2019-01-31 bb972332 Sishi Long Modify job config generation test script 2019-01-29 5174378b Kevin Xu Change getTasksWithoutConfigs to include pods from previous run 2019-01-31 cafb3d15 Zhixin Wen Pass in-place update hint between components 2019-01-31 90725ddb Tyra Ho Install Jinja2 package in vcluster. 2019-01-25 b307fa45 Tyra Ho stage 2019-01-31 6e9645f3 Charles Raimbert Add ability to pass Cassandra username & password to `migrate-db-schema` script 2019-01-31 c462624a Aditya Bhave Use secrets.yaml file to read Peloton langley secrets 2019-01-30 0a4a47be Zhixin Wen Job stop would stop new instances added in update 2019-01-31 d4165c94 Zhixin Wen Stateless job creation is processed by update workflow 2019-01-25 c4c04918 Aditya Bhave Allow backfill of active jobs only during resmgr recovery 2019-01-30 4b70e6d9 Sachin Sharma Clear previous contents of maintenanceHostInfoMap when reconciling maintenance stat 2019-01-29 0f459948 Varun Gupta Implement getJobUpdateDetails 2019-01-28 4c368b24 Cody Gibb Add rollback state handling in NewJobUpdateStatus 2019-01-29 25647089 Zhixin Wen Add update lifecycle docs 2019-01-22 90c45c67 Anant Vyas Fix state machine errors when transition is a no-op 2019-01-25 1221ddbc Sachin Sharma Do not set goal state on task initialize 2019-01-28 3ea29aad Anant Vyas [resource manager] Remove default preemption duration from code 2019-01-25 fefca257 Tyra Ho Improve Performance report layout 2019-01-25 d52c5a3f Zhixin Wen Set completion time for all terminated task event 2019-01-24 4d3363b1 Anant Vyas Remove the use of EnqueueGangs from placement engine 2019-01-24 51e6c823 Cody Gibb Implement RollbackJobUpdate 2019-01-24 391d3924 Cody Gibb Factor out aurorabridge concurrency into utility func 2019-01-24 ee2f680b Tyra Ho Cherrypick the env variable set. 
2019-01-24 6de68d1b Tyra Ho Add Flag `task_preemption_period` to Resource Manager 2019-01-24 9b8baada Cody Gibb Enforce update id checks on all update actions 2019-01-23 97202ec7 Kevin Xu Implement GetTierConfigs 2019-01-23 e7b96938 Aditya Bhave Fix error code being returned in Task.Get() 2019-01-23 0b1fee2e Kevin Xu Follow up changes for getTasksWithoutConfigs 2019-01-23 eb68ad51 Cody Gibb Implement KillTasks 2019-01-23 17a9fa17 Sachin Sharma Changelog for release-0.8.1 2019-01-23 0a4a4a56 Zhixin Wen Unify stateless handler non leader handle and error 2019-01-23 2aa2e77c Kevin Xu Fixing merge conflict 2019-01-22 790aa7da Kevin Xu Implement GetJobs 2019-01-22 1c64b119 Cody Gibb Remove partial job key usage 2019-01-22 728ef536 Anant Vyas Remove materialized view `mv_respool_by_owner` 2019-01-22 41d765cc Zhixin Wen Reset hostname when task is reinitialized 2019-01-22 25654bdd Varun Gupta Refactor workflow info & workflow status 2019-01-22 d535f89b Tyra Ho Use `testify.suite` library in `launcher_test.go`. 2019-01-22 6ee67630 Cody Gibb Lazily bootstrap respool in aurorabridge 2018-12-12 253bfb4d Apoorva Jindal Add support for ListPods API 2019-01-18 36ce49c4 Tyra Ho address comments. 2019-01-18 414e007a Cody Gibb Add opaquedata package for simulated unsupported update actions 2019-01-18 b7a7df6d Varun Gupta Add aurorabridge deploy 2019-01-18 feb7f6fc Kevin Xu Implement GetJobSummary 2019-01-17 f174d1b5 Apoorva Jindal Add documentation on how to submit peloton jobs 2019-01-11 a282ef6d Apoorva Jindal Add doc for types of jobs supported in peloton. 
2019-01-18 dfdede8a Sishi Long Get taskStats from cache and add logging/metric on re-calculate job runtime from ca 2019-01-18 7292f51a Varun Gupta Implement getJobUpdateDiff 2019-01-17 4c34f9e3 Varun Gupta Implement getConfigSummary 2019-01-08 c55a5058 Anant Vyas [Resource Manager] Refactor resource tree to not be a singleton 2019-01-17 94c0f3ba Aditya Bhave Fix styling nits from secret formatter code 2019-01-17 1ffe2ca2 Zhixin Wen Fix isJobStateStale 2019-01-16 27947e20 Sachin Sharma Always enqueue ongoing update for stateless jobs 2019-01-16 56f2d658 Sachin Sharma Dump jobmgr logs in integration tests only on test failure 2019-01-16 8778455d Sachin Sharma Fix ConvertPodSpecToTaskConfig to not panic on empty containers 2019-01-15 b4f2967b Sachin Sharma Validate new job spec in JobService.ReplaceJob v1alpha API 2019-01-15 9371322a Cody Gibb Implement pulseJobUpdate w/ TODO 2019-01-15 9df8af43 Tyra Ho Make jobs created by performance tests non-preemptible. 2019-01-15 40f2c2d8 Anant Vyas Refactor placement engine log level settings 2019-01-14 e8b774d6 Sachin Sharma Migrate stateless_job_test/test_job_workflow to v1alpha 2019-01-16 2ec9dc86 Zhixin Wen Use interface methods to publish metrics 2019-01-11 29ee6c1c Amit Bose Job config documentation 2019-01-14 5db88355 Anant Vyas Check invalid tasks when dequeueing for placing 2019-01-15 3a01a7e2 echung Make generte-protobuf.py run on python2 2019-01-14 2baec1bd Aditya Bhave Add scaffolding code to help ORM migration 2019-01-14 b9818e28 Sachin Sharma Add integration tests for v1alpha JobService.DeleteJob 2019-01-15 6de715e2 Sachin Sharma Fix RestartJob - update JobState on restarting KILLED job 2019-01-11 08c91c55 Mayank Bansal Adding architecture section 2019-01-15 8b49c352 Aditya Bhave Add aurora auth support for deploy script 2019-01-14 bddf8722 Varun Gupta Add job update state change events 2019-01-04 4caa243c Sishi Long Update Cli to read config from a well defined set of paths 2019-01-15 96365ff5 Varun Gupta 
Implement getJobUpdateSummaries 2019-01-14 c09fec5c Anant Vyas Second attempt at fixing the placement integration test 2019-01-14 b2e5de6b Cody Gibb Add aurorabridge/fixture documentation 2019-01-14 660c3c7b Cody Gibb Fix integration tests 2019-01-09 0953f392 Amit Bose Implement task/pod termination status 2019-01-11 4cb74f66 Aditya Bhave Add engdocs for Peloton Clients 2019-01-14 eb305f0b Sachin Sharma Address issues in v1alpha JobService.DeleteJob 2019-01-07 2de9c482 Anant Vyas Use SetPlacment API to return failed placements 2019-01-11 545c550b Cody Gibb Make aurorabridge bootstrapper retry until resmgr leader found 2019-01-11 32ee568e Anant Vyas Add docs for job and task lifecycle 2019-01-11 ac59420f Sachin Sharma Add doc for host maintenance 2019-01-11 e16f1c13 Cody Gibb Write aurorabridge integration test client 2019-01-11 fade64f4 Amit Bose Add section for API reference 2019-01-11 b3a5db9a Aditya Bhave Add security engdocs for secrets management feature 2019-01-11 e0513240 Mayank Bansal Adding introduction section and fixing the eng doc format 2019-01-11 0ddff4cd Zhixin Wen Do not persist default config into db if not exist 2019-01-10 209e808a Kevin Xu Implement GetTasksWithoutConfigs 2019-01-09 ab0f40e4 Sachin Sharma Migrate stateless job start/stop to v1alpha 2019-01-10 9ea0f244 Aditya Bhave Fix style nits in ORM code 2019-01-09 09e7cc7a Mayank Bansal Removing rst files and adding md files 2019-01-07 8d269073 Mayank Bansal Renaming pcluster to minicluster 2019-01-09 0337ebe2 Zhixin Wen Fix flaky test_auto_rollback_reduce_instances 2019-01-09 dff44730 Zhixin Wen Fix flaky test__failed_task_throttled_by_exponential_backoff 2019-01-10 71619336 Zhixin Wen Fix err handling for task delete timeout 2019-01-09 253e9486 Anant Vyas Fix integration test for placement engine 2019-01-09 1069c098 Zhixin Wen Restart checks if goal state is terminal 2019-01-03 262c58d3 Amit Bose Add TerminationStatus to task/pod runtime status 2019-01-08 29a1ecaa Min Cai Delete unused 
benchmark configs 2019-01-09 cb2b8e7f Charles Raimbert Adapt building script for new build machines 2019-01-07 6b272e0c Min Cai Move mimir lib to placement/plugins/mimir/lib 2019-01-08 462d208e Tyra Ho Add integration tests for task query. 2019-01-08 04e9257f Kevin Xu Move aurora thrift file into aurorabridge folder 2019-01-07 f12590b8 Zhixin Wen Enqueue job when task delete 2019-01-08 d68d7482 Anant Vyas Add failure message along with reason in integration test 2019-01-08 91bdaa38 Sachin Sharma Set PodSpec.PodName in QueryPodsResponse 2019-01-04 61384ae0 Sachin Sharma Implement JobService.DeleteJob v1alpha API 2019-01-08 1a9f43a8 Sachin Sharma Set PodSpec.PodName in GetPodResponse 2019-01-07 4bd5c955 Sachin Sharma Disable log dump in integration tests on cluster teardown 2019-01-07 c44009cb Sishi Long Fix job update performance test timing issue 2019-01-07 e9211df3 Cody Gibb Clean up aurorabridge labels 2019-01-07 9bf78566 Min Cai Add Apache license to source files 2019-01-07 b3533e99 Min Cai Revert "Fix command print." 2019-01-07 f55122ed Anant Vyas Update integ test framework to show failure reason when a job fails 2019-01-07 c689e4ca Kevin Xu Implement get job ids from aurora task query 2019-01-04 e67475bd Cody Gibb Implement AbortJobUpdate 2019-01-03 6d35f139 Amit Bose Keep more logs for Peloton components in vcluster 2019-01-03 c75abf22 Min Cai Change git repo name to github.com/uber/peloton 2019-01-04 8c9ca2e7 Zhixin Wen Migrate tests in stateless_job_test/test_update.py to v1 alpha API
2019-01-03 7b6750c2 Cody Gibb Implement ResumeJobUpdate
2018-12-07 8cde3509 Anant Vyas Add go runtime metrics
2019-01-03 3395d8e6 Amit Bose Improve logging in perf tests
2019-01-03 e4ec40b0 Cody Gibb Implement PauseJobUpdate
2019-01-02 92b09635 Varun Gupta Add changelog 0.8.0
2019-01-02 67a191b2 Cody Gibb Implement startJobUpdate
2019-01-02 32eab420 Varun Gupta Add GetJobUpdate API
2019-01-03 93c5cb59 Anant Vyas Update API docs
2018-12-29 294941d7 Sachin Sharma Implement JobService.StartJob v1alpha API
2019-01-02 7eb8df04 Anant Vyas Fix SLA tracking for tasks in resmgr
2018-12-06 fd67c58a Varun Gupta Update to token aware host policy with dc aware round robin
2018-12-04 debcc290 Anant Vyas Add engdocs for preemption
2019-01-02 04bc2fa6 Varun Gupta Fix revocable integ test typo error
2018-12-28 3980b200 Sachin Sharma Implement JobService.QueryPods v1alpha API
2018-12-31 b878c77a Amit Bose Ensure vcluster is cleaned-up when perf test fails
2018-12-31 fe27fe77 Amit Bose Improve vcluster error handling
2018-12-31 e91e858f Amit Bose vcluster changes for stateless jobs
2018-12-31 eeb0e4b6 Zhixin Wen Migrate tests in stateless_job_test/test_job_revocable.py to v1 alpha API
2018-12-31 ff9ad893 Aditya Bhave Fix unit test for orm client
2018-12-31 295b846b Zhixin Wen Migrate tests in stateless_job_test/test_job.py to v1 alpha API
2018-12-19 152a7e5f Anant Vyas Update golang version to 1.11.4
2018-12-28 d24cbbb3 Zhixin Wen Migrate part of stateless job integration tests to v1 alpha api
2018-12-27 d426a571 Aditya Bhave Implement ORM update functionality
2018-12-28 2598992d Zhixin Wen Implement ListJobUpdates
2018-12-27 0daee53b Anant Vyas Add integration test for task level preemption
2018-12-27 79429fd2 Aditya Bhave Fix review comments in ORM delete path
2018-12-27 f1f404c7 Kevin Xu Wire up ExecutorInfo in ContainerSpec when calling v1 Create/ReplaceJob api
2018-12-26 29dbd509 Aditya Bhave Implement ORM Delete functionality
2018-12-18 aefecefc Varun Gupta Fix job cleanup for revocable integ test
2018-12-26 e693c531 Varun Gupta Add implementation to read/write workflow events per instance
2018-12-26 b0eb79c0 Anant Vyas Add sla tracking config to deploy script
2018-12-17 5d35df8e Aditya Bhave Get task configs from task_config_v2 first
2018-12-26 ec23c07b Varun Gupta Fix update-protobuf gens
2018-12-14 c0272452 Sishi Long Remove -u flag from packr installation
2018-12-23 d17d1de1 Zhixin Wen Implement job restart
2018-12-13 8c93b623 Aditya Bhave Part III: Setup recovery from active_jobs table
2018-12-21 168c9e3d Zhixin Wen Implement JobSVC.Stop
2018-12-20 2f7d7672 Sachin Sharma Add GetJob CLI and unit tests
2018-12-19 04ca9f64 Sachin Sharma Implement JobService.CreateJob API
2018-12-18 6738697b Kevin Xu Support Thermos Executor in Peloton
2018-12-19 9404b95b Cody Gibb Re-generate Aurora thrift bindings with thriftrw dev branch
2018-12-18 512ee164 Sachin Sharma GetPod should populate the right PodSpec for previous runs
2018-12-10 bfedd51c Apoorva Jindal Add support to store opaque data along with an update
2018-12-18 cdecdf40 Varun Gupta Add handler implementation for job name to job id
2018-12-14 bd16760c Anant Vyas Use `PREEMPTING` as goal state in job manager for batch
2018-12-18 a6798f77 Amit Bose Honor retry policy for lost tasks
2018-12-18 f607ba36 Add PERF_version.no_(BASE CURRENT) to Performance Report table.
2018-12-17 1d7d8447 Varun Gupta Generate yarpc binding for aurora thrift api
2018-12-14 753420f7 Zhixin Wen Ignore UNITILIAZED job without config upon recovery
2018-12-17 a8c0d161 Tyra Ho Fix performance test version issues.
2018-11-30 cfa1084f Apoorva Jindal Use deadline queue in the goal state engine as the queue serving the worker pool
2018-12-12 030b4274 Apoorva Jindal Add ReplaceJobDiff API to find the instances being added/removed/updated in a repla
2018-12-16 e1ad1a8b Aditya Bhave Fix archiver flag, and context timeout issue
2018-12-13 147a9b70 Varun Gupta Add get job id from job name interface definition
2018-11-14 86450c19 Anant Vyas Add support for task level preemption
2018-12-13 7a7a3ad7 Sishi Long Feature flag to calculate job runtime from cache if MV diverged
2018-12-14 1a34bed1 Aditya Bhave Manually patch glide.lock to correct commit id for go-internal
2018-12-05 f20da348 Apoorva Jindal Implement v1alpha ListJobs streaming API
2018-12-11 dca99115 Varun Gupta Add znode for peloton-aurora-bridge to mock Aurora Leader
2018-12-13 618a36b8 Aditya Bhave Change archiver stream_only_mode flag default value
2018-12-13 c14524e7 Cody Gibb Upgrade testify to 1.2.0
2018-12-12 4b55a45b Aditya Bhave Part II: Recover jobs from active_jobs table
2018-12-13 5e95265e Zhixin Wen Add pause/resume/abort job workflow
2018-12-12 e086aa63 Sishi Long Get zk configuration file into go binary
2018-12-12 5b9d89e7 Tyra Ho In Perf tests, Wait 30 Sec Before Creating Respool.
2018-11-30 fbcf28a8 Aditya Bhave Add task_config_v2 table partitioned by job_id, version, instance_id
2018-12-10 6812befb Zhixin Wen Implement stateless service QueryJobs
2018-12-10 89475362 Aditya Bhave Fix bin_packing setting in deploy script
2018-12-10 cdd0ab8a Aditya Bhave Handle hostmgr bin packing flag in deploy script
2018-12-10 9805bb17 Sachin Sharma Create task configs in parallel
2018-12-07 72d8b8a2 Zhixin Wen Add ReplaceJob handler
2018-12-07 d6223d36 Cody Gibb Scaffold aurorabridge daemon
2018-11-30 4bf27ae7 Anant Vyas Add short term fixes to improve performance of SLA tracking
2018-12-07 7df71bce Tyra Ho Add two tables (job.Get and job.Update) to Peloton performance report.
2018-12-06 6984e272 Varun Gupta Add storage APIs for job id from job name
2018-12-05 3af6a39c Apoorva Jindal Implement v1alpha GetJob API
2018-12-03 a64ac409 Sachin Sharma Add PodService.DeletePodEvents handler
2018-12-04 d53ff103 Zhixin Wen Add entity version validation to workflow operations
2018-12-03 3924ac2b Apoorva Jindal Add APIs to fetch workflow information and events
2018-12-03 2d44d9fb Kevin Xu Add CLI support to list all the available hosts
2018-12-03 384188ab Gautam Kulkarni Enhance optional arguments for customized pcluster usage
2018-12-03 f4261841 Anant Vyas Add changelog for 0.7.8.1
2018-11-30 2120bcf3 Apoorva Jindal Support starting an update in paused state
2018-12-03 d4bc38aa Anant Vyas Disable bin packing by default
2018-10-25 6d2b291a Anant Vyas Move Requeue of un-placed tasks inside rmtask
2018-11-30 70bdd7d2 Sishi Long Create performance test for batch job update
2018-11-30 fa38ea0b Zhixin Wen Reenable job factory metrics
2018-11-30 eeaf0424 Zhixin Wen Fix data race when job factory publish metrics
2018-11-30 663acdf2 Zhixin Wen Refactor JobMgr cache to guard workflow cache by job cache lock
2018-08-17 0eaa6fe9 Aditya Bhave Storage Architecture v2 first cut
2018-11-30 63cc2530 Varun Gupta Add job_name to job_id mapping
2018-11-29 229f8cbe Tyra Ho [Refactor] Job's wait_for_condition method
2018-11-21 18b1a1d3 Aditya Bhave Part I: Use active_jobs table instead of mv_jobs_by_state for recovery
2018-11-29 e081786e Anant Vyas Add changelog for 0.7.8
2018-11-27 d9d14d3c Apoorva Jindal Update job index when job configuration is updated
2018-11-28 893702fe Varun Gupta Revert "Constraint job and task configurations at DB"
2018-11-28 db290ca2 Varun Gupta Revert "Converge recent job config version for all batch tasks"
2018-11-28 134ab093 Zhixin Wen Move entity version construct into util
2018-11-28 755de7e8 Apoorva Jindal Do not untrack stateless jobs from the cache
2018-11-28 34c04073 Varun Gupta Revert "Batch job of INITIALIZED state can be updated"
2018-11-28 ae40dc68 Varun Gupta Revert "Reenable job factory metrics"
2018-11-28 a22c5502 Sachin Sharma Add PodService.GetPod handler
2018-11-21 e8da2482 Mayank Bansal Adding asynchronous bin pacing computation support and enable DEFRAG bin packing al
2018-11-28 02995869 Sachin Sharma Add PodService.PodStop handler
2018-11-27 91bc8cd4 Zhixin Wen Set agent id to pointer of empty string on task launch failure
2018-11-27 adb5bfe5 Varun Gupta Add actual tasks allocation for Mesos
2018-11-20 98842aac Apoorva Jindal Add integration test for manual rollback
2018-11-21 a115b2d8 Zhixin Wen reduce cachedjob.AddTask lock contention
2018-11-20 e3edc697 Zhixin Wen Use convertTaskStateToPodState to convert task state
2018-11-20 f56eb83d Apoorva Jindal Image is reported in the container status, hence it is not needed in pod summary as
2018-11-20 cb88cea3 Varun Gupta Deprecate write/read path for task_state_changes table
2018-11-07 0f24e1c6 Apoorva Jindal Add API for multiple containers in a pod.
2018-11-16 8f992511 Sachin Sharma Fix TaskSVC.GetPodEvents to return events for all runs
2018-11-16 cfdc67f5 Sachin Sharma Add PodService.RestartPod handler
2018-11-15 8e8df63e Sachin Sharma Add PodService.BrowsePodSandbox handler
2018-11-15 33b603ac Zhixin Wen Add JobSVC.RefreshJob
2018-11-13 10e3b7ef Zhixin Wen Add JobSvc.GetJobCache
2018-11-14 951491ba Sachin Sharma Push images to Kraken
2018-11-13 f6458644 Varun Gupta Fix data race issues with respool interface
2018-11-13 1ee25bab Sishi Long Lookup zk info by providing a cluster name
2018-11-13 7eae3b8e Varun Gupta Update respool doc with revocable tasks
2018-11-13 f60f9d9b Zhixin Wen Fix JobSvc.Create error when the previous create call fails
2018-11-12 7adb87ff Varun Gupta Dump host manager event stream via CLI
2018-11-12 5cfd265e Apoorva Jindal Handle instances with different desired config than their current config
2018-11-06 b1379f9b Aditya Bhave Add stress test for job create/get to vcluster
2018-11-12 bae312ac Zhixin Wen Add PodSvc.StartPod
2018-10-19 7644e2a5 Anant Vyas Add reason for failure in placement engine
2018-10-29 e4aa296c Aditya Bhave Compress job config before storing it to DB
2018-11-12 7e0c0fef Anant Vyas Increase runtime of non-preemptible test job
2018-11-08 f7fd195f Apoorva Jindal Do not unset the job update identifier after update is complete
2018-11-08 29a80295 Sachin Sharma Add CLI command to kill all jobs owned by a given owner
2018-11-07 07ff4a93 Zhixin Wen Add PodSVC.RefreshPod
2018-10-31 a4395763 Amit Bose v1alpha: Define APIs for watching for changes
2018-11-07 1f8e13fa Varun Gupta Add changelog for 0.7.7.3
2018-11-07 4cfc3a80 Sachin Sharma Add GetPodEvents API handler
2018-10-23 29028fc9 Apoorva Jindal Handle race conditions during consecutive updates
2018-11-07 190bf803 Sachin Sharma Fix deadlock in resmgr/respool/respool.go
2018-11-06 5396e373 Zhixin Wen Fix flaky update integration test
2018-11-06 00d5c85a Zhixin Wen Implement PodSVC.GetPodCache
2018-11-05 f464928a Zhixin Wen Add ConvertToYARPCError for pod service handler
2018-11-02 65f34028 Zhixin Wen Add integration test for pause and resume update
2018-08-23 a8fe8539 Amit Bose Framework for failure testing
2018-11-02 0baf8b9a Zhixin Wen Fix update stuck when instances failed
2018-10-30 f09ddc46 Amit Bose v1alpha Add APIs to get all jobs/tasks as a stream
2018-11-01 a9bbca01 Zhixin Wen Add integration tests for update and health check
2018-11-01 46b8f4e3 Sishi Long Add sorting by instanceId, name, host, reason for task query
2018-11-01 a284ad5a Varun Gupta Dedupe pending task status update acknowledgement
2018-11-01 89bdbd1b Sachin Sharma Change log level when field not found in object (FillObject)
2018-11-01 4c3e9b86 Varun Gupta Remove test__host_limit integration test from smoke test list
2018-10-31 9554e3f6 Sachin Sharma Add dummy PodService V1 Alpha API handler
2018-10-31 2989e287 Zhixin Wen Fix healthState when healthcheck is not enabled
2018-10-31 9f8948c6 Zhixin Wen Task uses new task config when reinitialized
2018-10-31 52f0d93d Sachin Sharma Add dummy JobService V1 Alpha API handler
2018-10-30 e41957e1 Varun Gupta Fix preemption for revocable tasks
2018-10-30 8fd867c3 Apoorva Jindal Fix v1aplha API by replacing stateless_job with stateless.
2018-10-29 93425dda Varun Gupta Calculate slack entitlement for non-revocable resources from main entitlement
2018-10-12 ead07e19 Apoorva Jindal Add peloton v1alpha API for stateless jobs.
2018-10-26 ec29cbc4 Anant Vyas Fix preemption integration test and add it to default tag
2018-10-25 6ba73f0f Varun Gupta Add metrics by tasks state at event stream
2018-10-26 4016190e Apoorva Jindal If cleanup of one update fails, continue cleaning up the remaining
2018-10-25 d15da881 Apoorva Jindal DELETED goal state cannot be overwritten unless a configuration change is requested
2018-10-25 85cdc696 Zhixin Wen Implement pause and resume for an update
2018-10-25 1b39673c Charles Raimbert Add changelog for 0.7.7.2
2018-10-24 d27697b9 Zhixin Wen Refactor event status update for clarity
2018-10-24 cfbfb90e Zhixin Wen Batch job of INITIALIZED state can be updated
2018-10-24 c04669b2 Varun Gupta Add integration test for stateless revocable job
2018-10-24 7b02b7a3 Anant Vyas Add test for cli output
2018-10-22 8e6f7f65 Apoorva Jindal Implement reducing instance count using DELETED task goal state
2018-10-22 06b9fe7a Sachin Sharma Remove DRAINING hosts from Cluster Capacity calculation
2018-10-23 cc6abbcb Zhixin Wen Add unit test for update
2018-10-22 ae024498 Charles Raimbert Add Changelog for 0.7.7.1
2018-10-17 567375aa Apoorva Jindal Ignore start request for non-running tasks
2018-10-22 cbd95ccd Zhixin Wen Rollback update on failure
2018-10-08 7609789a Anant Vyas Use `HostOfferID` when launching batch and stateless tasks
2018-10-19 f2d4c00c Apoorva Jindal Increase goal state unit test coverage to > 90%
2018-10-22 9b152fb2 Varun Gupta Revert "Restrict the maximum number of updates allowed per job"
2018-10-22 4ff0afe1 Zhixin Wen Add exponential backoff for fail retry
2018-10-19 5af4e111 Sachin Sharma Maintain all host states in Peloton
2018-08-29 972ad664 Varun Gupta Constraint job and task configurations at DB
2018-10-17 3483b06c Aditya Bhave Changelog for 0.7.7 release
2018-10-17 a49299c0 Xiaojian Huang Get rid of glide dependency and fix peloton cli build
2018-10-17 911d6e93 Charles Raimbert Rename label key for system label resource pool
2018-10-16 a8055157 Varun Gupta Converge recent job config version for all batch tasks
2018-10-16 e0d9a0e0 Aditya Bhave Add KAFKA_TOPIC env var to archiver deploy script
2018-10-16 48c29a85 Aditya Bhave Add filebeat_topic to archiver logs
2018-10-16 45d30acb Varun Gupta Make enable revocable resources env as string for vCluster
2018-10-15 3c811458 Anant Vyas Update golint repo path
2018-10-15 857fd802 Sachin Sharma Fix race condition in `RecoverJobsByState`
2018-10-16 f928a335 Zhixin Wen remove test__create_multiple_consecutive_updates temporarily
2018-10-12 875cac28 Varun Gupta Cleanup metrics for Host Manager
2018-10-14 678cc7e1 Apoorva Jindal Not finding the entity in the map after dequeue is not an error
2018-10-11 3eca3ec7 Amit Bose Make it easier to mount local binaries in pcluster
2018-10-12 177c4f6a Varun Gupta Expose revocable resources attribute as env variable
2018-10-12 a23e924d Varun Gupta Aggregate non-leaf resource pool queues size for metrics
2018-10-12 fe212afb Sachin Sharma Modify FillObject to fill an object having fewer fields than that in data
2018-10-12 d0614732 Apoorva Jindal Add v1alpha directory as a copy of v0 api.
2018-10-10 86049071 Amit Bose Clean-up resource-manager tasks stuck in LAUNCHING
2018-10-11 63a9d544 Amit Bose Fix go get for lint
2018-10-10 87c10947 Varun Gupta Validate that revocable job is set preemptible
2018-10-08 588c2068 Zhixin Wen Carry on update progress when task will not be started
2018-10-09 5c95df2a Varun Gupta Skip flaky unit test TestAddRunningTasks
2018-10-08 70364d5f Varun Gupta Add admission control for revocable/slack resources
2018-10-04 e6cd9b74 Sachin Sharma Construct system labels based on new config for Job Update
2018-10-05 de5415c8 Zhixin Wen UpdateStatus includes instances failed
2018-10-04 03d234e0 Sachin Sharma QueryHosts API returns all hosts if request empty
2018-10-04 7747b4c3 Sachin Sharma Remove client-added system labels at the time of job creation
2018-10-04 5d84423c Zhixin Wen Reenable job factory metrics
2018-10-04 06329911 Zhixin Wen Fix InitJobFactory in unit test
2018-10-03 1704b9ba Zhixin Wen Mark update failed if more instances failed during update than configured
2018-09-28 6d2109f0 Amit Bose Add a listener to jobmgr cache
2018-10-03 f5a904e1 Varun Gupta Ignore same status for healthy update
2018-10-02 1cb41fab Zhixin Wen Reduce jobFactory lock contention
2018-10-02 66a0ae3b Zhixin Wen Batch job and stateless job share the same jobStateDeterminer
2018-10-02 5041d016 Apoorva Jindal Restrict the maximum number of updates allowed per job
2018-09-19 893db3e0 Amit Bose Put git commit and build-url in performance test report
2018-09-28 06f0dad1 Anant Vyas Add `HostOfferID` to placements
2018-09-27 a9c754c3 Varun Gupta Add entitlement calculation for revocable/slack resources
2018-10-01 0515c5f4 Varun Gupta Use previous run id from persistent storage on adding shrink instances
2018-10-01 4fb5a0dd Anant Vyas Disable recording state transition durations for `rmTask`
2018-10-01 3cc6b8e5 Varun Gupta Filter pod events by RunID
2018-10-01 f68fe21a Zhixin Wen Comment out runPublishMetrics for job factory
2018-09-28 9ed6d664 Sachin Sharma Add system labels to peloton tasks
2018-09-20 85ff89b4 Anant Vyas Use mimir for `Stateless` task types
2018-09-27 53523e92 Varun Gupta Remove mock datastore test from make unit-test
2018-09-27 5868de05 Varun Gupta Fix flaky unit test TestEngineAssignPortsFromMultipleRanges
2018-09-26 e6901567 Anant Vyas Fix flaky unit test for host summary match filter
2018-09-26 62fe0a69 Sishi Long Add get Peloton Version end point
2018-09-26 f9b7cc65 Varun Gupta Skip flaky unit test for host summary match filter
2018-09-26 b7f193e4 Aditya Bhave Changelog for release 0.7.6
2018-09-26 e72fee79 Zhixin Wen Fix merge error
2018-09-25 7cc62e9d Zhixin Wen JobMgr recovers jobs by job type
2018-09-25 b5a0380f Zhixin Wen Add interation test for stateless job API and fix TaskTerminatedRetry
2018-09-25 789d23e9 Mayank Bansal Disabling Dfrag bin packing
2018-09-25 ebfbd1b4 Zhixin Wen Do not emit error metrics if enqueue gangs fails due to gang already exists
2018-09-13 46138b58 Anant Vyas Add `HostOfferID` to offers from host manager
2018-09-19 63ea4b27 Varun Gupta Add parallelism to add offers by host.
2018-09-21 f97dce10 Zhixin Wen Fix CompareAndSetRuntime build error
2018-09-21 5c55b0db Varun Gupta change cassandra version from 3.9 -> 3.0.14 for local container
2018-09-19 d46222fd Varun Gupta Fix flaky unit test to parse instanceID
2018-09-20 f0509b5e Apoorva Jindal Do not abort existing update in the update svc handler
2018-09-20 42f68372 Apoorva Jindal Set update identifier to nil in job runtime while untracking the update
2018-09-20 242f8a9f Casper Kejlberg-Rasmussen Bump Mimir-Lib
2018-09-20 f0520e24 Casper Kejlberg-Rasmussen Fix command print.
2018-09-20 d4f089f0 Zhixin Wen Implement restart overwrite strategy
2018-09-14 2f917113 Apoorva Jindal Tasks in a terminated batch job cannot be restarted.
2018-09-20 e98e9531 Casper Kejlberg-Rasmussen Fix test setup for a local linux dev laptop.
2018-09-20 67aa83df Varun Gupta Delete Pod Events on Job Delete
2018-08-31 98f4e25e Anant Vyas Add metrics for task state transitions
2018-09-19 7e65b7d6 Varun Gupta Add env variable to enable pod events cleanup
2018-09-19 5873139e Zhixin Wen Add CompareAndSetRuntime to support job write after read
2018-09-18 f23dee8b Apoorva Jindal Fix the build and tests for terminate-retry
2018-09-13 a01839c8 Apoorva Jindal Add support for handling tasks with id greater than instance count
2018-09-17 58c5cc8c Aditya Bhave Improve storage unit test coverage
2018-09-17 1cd8cee8 Mayank Bansal Removing empty test file from YARPC package
2018-09-13 eec2ff96 Chunyang Shen Retry on terminated long running task
2018-09-17 11c5bf49 Varun Gupta Add constraint pod events in Archiver
2018-09-14 e84e2f81 Mayank Bansal Adding Defragmentation bin packing suppport for host manager
2018-09-13 9efed891 Zhixin Wen Move RuntimeDiff and JobConfig from cached to models
2018-09-10 e3c630dd Sachin Sharma Add integration tests for host maintenance
2018-09-12 4702e253 Zhixin Wen Clean up updates and other ops separately
2018-09-13 47778ea7 Zhixin Wen Add job level start and stop support
2018-09-12 abd98dca Zhixin Wen Add changelog for 0.7.5.1 and 0.7.5.2
2018-09-11 7e919e51 Sishi Long Update debug commands and remove pressure test
2018-09-11 22cdbd32 Zhixin Wen Add pod events handle empty desired mesos task id
2018-09-11 a434ec1e Zhixin Wen Fix job runtime updater logged job state
2018-09-10 84d72791 Apoorva Jindal Dump logs when goal state action execute returns an error
2018-09-11 49a0bb1e Anant Vyas Change logrus version to `1.0.0`
2018-09-07 ca5d54df Apoorva Jindal Add support to reduce instance count during an update
2018-09-10 bc9dde0e Aditya Bhave Fix bug in reading job_index table
2018-09-07 27bd512a Apoorva Jindal Desired mesos task id should not be nil when comparing
2018-09-05 a2089b78 Varun Gupta Add parallelism to calculate cluster capacity
2018-09-08 e7e188cd Varun Gupta Add revocable task launch support to host manager
2018-09-07 1669dead Sachin Sharma Set drainer period for development environment
2018-09-06 a741d027 Sachin Sharma Set hostname field in resmgr tasks only if relevant
2018-09-06 959f1ae3 Sachin Sharma Add override url for hostmgr in Peloton Client
2018-09-06 16337918 Zhixin Wen Implement job level Restart
2018-09-04 e445f9c3 Aditya Bhave Reconcile stale job index entries
2018-09-05 6c7ded5e Zhixin Wen Refactor update to support different workflow
2018-08-16 38b12739 Chunyang Shen Set proper health state for task whenever it re-initiate
2018-08-22 89f6f983 Apoorva Jindal Add integration tests for update of a stateless job
2018-08-23 b57babe5 Aditya Bhave Reconcile active jobs which have out of sync mv_task_by_state entry
2018-08-31 c897f141 Zhixin Wen Add job restart proto and cli
2018-08-31 4ebb9a5d Charles Raimbert Add Changelog for 0.7.5 release
2018-08-28 20ebd749 Anant Vyas Refactor preemption package to remove singleton and add tests
2018-08-24 e690de6f Amit Bose Use slave agent IP for fetching logs
2018-08-27 5b5070dc Tengfei Mu Remove unused function in reservation
2018-08-28 e8737c86 Zhixin Wen Use desiredMesosTaskID to replace PREEMPTING state for task preemption
2018-08-28 27cb406c Sachin Sharma Update CLI help text for host maintenance commands
2018-08-27 3e4eda9c Anant Vyas Fix flaky `TestPeriodicCalculationWhenStarted`
2018-08-27 dc9cc781 Zhixin Wen addPodEvent set prevRunID to 0 when PrevMesosTaskId not provided
2018-08-22 795082a7 Anant Vyas Revert "Measure resource pool SLA at a more granular level"
2018-08-22 3f7137db Apoorva Jindal Allow user to provide configuration version when creating a job update in CLI
2018-08-27 1afc3461 Sachin Sharma Auto-refresh maintenance queue periodically
2018-08-27 652df32b Zhixin Wen Refactor mesos task ID generation to CreateMesosTaskID
2018-08-25 4834030e Sachin Sharma Specify only hostname (no IP) for host maintenance
2018-08-23 f9d53386 Zhixin Wen Add desired mesos task id for task and pod events
2018-08-22 fc985aa2 Sachin Sharma Add support to query hosts in UP state
2018-08-23 7156d165 Sachin Sharma Modify AgentMap to have hostname as the key instead of AgentId
2018-08-23 c63e39b8 Sachin Sharma Modify AgentInfoMap to contain more information about the Agent
2018-08-23 f30e5ccb Zhixin Wen Fix job metrics name typo
2018-08-22 1c876206 Zhixin Wen CLI default range.To to MaxInt32
2018-08-21 3ebddc48 Zhixin Wen KILLING state with SUCCEEDED/RUNNING goal state is valid
2018-08-22 ae2be4ed Zhixin Wen Implement task level restart
2018-08-21 2bf1de9d Zhixin Wen Add desired mesos task id in task runtime
2018-08-20 5105b730 Apoorva Jindal Automatically set the configuration version during job update
2018-08-21 14367f4b Sachin Sharma Fix resmgr recovery
2018-08-17 efc54013 Aditya Bhave Cleanup unused storage code, add tests
2018-08-20 7b4f3117 Zhixin Wen WriteProgress validates state change
2018-08-20 584f3b8f Mayank Bansal Adding unit tests in common and statemachine package
2018-08-20 99e1d3f9 Mayank Bansal Fixing flaky unit test
2018-08-16 21ba0564 Apoorva Jindal Adding unit tests for peloton/cli
2018-08-20 69be04f9 Mayank Bansal Adding unit tests for resource manager handlers
2018-08-17 b3f02d38 Chunyang Shen Fix testjob_uber_docker_service
2018-08-20 6b094825 Varun Gupta Add changelog for release 0.7.4
2018-08-20 d7d7d533 Sachin Sharma Fix 'context' of RestoreMaintenanceQueue call in resmgr/recovery.go
2018-08-15 fce6aa8a Anant Vyas Instantiate `finished` channel outside of `NewRecovery`
2018-08-20 f40169dc Charles Raimbert Allow to disable Prometheus while maintaining metric name format
2018-08-09 033a25c8 Chunyang Shen Integrate health check with update process
2018-08-17 345315cc Anant Vyas Increase the memlimit for peloton apps on vcluster
2018-08-16 441d8f85 Zhixin Wen Change "total instance count is greater than expected" to debug
2018-08-15 25158e1d Zhixin Wen Fix update integration test
2018-08-13 51bb605a Sachin Sharma Make `MarkHostsDrained` API call only for 'DRAINING' hosts
2018-08-14 5692da50 Amit Bose Unit tests for yarpc/peer
2018-08-13 bdec6d67 Aditya Bhave Remove executor code from peloton because it is not used
2018-08-14 469ca336 Varun Gupta Update PodEvent protobuf to use mesosTaskID rather pelotonTaskID
2018-08-14 917a3dfc Mayank Bansal Adding tests for resource manager recovery
2018-08-14 38bcda89 Anant Vyas Update logrus version to `^1.0.0`
2018-08-14 9e90a221 Anant Vyas Disable prometheus reporting for resource manager
2018-08-14 ee86a51a Zhixin Wen Add integration test for update
2018-08-13 96e5e498 Aditya Bhave Enable archiver streaming only mode, add unit tests
2018-08-13 7913fe48 Mayank Bansal Removing interfaces from hostmanager reserver and cleanup code
2018-08-01 86fe3b74 Aditya Bhave [API] Set task kill grace period per each task
2018-08-08 de113e5b Zhixin Wen UpdateRun only checks currently updating instances
2018-08-13 c779fb5c Sachin Sharma Restore host->tasks map on resmgr recovery
2018-08-08 84e43889 Varun Gupta Dual write state transitions to pod_events and task_events for batch jobs
2018-08-09 beedd45e Zhixin Wen Check nil for cached job
2018-08-09 535999e5 Anant Vyas Measure resource pool SLA at a more granular level
2018-08-06 3708f611 Chunyang Shen Add UNKNOWN and DISABLED into health state and persist health state into pod_events
2018-08-07 3fef5430 Amit Bose Add unit-tests for package util
2018-08-07 9cdf7d8c Amit Bose Add unit-tests for package common/background
2018-08-08 1efe9f0c Zhixin Wen Terminated task with correct desired configuration is update complete
2018-08-08 9716d32f Sachin Sharma Restore host->tasks map on resmgr recovery
2018-08-06 c26a63d9 Anant Vyas Add tests for `resmgr/server.go`
2018-06-17 5676646e Varun Gupta Improve code coverage for common/logging
2018-07-30 90d32457 Chunyang Shen Update the task healthy field only when event reason is REASON_TASK_HEALTH_CHECK_ST
2018-08-03 33c8fedb Amit Bose Make vcluster run on dca1-preprod01
2018-08-02 637f14ee Zhixin Wen Add unit test for tasksvc start and stop
2018-07-24 b894bd1c Anant Vyas Add tests for reconciler
2018-08-03 6930d70c Anant Vyas Revert "Revert "Record state transition durations for RMTask""
2018-08-02 3d31c184 Chunyang Shen Add comments for test coverage of JobMgr
2018-07-31 3f07401b Chunyang Shen Add job configuration validation for different type of job
2018-08-02 d9ae280e Varun Gupta Add CLI to get Pod Events
2018-08-02 2de98f4d Sachin Sharma Increase code coverage for hostmgr/reconcile
2018-08-02 82a1770f Zhixin Wen JobRuntimeUpdater treats job with update as non-paritally-created job
2018-08-02 8401ec32 Zhixin Wen Set task goal state to KILLED for terminated task
2018-07-31 eb9956cc Apoorva Jindal Handle errors during update create
2018-08-02 091a8e34 Zhixin Wen Add unit test for tasksvc get
2018-07-31 3d502343 Zhixin Wen Add unit test for jobmgr/tak/placement
2018-08-01 fbe41670 Zhixin Wen Add unit test for volumesvc
2018-08-02 472fda48 Zhixin Wen Add unit test for tasksvc get events
2018-08-01 95786d39 Sachin Sharma Increase code coverage for hostmgr/
2018-07-31 15908189 Sachin Sharma Increase code coverage for hostmgr/hostsvc
2018-07-31 d6975fa4 Sachin Sharma Increase code coverage for hostmgr/queue
2018-07-31 ffde21c7 Zhixin Wen Implement update pause on server side
2018-07-31 106d92f3 Apoorva Jindal Do not start goal state engines and preemptor before recovery is complete
2018-07-31 3fd8f513 Aditya Bhave Fix resource usage map panic for older tasks
2018-07-27 c22dcee0 Zhixin Wen Add update pause in cli
2018-07-30 d482ccfe Apoorva Jindal Update changelog for 0.7.3 release
2018-07-30 ba36ffce Varun Gupta Add tests for hostmgr/task
2018-07-26 e2b779a2 Zhixin Wen Enqueue initialized tasks being updated into goal state engine
2018-07-30 95b84820 Zhixin Wen Reenqueue update if any task updated in UpdateRun is KILLED
2018-07-27 3f1cf634 Varun Gupta Add tests for peloton/placement
2018-07-26 e96f1b59 Amit Bose Stop asyncEventProcessor on statusUpdate stop
2018-07-24 e376004f Chunyang Shen Add test cover for jobmgr/task/event/statechanges
2018-07-26 5e8fe7e1 Chunyang Shen Add test cover for jobmgr/logmanager
2018-07-28 9646bf2f Varun Gupta Remove MySQL as Storage for interim
2018-07-27 6443cc3b Varun Gupta Add tests for placement/tasks
2018-07-27 1121e55a Varun Gupta Add tests for placement/models
2018-07-27 9f9ddb71 Zhixin Wen Merge instance config when update config
2018-07-26 1a535a35 Varun Gupta Add tests for placement/offers
2018-07-26 7aaaf96a Aditya Bhave Change gocql version in glide.yaml
2018-07-11 317d5e2f Sachin Sharma Add host related CLI commands
2018-07-25 db611863 Zhixin Wen Implement update with batch size
2018-07-26 4ec15b89 Anant Vyas Fix potential deadlock case in `AggregatedChildrenReservations`
2018-07-26 cfd609fb Mayank Bansal Implementing Host Manager's ReserveHosts and GetCompletedReservations api's in plac
2018-07-24 d9fa8604 Chunyang Shen Add test cover for jobmgr/task/deadline
2018-07-25 7f22e47f Chunyang Shen Add test cover for jobmgr/goalstate
2018-07-19 cf755e6e Sachin Sharma Add Drainer to reschedule tasks placed on draining hosts
2018-07-19 d299bd3f Sachin Sharma Add Drainer to reschedule tasks placed on draining hosts
2018-07-23 93febbeb Aditya Bhave Handle resource usage calculation during jobmgr recovery
2018-07-24 566b94a8 Zhixin Wen Task and job return yarpc not found error in store
2018-07-13 8c33be91 Anant Vyas Fix the deadlock in task state machine and tracker
2018-07-20 73fe6a84 Sachin Sharma Modify Start() and Stop() to use common/lifecycle
2018-07-24 0c8e3fb2 Mayank Bansal Fixing errors, consistent locking in code cleanup in placement engine, resmgr and H
2018-07-23 206d7bfa Chunyang Shen Fix building Peloton-CLI debian package
2018-07-24 07240c9f Chunyang Shen Persist the health check result into the task runtime
2018-07-05 97c76252 Aditya Bhave Add resource usage stats to job and task runtime info
2018-07-20 c464b138 Anant Vyas Revert "Record state transition durations for RMTask"
2018-07-19 38368bc2 Mayank Bansal Adding support in placement engine for processing completed reservation and create
2018-07-18 e4e58fa6 Chunyang Shen Fix `Less` function in task sorting by `creation_time`
2018-06-19 3464dff2 Aditya Bhave Add unit tests for jobmgr task util
2018-07-19 146fae0f Mayank Bansal Refactor testutil package and add tests for it
2018-07-20 80e39cd5 Zhixin Wen Recover Update upon JobMgr restart
2018-07-19 de40941a Apoorva Jindal Allow providing resource pool path in update create CLI
2018-07-19 78b68786 Apoorva Jindal Add support for AbortUpdate API
2018-07-19 1a8ce6e0 Apoorva Jindal Move the changelog version increment and validation to the cache
2018-07-19 f79b8338 Zhixin Wen Fix nil panic when update cache is not found
2018-07-19 dc2dab83 Zhixin Wen Implement update.Recover
2018-07-11 d75b277e Apoorva Jindal Ensure that previous uncompleted job updates are aborted correctly
2018-07-17 7a8c07fe Sachin Sharma Prevent duplicate entries for same Peloton task in Preemption Queue
2018-07-16 63454f85 Apoorva Jindal Add Changelog for 0.7.2 release
2018-07-16 f477c611 Zhixin Wen Change MockJobConfig to MockJobConfigCache
2018-07-11 e880cfa9 Apoorva Jindal Add update service handler and CLI to interact with it
2018-07-13 79294b8d Zhixin Wen Controller job decides terminal state with controller task state
2018-07-13 5dae225a Zhixin Wen Validate controller task config
2018-07-12 bd68c51b Chunyang Shen Disable the periodic GetActiveTask call from JobMgr to ResMgr
2018-07-10 175973ba Aditya Bhave Move retain base secrets code into task config Merge
2018-07-11 7f4d2eba Chunyang Shen Script to send a report Email after performance test
2018-07-05 9975ac8f Apoorva Jindal Add cache and goal state engine for job updates
2018-07-11 839df75d Mayank Bansal Adding Reserver in HostManager which can reserve the hosts for specified tasks
2018-07-11 f05f7a8e Anant Vyas Fix flaky TestPeriodicCalculationWhenStarted
2018-07-10 5ade2bd3 Chunyang Shen Remove Invalid StartTime log
2018-07-10 731041f5 Zhixin Wen Quick fix of cache deadlock
2018-07-10 89e48eb4 Apoorva Jindal Move secrets related stuff used in common from jobmanager to util
2018-06-19 81d4cd27 Chunyang Shen Enable GetActiveTasks() in job manager through a in-memory DB
2018-06-15 5cf37f9b Anant Vyas Increase test coverage for rmtask
2018-07-09 6e616c07 Zhixin Wen Service job cannot be in FAILED/SUCCEED state
2018-07-09 7e179865 Aditya Bhave Changelog for 0.7.1.3
2018-07-06 216215fa Aditya Bhave Instance config should retain secrets from defaultconfig
2018-07-05 e24f25cc Zhixin Wen Share integration test cases between stateless and batch job
2018-07-05 8b0e964c Aditya Bhave [API] Add resourceUsage map to RuntimeInfo proto
2018-07-05 cef8856f Zhixin Wen Goal state engine handles job update
2018-07-05 8fb21d19 Zhixin Wen JobMgr Read API stops refilling cache
2018-06-28 284d2c38 Apoorva Jindal Add DB read and write operations for job updates
2018-07-05 f2a4dd9a Apoorva Jindal Remove unused cached.UnknownVersion
2018-06-12 56970a98 Apoorva Jindal Add support for upgrading a task in the task goal state engine
2018-07-03 efcacc03 Zhixin Wen Clean up jobmgr code
2018-07-03 d189dd94 Zhixin Wen Replace UpdateTasks with PatchTasks
2018-07-03 d2f0ffc4 Zhixin Wen Add task actions for stateless jobs
2018-07-03 3f199d90 Aditya Bhave Add changelog for 0.7.1.2 release
2018-07-01 2d37169e Aditya Bhave Don't enforce instance config to use mesos containerizer for secrets
2018-07-02 7a44efff Zhixin Wen Stateless jobs do not call taskStore.GetTaskEvents
2018-06-29 1321ef44 Zhixin Wen Set goalstate to PREEMPTING for tasks to be preempted
2018-06-28 02b02042 Chunyang Shen Fix peloton-client library in vcluster
2018-06-26 b8ab5acf Apoorva Jindal [API] Fix APIs for supporting job updates
2018-06-28 aab934a0 Zhixin Wen Define RuntimeDiff to replace map[string]interface{} in runtime update
2018-06-28 755d9e9c Varun Gupta Add set API for pod events.
2018-06-27 b71ed5f4 Zhixin Wen Handle job runtime update when more instances running than expected
2018-06-28 73c8a85a Chunyang Shen Remove GetJobConfig from GetTasksByQuerySpec
2018-06-27 3227ddc7 Aditya Bhave Base64 decode data before launching task
2018-06-27 9b2fc07f Varun Gupta Add incremental run id
2018-06-27 4f15791f Zhixin Wen Clean job upon create failure only if job does not exist before
2018-06-25 3f87e90c Zhixin Wen Add stateless PE to deploy script
2018-06-26 3f330e4d Mayank Bansal Adding Test Cases for Tree as well remove duplicate code
2018-06-25 c712b42d Zhixin Wen Use PatchTasks in place of UpdateTasks for simple cases
2018-06-25 c3bb7987 Zhixin Wen Add stateless PE to pcluster
2018-06-25 784c819d Zhixin Wen set TaskType to STATELESS for stateless service
2018-06-25 d25bd862 Zhixin Wen Revert "Add stateless PE to pcluster"
2018-06-25 379e15f1 Zhixin Wen Add stateless PE to pcluster
2018-06-25 97168c60 Mayank Bansal Refactoring respool handler , add more tests as well as remove duplicate code
2018-06-25 4ac65064 Varun Gupta Add task.RuntimeInfo as blob, and update runID as bigint datatype
2018-06-21 a21f23c6 Zhixin Wen use SLA instead of Sla
2018-06-19 f7093dfb Chunyang Shen Fix TaskList crash when job is not in the cache
2018-06-21 9f0287ba Varun Gupta Ignore scarce_resource_tpyes if empty
2018-06-21 87699b09 Mayank Bansal Adding Interface for MultiLevel List as well add tests in queue package
2018-06-21 22418051 Mayank Bansal Adding tests for scalar package as well as remove duplicate code from tests
2018-06-21 b7eea49b Aditya Bhave set env variable ENABLE_SECRETS by reading from cluster config
2018-06-21 e7873680 Zhixin Wen Use ReplaceTasks in place of UpdateTasks with UpdateCacheOnly
2018-06-21 997f41a3 Zhixin Wen Implement PatchTasks and ReplaceTasks
2018-06-21 5f60fc07 Zhixin Wen Implement PatchRuntime and ReplaceRuntime
2018-06-21 f845ca55 Aditya Bhave Fix integ and perf tests to use 0.7.1 py client
2018-06-20 edb372ce Charles Raimbert Add changelog for 0.7.1.1
2018-06-19 38ae7e9f Mayank Bansal Remove duplication from test code as well as separating test case.
2018-06-18 f4b11ef3 Varun Gupta Change datatype from timestamp to timeuuid
2018-06-15 d07c7372 Varun Gupta Add scarce_resource_types to prevent CPU only task to launch on GPU machines
2018-06-18 d19cf933 Charles Raimbert Set Mesos Task Labels for JobID, InstanceID, TaskID
2018-06-15 0ea48583 Mayank Bansal Changing resource manager api from MarkTasksLaunched to UpdateTasksState for updati
2018-06-15 b8ed208d Aditya Bhave Improve unit test coverage for jobsvc
2018-06-15 7be5e73b Varun Gupta Add task events table
2018-06-14 ef6c20aa Zhixin Wen Add unit tests for async processor
2018-06-14 761bf5bb Varun Gupta Reject offers if Unavailability set after maintainence start time
2018-06-14 008b09ea Varun Gupta Restrict Max Retries on Task Failures.
2018-06-14 dc670413 Min Cai Add changelog for 0.7.1
2018-06-08 a0c17697 Chunyang Shen Remove GetJobConfig from DB in TaskQuery and TaskList
2018-06-13 080165de Min Cai Add changelog for 0.7.0
2018-06-13 419c3746 Aditya Bhave Add secrets log formatter to redact secret data in logs
2018-06-13 39806a3e Mayank Bansal Fixing flaky test case for Entitlement Calculation
2018-05-25 9976b6db Anant Vyas Record state transition durations for RMTask
2018-06-13 5fc16ca7 Mayank Bansal Refactoring GetHosts in HostManager as well as some code cleanup
2018-06-13 d14e7b2a Mayank Bansal Adding Reserver in placement engine for host reservation
2018-06-12 605616ae Zhixin Wen Refactor task_test.go to use test suite
2018-05-30 39d5f147 Varun Gupta Refine Cassandra Table Attributes
2018-05-31 dfaa105d Varun Gupta Increase placement engine worker threads
2018-05-30 2b9957b1 Aditya Bhave Job Update/Get API now supports secrets
2018-06-08 a9cf51c2 Min Cai Move v0 Peloton API to protobuf/peloton/api/v0
2018-06-08 cf94ceb8 Zhixin Wen Update changeLog version to max version plus one when update job config
2018-06-04 dd650f00 Anant Vyas Fix deadlock in task tracker
2018-06-07 d109016a Apoorva Jindal Make job recovery failure a fatal error
2018-06-07 2e1949e8 Zhixin Wen Add API to query job and task cache
2018-06-06 fcf1419b Zhixin Wen Job runtime and config read gets data from cache when possible
2018-06-06 340d94c3 Anant Vyas Format code using `gofmt`
2018-06-06 1eb83b06 Varun Gupta Make errChan buffered eq to len of jobsByBatch to prevent gorountine leak on recove
2018-05-25 5877b4a0 Aditya Bhave Fix populateSecrets bug when launching tasks from jobmgr
2018-06-05 97003f1f Zhixin Wen Partially created job set to INITIALIZED and enqueue to goalstate engine
2018-05-29 43d722ce Chunyang Shen Kill the orphan Mesos task before regenerate a new MesosTaskID
2018-05-31 03cafcdc Varun Gupta Add more logs for placement engine
2018-06-01 1d6337d9 Zhixin Wen Remove statusUpdaterRM
2018-06-04 0f146781 Apoorva Jindal Add/cleanup logs to track stuck tasks after failure to launch in job manager
2018-06-04 c664f0da Zhixin Wen All job config and runtime update go through cache
2018-06-01 ae8eec51 Varun Gupta Fix async pool test & minor code refactoring
2018-05-30 c9e2e6a1 Apoorva Jindal Reconcile jobs and tasks in KILLING state as well
2018-05-21 8d08fab3 Varun Gupta Add stop feature for async pool jobs
2018-05-29 0dbbd7ce Varun Gupta Add host manager API support in pCluster for cli
2018-05-29 b756812d Apoorva Jindal Fix GetJobConfig to use the correct configuration version
2018-05-23 402a7235 Apoorva Jindal Add metrics to goal state to help debugging
2018-05-24 9fbb662f Mayank Bansal Making consistent function calls in entitlement calculator and reducing duplicate c
2018-05-24 1a9f2a3a Apoorva Jindal Update changelog for 0.6.14
2018-05-24 55fb5ce9 Zhixin Wen Big int in Cassandra casts to int64 Cast big int in Cassandra to int64 upon read
2018-05-24 b1636d01 Apoorva Jindal Add Changelog for 0.6.14.
2018-05-24 ddeb14da Apoorva Jindal Temporary fix to fail write API calls in a non-leader job manager
2018-05-18 d07b4b80 Apoorva Jindal Fix race in the deadline queue unit tests
2018-05-16 bb007a49 Zhixin Wen Add debug log
2018-05-22 65e46064 Apoorva Jindal Add EnqueueJobWithDefaultDelay helper function to goal state
2018-05-24 9d62ca83 Charles Raimbert Add pit1-prod01 cluster to integ tests config
2018-05-23 a478a674 Varun Gupta Add CLI to get outstanding offers
2018-05-08 f0d476cf Aditya Bhave Peloton Secrets first cut
2018-03-30 4b6603b5 Aditya Bhave Peloton archiver code to query and delete jobs
2018-05-22 d3bb2f6a Apoorva Jindal In task start, enqueue job with a batching delay
2018-05-23 a90da8b9 Mayank Bansal Adding TestUtils for tests to remove code duplication as well as fixing comments
2018-05-18 eb4c6a7a Varun Gupta Add tests for internal hostmanager API service
2018-05-21 6aa532cc Apoorva Jindal Enable log rotation in thermos for peloton daemons
2018-05-22 19773a8a Apoorva Jindal Reduce number of EnqueueJob calls
2018-05-22 927a5070 Mayank Bansal Removing duplication of the code and more consistent logging
2018-01-31 58a50095 Apoorva Jindal Use the version in job configuration protobuf to track job version
2018-05-18 1b7cbf33 Varun Gupta Add Internal API to expose all offers
2018-05-17 56b43841 Mayank Bansal Adding GetHosts api from hostmanager to returns the list of hosts which matches the
2018-05-11 4600e5aa Aditya Bhave Add job query by time range integ test
2018-05-17 316ba0f2 Varun Gupta Stop dispatcher in engine and minor refactoring
2018-05-16 990764d7 Apoorva Jindal All task runtime read calls to DB should be sent to cache
2018-05-16 788b899d Zhixin Wen Add action for invalid job and task state
2018-03-29 f2e7af25 Apoorva Jindal Implement write through cache for task runtime in job manager
2018-05-15 add336e1 Varun Gupta bump peloton-client version
2018-05-15 c8ae495d Zhixin Wen Add protobug to integration test requirements.txt
2018-04-11 0b3319e8 Anant Vyas Add integration test for non-preemptible jobs
2018-05-15 47b3b808 Zhixin Wen Add STARTING task state to goal state
2018-05-14 d8d2f3f8 Varun Gupta Upgrade mesos version in pcluster and mesos proto files to version 1.6.0-rc1
2018-05-14 f1391f52 Anant Vyas Fix locking of RMTask when transitioning during scheduling
2018-05-14 86034a8b Zhixin Wen Fix flaky unit test in TestBucketEventProcessor
2018-04-19 7c988e08 Chunyang Shen Add task fail reason in the metrics
2018-05-10 9be7003d Chunyang Shen Fix debian package build script
2018-05-11 7be1449e Varun Gupta Offer Pool sentry logs, bug fixes, refactoring & unit tests.
2018-05-11 a11870ef Zhixin Wen Fix flaky unit test in observer
2018-05-11 9908d2e8 Zhixin Wen Event stream client functions normally after restarts
2018-05-10 1a6a518a Varun Gupta Add revocable resource support
2018-05-10 f0316bda Varun Gupta Add QoS controller to pCluster
2018-05-10 2038b872 Zhixin Wen Preempt ignores task with KILLED goal state
2018-05-09 88efade3 Chunyang Shen Add Changelog for 0.6.13
2018-05-03 db4fca3e Chunyang Shen Add filtering `name` and `host` for task query
2018-05-08 56de3445 Zhixin Wen Add KILLING state for task
2018-05-08 3625170a Mayank Bansal Deleted tasks from the gangs needs to be removed from gang and enqueued again
2018-03-28 945772f6 Anant Vyas Add option to show progress for `job stop` cli action
2018-05-07 a1bb62f5 Mayank Bansal Making placement backoff feature enable/disbale via config
2018-05-03 cf783bf4 Anant Vyas Update `respool dump` cli to honor `-j/--json` flag
2018-05-03 d843ba61 Zhixin Wen Add job config validatation rule and refactor
2018-05-02 e6dc9827 Varun Gupta Bug fixes at host summary, refactoring and unit tests.
2018-05-03 9d77b70c Mayank Bansal Adding Placement backoff support in Peloton
2018-05-03 34d9554b Apoorva Jindal Check for cached job and task to be non nil before using it
2018-05-01 7e069495 Aditya Bhave Upgrade mesos version in pcluster and mesos proto files to version 1.5.0-rc2
2018-04-16 f466fee8 Anant Vyas Update `GetPendingTasks` API to include `NonPreemptible` queue
2018-04-04 5390f4d5 Apoorva Jindal Always evaluate tasks during recovery except for onces recovered by job recovery
2018-05-02 061bcaec Zhixin Wen Update version of ledership and libkv package
2018-05-02 404aa7f4 Apoorva Jindal Update CHANGELOG for 0.6.12.1
2018-05-01 65529cdb Varun Gupta revert protoc-gen-go version
2018-04-26 4c8caaba Aditya Bhave Add job delete integration tests
2018-05-01 6fa6639e Apoorva Jindal Revert "fix unit test broken by revert of 4533a25"
2018-05-01 aae294e9 Varun Gupta Fix build issue caused by incompatible protoc-gen-go binary.
2018-04-27 fb51bdd9 Zhixin Wen Add integration test for Start and Stop Task API
2018-04-27 47d63cd5 Apoorva Jindal Fix missing TTL error when orphan tasks are killed
2018-04-23 29b30b1d Chunyang Shen Add sorting for task query
2018-04-27 1af1e0f6 Zhixin Wen Fix flaky unit test
2018-04-27 f1c86225 Sachin Sharma Modify preemption queue type (from RMTask to PreemptionCandidate)
2018-04-26 2525a1b0 Varun Gupta Update browse sandbox API to return mesos master hostname and port
2018-04-26 5622a8bd Sachin Sharma Add hostmgr recovery to restore the contents of maintenance queue
2018-04-25 12161785 Zhixin Wen Make candidate resign leadership if GainedLeadershipCallback fails
2018-04-11 82765d01 Varun Gupta Add Configurable FrameworkInfo Capability
2018-04-24 28ce994f Sachin Sharma Add HostService handler code
2018-04-23 1ad85be6 Varun Gupta Bump go version to 1.10 & fix jenkins build.
2018-04-23 13565d73 Varun Gupta Add Offer Type for feedback.
2018-04-10 5a190012 Varun Gupta Add retry for reconciler and explicit reconciliation on hostmgr or mesos master re-
2018-04-23 a9f3e322 Varun Gupta Add Get Mesos Master Host and Port Endpoint to Hostmgr Internal Service.
2018-04-20 9f0820e5 Antoine Pourchet Added more executor tests
2018-04-23 b04d16b0 Zhixin Wen Fill in reason from ResoureManager for PENDING tasks
2018-04-23 04e915b0 Anant Vyas Update changelog for 0.6.12
2018-04-19 385e2e57 Anant Vyas Hide admission of non-preemptible jobs behind a flag
2018-04-20 8c67cc72 Mayank Bansal Checking mesos taskId before removing task from tracker
2018-04-19 43d434f1 Tengfei Mu Enable Aurora health check for Peloton
2018-04-19 892d4d20 Tengfei Mu Update changelog for 0.6.12
2018-04-19 cec2b17c Mayank Bansal Fixing race condition between removing task from tracker and adding the same task w
2018-04-19 a75768ea Zhixin Wen Update health.leader when candidate is not leader
2018-04-19 1bf37a08 Sachin Sharma Add Host APIs
2018-04-18 15593ae5 Sachin Sharma Add comment for channel 'finished' in resmgr/recovery.go
2018-04-18 ee4aa1c4 Zhixin Wen fix unit test broken by revert of 4533a25
2018-04-18 3dcd8586 Zhixin Wen eventstream client send correct purgeOffset upon restart
2018-04-18 08efd127 Zhixin Wen unset completion time when task is running
2018-04-17 2ff81276 Aditya Bhave Revert "Revert "Add 100k task per job limit to master code""
2018-04-17 ac761d07 Tengfei Mu Retry Do not recover FAILED jobs till archiver is committed
2018-04-17 4533a259 Tengfei Mu Revert "Rearchitect the job manager to use the cache and the goal state engine"
2018-04-17 ea242723 Tengfei Mu Revert "Do not recover FAILED jobs till archiver is committed."
2018-04-17 562cfa54 Tengfei Mu Revert "Add 100k task per job limit to master code"
2018-04-17 e422864e Tengfei Mu Revert "Fix completion time for jobs moving from PENDING to KILLED"
2018-04-16 4ff37165 Aditya Bhave Fix completion time for jobs moving from PENDING to KILLED
2018-04-16 8237bdb9 Chunyang Shen Add max_retry_attempts for test__create_job to pass smoketest
2018-04-12 72b606f2 Aditya Bhave Add 100k task per job limit to master code
2018-04-13 61faa9b9 Zhixin Wen enable host tags for metrics
2018-04-10 1c158540 Aditya Bhave Bump up C* timeouts and add timers to recovery code
2018-04-12 30893d02 Sachin Sharma Add Host Maintenance API
2018-04-11 b2333a15 Aditya Bhave Change GC and compaction for tables with large partitions
2018-04-10 a50f8141 Mayank Bansal Adding errorcodes in communication between resmgr and jobmgr for enqueuegangs
2018-04-10 80795a9f Zhixin Wen fix potential memory leak in priorityQueue
2018-03-28 af1c4765 Anant Vyas Make preemptor aware of non-preemptible tasks
2018-03-26 6a286852 Anant Vyas Admission control for non-preemptible gangs
2018-04-09 5783867d Zhixin Wen remove unused api.ResultSet to pass lint
2018-04-09 3c4cb0f5 Varun Gupta Reconcile Staging Tasks
2018-04-05 669ab920 Chunyang Shen Add script to do performance comparison betwwen two versions
2018-04-04 f789056b Chunyang Shen Push to registry docker-registry02-sjc1:5055
2018-04-04 31d3fc86 Apoorva Jindal Do not recover FAILED jobs till archiver is committed.
2018-04-03 e4d186ff Chunyang Shen Fix docker build script and update ATG registry
2018-03-22 e73c573f Apoorva Jindal Rearchitect the job manager to use the cache and the goal state engine
2018-04-02 0812b252 Apoorva Jindal Add a log when transient DB error occur on the hostmgr eventstream path
2018-03-29 afe64575 Apoorva Jindal Fix resmgr reason for state transition
2018-04-02 93e33734 Chunyang Shen Update Glide installation in Makefile
2018-03-26 3afa33c6 Anant Vyas Don't log UUID in sentry error
2018-03-04 b4a1c687 Apoorva Jindal Add a common library to implement a goal state engine
2018-03-23 92b38552 Charles Raimbert Rename metric tag from `type` to `result` for success/fail
2018-03-22 aa75dc11 Aditya Bhave Delete job_index entry as part of DeleteJob
2018-03-20 f0f0dd7a Apoorva Jindal Address remaining review comments on in-memory DB
2018-03-21 2e75317a Aditya Bhave Copy over protoc includes in build script
2018-03-21 eff71b0d Anant Vyas Add changelog for 0.6.11
2018-03-21 4d8ef758 Charles Raimbert Pin down YARPC version in glide to avoid `uber fx`
2018-03-21 6af36b7f Charles Raimbert Use patched docker/libkv for ZooKeeper Leader Election
2018-03-21 549abf4d Anant Vyas Use long running job fixture for `test__stop_long_running_batch_job_immediately`
2018-03-20 9f1b6f60 Sachin Sharma Modify GetTasksForJobAndStates to accept []TaskState parameter instead of []string
2018-03-15 463a517b Aditya Bhave Add integration test for Job Query API
2018-03-16 00d2be77 Apoorva Jindal Do not update the state transition reason on dequeue from placement engine
2018-03-15 5b280782 Apoorva Jindal Correct scheduled task accounting in case of launch errors for maxRunningInstance f
2018-03-04 c36db8f5 Apoorva Jindal Add cache to job manager.
2018-03-19 c54d6c25 Mayank Bansal Adding support for static respool in Tree hierarchy and Entitlement
2018-03-12 9261ce0c Aditya Bhave Add support to query jobs by timerange
2018-03-08 d2f517bd Chunyang Shen Be able to teardown vcluster in any fail in launching or testing vcluster
2018-03-14 ba15886c Aditya Bhave Add runtime info to jobquery cli output
2018-03-14 4aa4a4e1 Apoorva Jindal Always evaluate a job for maxRunningInstaces SLA irrespective of job runtime update
2018-03-13 1b07391d Mayank Bansal Adding Static reservation type in to resourcepool config
2018-03-08 7ac84ca2 Anant Vyas Add integration tests for controller task
2018-03-06 f864d2a3 Chunyang Shen Add a monitor job for vcluster to send data to M3
2018-03-08 4f075db5 Apoorva Jindal Enable integration test for fetching logs of previous task runs of failed task
2018-03-09 8d4afb5c Mayank Bansal Dividing entitlement calculation to phases and adding more tests to entitlement
2018-03-09 ae4c6225 Apoorva Jindal Add changelog for 0.6.10.5
2018-03-07 6e5676a3 Apoorva Jindal Do not overwrite killed state for partially completed jobs
2018-03-08 c0cb84d0 Sachin Sharma Add 'task query' command to CLI to query on tasks(for a job) by state(s)
2018-03-07 2359dacc Anant Vyas Fix race condition in state machine rollback
2018-03-02 78b3c2f9 Anant Vyas Change controller task interface between jobmgr and resmgr
2018-03-07 33a5029e Tengfei Mu Revert range map change
2018-03-06 98c36e28 Aditya Bhave CHANGELOG for 0.6.10.4
2018-03-06 455e7771 Aditya Bhave Remove 7 day time span restriction from querying active jobs
2018-03-05 d34affa5 Apoorva Jindal CHANGELOG for 0.6.10.3
2018-03-05 c0db0380 Apoorva Jindal Handle incomplete killed jobs
2018-03-04 4793be72 Charles Raimbert Add ability to not deploy watchdog
2018-03-02 d29c91af Charles Raimbert Update Changelog for release 0.6.10.2
2018-03-02 ea3f3dce Anant Vyas Revert DequeueGang to get CONTROLLER task as well
2018-03-01 11875d8e Anant Vyas Terminate the statemachine when a task is removed from the tracker
2018-03-01 72a86912 Charles Raimbert Update Changelog for release 0.6.10.1
2018-02-28 99671804 Charles Raimbert Revert "Add 'task query' command to CLI to query on tasks(for a job) by state(s)"
2018-02-28 1a15dee2 Charles Raimbert Update Changelog for release 0.6.10
2018-02-28 4bd1e8d0 Anant Vyas Bump peloton apps mem limit to 16GB
2018-02-28 5ccfc7d9 Apoorva Jindal Enable log rotation for Peloton containers
2018-02-28 0cc9cf7e Sachin Sharma Add 'task query' command to CLI to query on tasks(for a job) by state(s)
2018-02-28 59da5458 Anant Vyas [API] Extend pending tasks API to work with controller queue
2018-02-15 18185907 Tengfei Mu Bump thrift version for peloton deployment tool
2018-02-27 6ebd6eb5 Aditya Bhave JobQuery optimization
2018-02-27 c1a7c106 Anant Vyas Fix updating a nil controller limit metric
2018-02-27 a46704a0 Tengfei Mu Change peloton deploy script to honor apps deployment order
2018-02-27 762c0b4b Anant Vyas Remove unused import
2018-02-27 c5f87e3f Charles Raimbert Fix periodic leader election metrics
2018-02-26 f90c2f6c Apoorva Jindal [API] Add API and CLI to fetch active and pending tasks from resource manager
2018-02-25 5d146090 Anant Vyas Controller tasks scheduling and admission : part two
2018-02-27 b15e692c Tengfei Mu Expose placement reason to resmgr for state tracking
2018-02-27 7135a0e0 Charles Raimbert Retry connection with ZooKeeper if connection dropped
2018-02-27 7b5eef10 Charles Raimbert Fix pcluster for ZK stability
2018-02-21 bbd6abaf Apoorva Jindal Persist FAILED task state into DB even if task needs to be restarted
2018-02-23 22498469 Apoorva Jindal Store reason for a state transition in the state machine
2018-02-23 b5a31b2a Anant Vyas [API] Add controller limit in the resource pool config
2018-02-22 08e31747 Aditya Bhave API: Add summaryOnly flag to QueryRequest
2018-02-22 1f370b1d Anant Vyas Refactor allocation accounting in resource manager
2018-02-21 4e422771 Apoorva Jindal Change job runtime updater run interval for batch to be 10s and recover jobs in KIL
2018-02-20 82ddd1d0 Anant Vyas Controller tasks scheduling and admission : part one
2018-02-16 b74dab97 Apoorva Jindal Add support for fetching task logs for previous task runs
2018-02-22 863047fa Chunyang Shen Create respool based on the size of vcluster
2018-02-22 726b2c9a Min Cai Clean up old Aurora job files for Peloton deployment
2018-02-16 7e20f3ea Min Cai Add a separate Makefile target to generate API docs
2018-02-22 c5ff1ee1 Charles Raimbert Emit Leader Election `is_leader` metrics every 10 secs
2018-02-21 11296b25 Anant Vyas [API] Add Controller task type in the peloton API
2018-02-20 744ae534 Aditya Bhave Fix JobSummary for old jobs, add job query by name
2018-02-15 7d78f8bb CJ Ketchum Add maintenance mode to Peloton using Mesos maintenance primitives
2018-02-21 5767ffa6 Charles Raimbert Fix ZooKeeper Leader Election metrics
2018-02-21 ce560d7b Anant Vyas Revert allocation metric name
2018-02-16 b79bd8ab Apoorva Jindal Add API to be able to fetch different task (runs) of an instance
2018-02-20 5fb44df1 Aditya Bhave Fix sorting in job query response
2018-02-14 63e1aa4c Aditya Bhave [API]: add job query by owner, wildcard search
2018-02-16 ffca81ff Apoorva Jindal Merge branch master into release
2018-02-16 8cfb12de Anant Vyas Fix `allocation` metric scope name
2018-02-08 49b7a7cb Chunyang Shen Add script for run multiple testing
2018-02-14 480d5c34 Anant Vyas Update root resource pool `limit` with cluster capacity
2018-02-14 d9291ed0 Tengfei Mu Update Changelog for 0.6.9 release
2018-02-14 6755cbba Apoorva Jindal Log debug (not error) messages from job/task read handlers
2018-02-02 5deced33 Anant Vyas Track allocation of non-preemptible tasks separately
2018-02-08 e5a18c6c Aditya Bhave Add archiver component to peloton
2018-02-08 52606f09 Apoorva Jindal Batch tasks being sent to resource manager for maximum running instances feature.
2018-02-12 03b874e7 Anant Vyas Fix demand calculation in entitlement calculator
2018-02-12 3040bf6b Anant Vyas Update release version in engdocs
2018-02-08 a9d6fa05 Anant Vyas Add integration test for job update RPC
2018-02-08 e790d27f Apoorva Jindal Evaluate and update job state even if the task stats have not changed.
2018-01-30 02cded6f Apoorva Jindal Add refresh API for both job and task
2018-02-09 c895ea5a Antoine Pourchet Implemented core executor code
2018-01-26 36f83591 Apoorva Jindal A task in launched state times out in job manager
2018-02-07 8feff74f Apoorva Jindal Handle kill for a job which has not fully created all tasks.
2018-02-05 2578fc79 Aditya Bhave Send JobSummary in Job Query Response
2018-01-11 d32b7134 Apoorva Jindal Run job runtime updater as part of job goal state
2018-02-07 34736b9f Apoorva Jindal Handle initialized tasks with goal state set to be failed.
2018-01-16 aef0bf24 Anant Vyas Add GetPendingTasks API in resource manager
2018-02-07 6c3c3d65 Apoorva Jindal Drop the old lucene index and the unused upgrades table in the next migration.
2018-02-07 1250d4a5 Aditya Bhave Fix Down Sync migration script for update_info
2018-02-06 9bcc2faa Apoorva Jindal Add changelog for 0.6.8.2.
2018-02-05 5a53db98 Apoorva Jindal Untrack failed tasks with goal state succeeded.
2018-02-05 83278dc6 Mayank Bansal Releasing 0.6.8.1
2018-02-02 b3bd6169 Aditya Bhave Fix migrate script for job_index
2018-02-02 3cb4fdde Mayank Bansal Adding changelog for 0.6.8 release
2018-02-02 c9e4e0be Mayank Bansal Removing race between different transitions in state machine
2018-02-02 fd31598e Mayank Bansal Adding mesos quota support in cluster capacity call for host manager
2018-01-31 d5bdf5a3 Aditya Bhave Schema and DB change to speed up JobQuery
2018-02-02 436c657b Mayank Bansal Adding Limit support for resource pools
2018-02-01 35042b0b Mayank Bansal Adding the api doc in docs folder
2018-01-31 53312b8a Mayank Bansal Adding apidoc in docs folder from build
2018-01-31 7eaeea31 Mayank Bansal Adding peloton engdocs
2018-01-31 8aaa06fb Mayank Bansal Adding peloton engdocs
2018-01-02 ebc56c36 Anant Vyas Add extra logging in state machine implementation
2018-01-31 f402722c Mayank Bansal Changing api docs to html format
2018-01-25 636a4aca Apoorva Jindal Ignore failure event due to duplicate task ID message from Mesos
2018-01-26 dc476d3d Apoorva Jindal Send kill of PENDING tasks to resource manager
2018-01-24 9d9b9d8f Apoorva Jindal Send initialized tasks during recovery as a batch to resource manager
2018-01-24 8fc8ab4e Zhitao Li Guard against any case when hostname may be missing in offer pool.
2018-01-22 a28e6217 Chunyang Shen Add Script for performance test running
2018-01-11 ec44c2f3 Apoorva Jindal Fix sorting based on creation/completion time in job query
2018-01-24 b3b0fb00 Apoorva Jindal Do not run job action with a context timeout.
2018-01-23 0356c21b Apoorva Jindal Revert "Temporarily, do not recover initialized tasks in non-initialized jobs in jo
2018-01-08 30e582af Chunyang Shen shutdown executor after task kill timeout
2018-01-23 2c0940ec Apoorva Jindal Temporarily, do not recover initialized tasks in non-initialized jobs in job manage
2018-01-19 be811e4d Apoorva Jindal Add changelog for 0.6.6.1.
2018-01-19 fdd75fe5 Apoorva Jindal Do not recover KILLED jobs.
2018-01-19 f7a9fa6a Apoorva Jindal Update Changelog for 0.6.6 release
2018-01-18 6a0ac85a Apoorva Jindal Change update task runtime success message to debug.
2018-01-18 c14e8290 Apoorva Jindal PENDING tasks should not be re-sent to resource manager
2018-01-16 4ac385bf Apoorva Jindal Cleanup in placement processor
2018-01-17 038e9ef6 Tengfei Mu Do not update task runtime for all orphan tasks
2018-01-18 3c0e21cb Casper Kejlberg-Rasmussen Add a stateful integration tests
2018-01-16 57accae7 Casper Kejlberg-Rasmussen Bugfix Mimir placement strategy and bump Mimir-lib
2018-01-15 053e1d80 Apoorva Jindal Fix make test.
2018-01-04 3c551109 Apoorva Jindal Task runtime information in the cache should be either nil or in sync with DB
2018-01-11 4ab2a819 Apoorva Jindal Mesos state STAGING maps to LAUNCHED state in peloton.
2018-01-11 a2f1d44f Mayank Bansal Adding Demand metrics as well updating static metrics with dynamic metrics
2018-01-11 1237db64 Apoorva Jindal Do not recover old terminated batch jobs with unknown goal state.
2018-01-02 c7983c72 Anant Vyas Automatically set GOMAXPROCS to match Linux container CPU quota
2018-01-09 d3d63989 Tengfei Mu Fix stateful placement engine to dequeue and place stateful tasks
2018-01-09 be7f55b8 Zhitao Li Add a counter about number of hosts acquired and released on hostmgr
2017-12-27 2ea54d98 Apoorva Jindal Add update API and DB schema
2018-01-08 3a4aa3d8 Charles Raimbert Add 50k & 100k tasks perf base jobs
2018-01-08 d477da25 Charles Raimbert Fix logging for ELK ingestion
2018-01-08 14a899b8 Casper Kejlberg-Rasmussen Change the placement models so that they will be json serialized when usin
2018-01-05 963690bb Mayank Bansal Adding API doc in peloton repo
2017-10-09 cfdd1462 Anant Vyas Add ability to preempt PLACING tasks
2018-01-03 3ac7df5e Tengfei Mu Use mimir placement strategy for stateful task placement
2018-01-03 b7c11722 Casper Kejlberg-Rasmussen The placement engine now returns failed tasks to the resource manager.
2018-01-02 c19c4c0d Casper Kejlberg-Rasmussen Add support for re-enqueuing unplaced tasks into the resource manager
2018-01-03 0cc2845b Charles Raimbert Update Changelog for 0.6.5 release
2018-01-03 80fa8fe9 Tengfei Mu Skip reschedule stateful task upon task lost event
2017-11-27 880cff21 Anant Vyas Refactor and add tests to resource manager `respool` pkg
2017-12-04 d3f68489 Aditya Bhave Implemented API to get Task Events
2017-12-31 50f2a74e Apoorva Jindal Make kill job faster and fix regression in create job
2017-12-28 2a29a8ae Apoorva Jindal Do not reschedule already scheduled INITIALIZED tasks
2017-08-28 059e755b Chunyang Shen Virtual Mesos cluster setup through Peloton Client
2017-12-29 926bbb9f Tengfei Mu Add cli command to list and clean persistent volume
2017-12-28 78ed4c8e Tengfei Mu Update volume state to be DELETED in resource cleaner
2017-12-07 9bac18eb Chunyang Shen Eable "shutdown executor" for hotmgr
2017-12-26 7e6cded1 Apoorva Jindal Implement job stop using job goal state
2017-12-28 d56ef9f7 Apoorva Jindal Take lock before reading/writing to job struct
2017-12-27 8abaec32 Apoorva Jindal Acquire read lock before getting job in tracked manager
2017-12-27 b23e88b2 Tengfei Mu Fix deployment script to ignore apps that doesn't exist
2017-12-26 4b4c2185 Antoine Pourchet Added mesos client for executor
2017-12-26 8f5d2dde Tengfei Mu Add option to start stateful placement engine in deployment script
2017-12-22 a7dec9ff Apoorva Jindal Allow a job configuration without a default configuration.
2017-12-19 d9b1932e Apoorva Jindal Add support for MaximumRunningInstances SLA configuration.
2017-12-15 732d4a6b Apoorva Jindal Implement job recovery in goal state engine in job manager
2017-12-18 c54a58d1 Apoorva Jindal Move task state to PENDING after enqueuing it to resource manager.
2017-12-19 28270e78 Zhitao Li [hostmgr] Separate reporting between no offer and mismatch status.
2017-12-13 88986d7d Apoorva Jindal Move creation of tasks and recovery into job goal state
2017-12-15 0958a0be Zhitao Li Change scalar.Resources methods from pointer receiver to non-pointer.
2017-12-14 99dbba35 Tengfei Mu Add changelog for 0.6.4
2017-12-14 cfe5911a Apoorva Jindal Skip terminal jobs during job manager sync from DB
2017-12-04 7aa206fc Antoine Pourchet Added the mesos podtask
2017-12-14 b7c7f61c Tengfei Mu Adding Changelog for 0.6.3
2017-12-13 381048fb Min Cai Increase MaxRecvMsgSize in gRPC to 256MB
2017-12-14 9e5fc15c Casper Kejlberg-Rasmussen Merge the placement engine from the master branch into release.
2017-12-13 0d3e5fdc Zhitao Li Move metrics gauage update to asynchronous.
2017-12-13 7d8034a2 Tengfei Mu Update volume state upon stateful task running status update
2017-12-13 b36e5562 Tengfei Mu Add more logging for jobmgr to launch stateful tasks
2017-12-13 cbe1a02d Mayank Bansal Fixing Integration test preprod cluster zk address
2017-12-13 bc95f4ea Tengfei Mu Add reservation cleaner to clean both unused volume and resources
2017-12-06 27442bf7 Apoorva Jindal Add job goal state to job manager
2017-12-13 0dde1556 Tengfei Mu Add materialized view for volume by state
2017-12-13 4abe1bbd Mayank Bansal Adding Changelog for 0.6.2
2017-12-13 9972f82d Mayank Bansal Adding more logging to entitlelement calculator in resmgr
2017-12-12 d76c97d7 Apoorva Jindal Revert "Store default config in job config and instance config in task config"
2017-12-12 4bf70660 Antoine Pourchet Revert "Check in mocks"
2017-11-30 63754d81 Apoorva Jindal Store default config in job config and instance config in task config
2017-12-12 3489f2f2 Mayank Bansal Adding deadline feature in Peloton
2017-12-08 5ea78a0f Tengfei Mu Update changelog for 0.6.1
2017-12-08 8f88ee2f Tengfei Mu Revert "Merge the placement engine from the master branch into release."
2017-12-08 4bb7f3d1 Anant Vyas Add changelog for changes between 0.5.0 and 0.6.0
2017-12-08 8c13634b Anant Vyas Improve Resource Manager recovery performance
2017-12-06 84b9513e Tengfei Mu Add materialized view for volumes by job ids
2017-12-07 c5044b9e Apoorva Jindal Update task runtime state when receiving a mesos kill event
2017-12-06 4203db0e Apoorva Jindal Do not update runtime reason on mesos update always
2017-12-06 7dfd42d8 Tengfei Mu Move volumesvc from hostmgr to jobmgr
2017-12-07 089fae7a Tengfei Mu Check in mocks
2017-10-19 175d75d9 Casper Kejlberg-Rasmussen Merge the placement engine from the master branch into release.
2017-12-05 cbbfd6a0 Apoorva Jindal Kill orphaned tasks in mesos
2017-12-05 e6304772 Tengfei Mu Implement volume list and delete API
2017-12-04 3f6e36cc Apoorva Jindal Add reason and message for every update to task runtime
2017-12-04 6a90d977 Apoorva Jindal Return failed instance list in task stop and task start
2017-12-01 9d06eb5a Apoorva Jindal Handle task start of failed tasks
2017-11-28 60c91aa9 Apoorva Jindal Restart the goal state when placement received for a task which needs to be killed.
2017-11-29 0712675e Apoorva Jindal Handle stopped tasks during reconcialiation.
2017-12-01 c2dddc9d Apoorva Jindal Add yaml files for performance tests
2017-11-30 9a0c9095 Anant Vyas Remove smoketest tag from preemption integ test
2017-11-21 2fb1a428 Apoorva Jindal Porting storage changes from master to release
2017-11-29 d0360b90 Tengfei Mu Add offer cleaner to clean up unused offers periodically
2017-11-27 e24bdb14 Mayank Bansal Resource Manager recovery should not recover Initialized jobs and not fail if some
2017-11-27 89284fcf Apoorva Jindal Add logs to debug job query failure in prod01
2017-11-27 b22b8353 Apoorva Jindal Fix gocql version in yaml file as well
2017-11-20 547fb61e Apoorva Jindal Upgrade gocql version.
2017-11-22 0f723981 Zhitao Li Use the same generated framework id if first time register to Mesos.
2017-11-22 381edadd Zhitao Li Use the same generated framework id if first time register to Mesos.
2017-11-22 6f9c9066 Casper Kejlberg-Rasmussen Merge the placement engine from the dev branch into master.
2017-11-21 7057a664 Tengfei Mu Skip rescheduling terminal state task on LOST status update
2017-11-21 a51a4fc0 Mayank Bansal Adding the check if flag is not present in cluster config
2017-11-21 20bd124c Tengfei Mu Mount secrets if not empty
2017-11-21 20d1708c Tengfei Mu Make secret_config_dir default to empty string
2017-11-21 2dc0a402 Tengfei Mu Make secrets config default to empty
2017-11-21 1549aff4 Mayank Bansal Fixing race condition for accessing the same memory from different threads
2017-11-21 8c8c3456 Mayank Bansal Fixing the environment veriable for default value
2017-11-15 bbac73b2 Anant Vyas Add per task preemption policy
2017-11-20 d83ccc19 Mayank Bansal Adding preemption flag by that it can be tuned based on cluster
2017-11-20 92310001 Tengfei Mu Add more fields into task change history table
2017-11-20 22d1da14 Tengfei Mu Specify agent id during explicit reconciliation
2017-11-20 ee17da82 Tengfei Mu Land commits default to release branch
2017-10-31 f5a4b183 Zhitao Li Move secret file/path lookup from merged yaml to environment variable.
2017-11-20 c92775a7 Tengfei Mu Merge conflict
2017-11-14 77a26f65 Tengfei Mu Add placement task type env var to support different placement engine type
2017-11-14 13b7abcc Tengfei Mu Update job state after start tasks call and start LAUNCHED tasks
2017-10-31 70e41782 Tengfei Mu Add canssadra auth capability
2017-10-20 b2a1124d Tengfei Mu add more logging on all handlers so that we can better debug UI/api calls
2017-11-06 8c6d102d Anant Vyas Fix possible resource leak with defer
2017-10-31 1512a566 Anant Vyas Add validation for submitting jobs to the root resource pool
2017-10-27 b663067d Anant Vyas Auto migrate should default as false
2017-10-24 538e5cc2 Anant Vyas Fix log fields formatting for Resource manager
2017-10-23 7efcc36e Anant Vyas Fix error wraping
2017-10-11 35565f67 Anant Vyas Check task resources when filtering for preemption
2017-11-06 2525868d Apoorva Jindal Change tag named type success/fail to result success/fail in storage layer.
2017-10-25 e6243dad Apoorva Jindal Handle Cassandra errors in the storage layer
2017-11-06 cfc7c4b8 Apoorva Jindal Add watchdog test
2017-11-15 7047e8f9 Anant Vyas Add per task preemption policy
2017-10-27 36f384fe Apoorva Jindal Add additional logs when AlreadyExistsErrorf occurs
2017-10-27 812396d0 Apoorva Jindal Add additional logs when AlreadyExistsErrorf occurs
2017-10-27 9cad48b7 Apoorva Jindal Handle runtime being null
2017-10-19 1bdf74b3 Apoorva Jindal Resend task to resource manager if launch task errors out
2017-11-15 33ce50a4 Apoorva Jindal Add sorting to peloton CLI
2017-11-15 bfb2ea28 Apoorva Jindal Add sorting to peloton CLI
2017-11-16 4c510e7b Mayank Bansal We need to only invalidate tasks if they are in PENDING or INITIALIZED State in res
2017-11-16 e26546ea Apoorva Jindal Increasing retry to 3.
2017-11-14 1eaf3e24 Tengfei Mu Update job state after start tasks call and start LAUNCHED tasks
2017-11-06 0f9c46c2 RAHUL MEHROTRA Adding peloton dashboard to version control
2017-11-16 2f71b22f Apoorva Jindal Fixing cherry pick.
2017-11-15 1c1c95d9 Apoorva Jindal Retry once in case of mesos system error.
2017-11-15 1d95d7dd Apoorva Jindal Retry once in case of mesos system error.
2017-11-14 706bb1dd Tengfei Mu Add placement task type env var to support different placement engine type
2017-11-06 2ebf70c0 Apoorva Jindal Change tag named type success/fail to result success/fail in storage layer.
2017-11-14 f8fc7d24 Apoorva Jindal Set cluster auto-migrate to false by default
2017-11-14 ec89face Apoorva Jindal Set cluster auto-migrate to false by default
2017-11-14 1d2a2608 Apoorva Jindal Fix GetTasksForJobAndState func call param during explicit reconcile
2017-11-14 dfb34b19 Apoorva Jindal Replace tag type with result
2017-11-14 7626f0a6 Apoorva Jindal API failure counters
2017-11-10 82a0ca3c Zhitao Li Clean up some unstructured log.
2017-10-30 8668eaba Anant Vyas Add ability to Update a resource pool
2017-11-09 8a3abb95 Tengfei Mu Fix GetTasksForJobAndState func call param during explicit reconcile
2017-11-06 6e9d9183 Anant Vyas Fix possible resource leak with defer
2017-10-30 6529f2fa Anant Vyas Add ability to Update a resource pool
2017-10-25 f8dfca08 Apoorva Jindal Handle Cassandra errors in the storage layer
2017-11-08 fdae953e Tengfei Mu Skip health check if not enabled
2017-11-02 aa9b54db Apoorva Jindal API failure counters
2017-11-06 35b5ef22 Apoorva Jindal Add watchdog test
2017-10-31 617d49d3 Anant Vyas Add validation for submitting jobs to the root resource pool
2017-11-03 79c8b34d Tengfei Mu Add MaxLimit as parameter for jobquery
2017-11-03 6020fafe Tengfei Mu Add MaxLimit as parameter for jobquery
2017-11-03 5ff7fd75 Apoorva Jindal Limit the number of go routines to create tasks
2017-10-31 218ca290 Apoorva Jindal Limit the number of go routines to create tasks
2017-11-03 a1b63777 Anant Vyas Use non synchronized `isLeaf` in `respool.CalculateAllocation`
2017-11-03 d3f7eb25 Anant Vyas Use non synchronized `isLeaf` in `respool.CalculateAllocation`
2017-10-05 819a1a4a Francois Galilee Nano timing for log
2017-08-16 503b830c Anders Johnsen Implement basic upgrade logic
2017-10-31 f4e536e4 Anders Johnsen Use GSE for initial Gang scheduling
2017-10-31 fa9efa2b Tengfei Mu Add canssadra auth capability
2017-10-31 fb393b86 Zhitao Li Move secret file/path lookup from merged yaml to environment variable.
2017-11-01 a84a1073 Anders Johnsen Downgrade GSE logging form err to info and log error count
2017-10-27 ac265bc2 Apoorva Jindal Add additional logs when AlreadyExistsErrorf occurs
2017-10-31 4a50cfae Francois Galilee Fix race in event tests
2017-10-31 e9945a02 Francois Galilee Add integration test for gang scheduling
2017-10-31 f8cf10d7 Francois Galilee Go through the tracked Job to perform tasks actions to prepare for SLA compliance a
2017-10-31 a9422f3e Anders Johnsen Follow style guide for `service` naming in upgrade API
2017-10-27 25dbb51c Anant Vyas Auto migrate should default as false
2017-10-26 42e462f5 Anders Johnsen Redo Upgrade API before we expose it in clients
2017-10-20 51df4675 Tengfei Mu Simplify mesos connection loop code
2017-10-27 76fc5b61 Apoorva Jindal Handle runtime being null
2017-08-28 e0b6b6fb Chunyang Shen Virtual Mesos cluster setup through Peloton Client
2017-10-25 746b3741 Apoorva Jindal Replace tag type with result
2017-09-25 879b2b84 Anders Johnsen Add new task states PENDING_HEALTH and UNHEALTHY for pending/failing health checks
2017-10-26 88cc28e5 Anders Johnsen Fix cache clear of task runtime, when cached revision is higher than 0
2017-10-24 897e13ef Anant Vyas Fix log fields formatting for Resource manager
2017-10-23 814bd663 Anant Vyas Fix error wraping
2017-10-11 b5de9b65 Anant Vyas Check task resources when filtering for preemption
2017-10-23 b48e03a0 Mayank Bansal While recovery in resmgr running and launched states are not moved to correct state
2017-10-23 1c18cba0 Mayank Bansal While recovery in resmgr running and launched states are not moved to correct state
2017-10-19 712ad148 Apoorva Jindal Resend task to resource manager if launch task errors out
2017-10-23 2ddc26be Anders Johnsen Generalize the scheduler component of the jobmgr tracker
2017-10-23 7bcf5d23 Anders Johnsen Mark tests with `pcluster` if start/stop of containers is part of the test
2017-10-22 7d77bbec Anders Johnsen Undo breaking protobuf changes
2017-10-23 cca58a94 Anders Johnsen Fix master
2017-10-23 9e7edc09 Anders Johnsen Process event listeners after primary handling
2017-10-19 418da8bd Anders Johnsen Limit usage of JobConfig in jobmgr
2017-09-19 cea80938 Anders Johnsen Make configs (job and task) immutable
2017-10-17 0bfa29ae Mayank Bansal Fixing automigrate for production
2017-10-20 8f11a77a Tengfei Mu add more logging on all handlers so that we can better debug UI/api calls
2017-10-19 84b60b0c Tengfei Mu change defaultQueryLimit and unset instanceConfig from jobquery response
2017-10-15 eaf84ca7 Anders Johnsen Clear cached task runtime if runtime update fails
2017-10-19 d13e1f20 Tengfei Mu change defaultQueryLimit and unset instanceConfig from jobquery response
2017-10-10 0a95c269 Francois Galilee Cherry pick: C* load reduction: do not update JobConfig columns in job_index when o
2017-10-15 1d32ca99 Anders Johnsen Take copies of task runtime in tracked, ensuring changes to runtime is not leaked
2017-10-18 027a79cd Anders Johnsen Merge branch 'dev'
2017-10-17 70a0147e Mayank Bansal Fixing automigrate for production
2017-10-11 1c809d95 Anant Vyas Disable preemption by default
2017-10-16 ad1452cc Mayank Bansal Rejecting tasks if they have been removed from the tracker
2017-10-13 c8a2395a Justin Pinkul Fixing a typo in the Peloton task service protobuf
2017-10-10 273fc42e Mayank Bansal Fixing demand when killing tasks in pending queue
2017-10-05 68c09b5e Anant Vyas Add ability to preempt READY tasks in resource manager
2017-10-04 baeda58a Apoorva Jindal Fix the job recovery path
2017-10-05 bb410fc5 Anant Vyas Adds task preemptor for RUNNING tasks in jobmgr
2017-09-29 b484ff1d Shi Lu Comment out the CAS logic when updating the task runtime. This is to test if it mit
2017-10-02 6af37f3a Anant Vyas Add preemption API in resource manager
2017-09-26 ca2d97b7 Chunyang Shen using flag in query
2017-09-28 d94e6b67 Mayank Bansal Removing tasks from the pending queue during admission control if they are killed.
2017-09-26 512b05ca Min Cai [API] Refactor job, task and respool service APIs
2017-09-14 1f028268 Anant Vyas Initial checkin for task preemption support in resource manager
2017-09-20 f6f33b12 Min Cai Bump up GO version to 1.8 for deb package build
2017-09-19 69312b77 Charles Raimbert Fix broken Peloton ELK due to type of `msg_.job_id`
2017-09-19 3c5f9ce5 Casper Kejlberg-Rasmussen Add a integration test that runs a 100 tasks with different resource const
2017-09-13 65a4cd0c Anders Johnsen Use UUID4 instead of UUID1
2017-09-11 3a152784 Anders Johnsen Add revision(version) field to task runtime
2017-09-19 6b0f00e2 Anders Johnsen Fix pagination when using respool and simplify query
2017-09-19 9cb61afa Anders Johnsen Fix race in test
2017-09-14 b701be79 Anders Johnsen Use grpc internally
2017-09-15 6448db1b Min Cai Set max gRPC message size to 64MB
2017-09-14 463b3faf Francois Galilee Batch the fetch of tasks from taskStore when recovering a job.
2017-09-08 50ab4ac2 Mike Deats Ensuring state count metrics do not go negative, and that we do not update counts f
2017-09-12 3f50b4b5 Mayank Bansal Fixing Kill for initialized tasks
2017-09-11 ab9d01d5 Anders Johnsen Improve performance of computing job runtime stats
2017-09-07 35244796 Mayank Bansal Adding more logging and reverting timeouts from dev yaml
2017-09-08 b5660283 Mike Deats Increasing max wait time and allowing for instance failures during deploy
2017-09-07 2197092b Francois Galilee Add missing metrics for JobUpdate API endpoint.
2017-09-06 12021fe8 Francois Galilee Fix bug: JobRuntimeUpdated metrics are now updated when the ticker StateUpdateInter
2017-09-06 cf208310 Francois Galilee Fix bug: error from taskStore are now properly reported by Task Start and Stop API
2017-09-08 f50e9702 Francois Galilee Add Active Tasks and Jobs Metrics in Tracker Manager.
2017-09-08 33b00829 Casper Kejlberg-Rasmussen Refactor the placement mappers and models
2017-09-08 ed374bdf Francois Galilee Fix risk of unintended race: job's aging check during job's recovery is now based o
2017-09-07 2739aa16 Anders Johnsen Clear tracker when jobmgr is stopped
2017-08-25 fc6aaf27 Anant Vyas Revert "Revert "Add validation for the root resource pool""
2017-08-29 5e093bc1 Anders Johnsen Remove old start/retry code in favor of goal-state engine
2017-08-21 d0c68fe8 Anders Johnsen Support killing tasks while in the INITIALIZED state
2017-09-05 7371552e Mike Deats M3 config for cluster tag is not being correctly overriden by CLUSTER environment v
2017-09-02 0f788b8a Anders Johnsen Don't lock runtime updater while processing job
2017-08-31 2d0f7e1b Mayank Bansal Fixing floating point round off errors during resource subtraction
2017-08-31 ad66b34c Mayank Bansal Task Scheduler to first transit to ready and then add the gang to ready queue
2017-08-31 01061a65 Francois Galilee Add tracked manager metrics.
2017-08-30 10778275 Anders Johnsen Untrack tasks when they become unactive
2017-08-31 10b44f7e Francois Galilee Move start-task logic into goal-state engine / tracker
2017-08-30 99c1abfc Anders Johnsen Fix job lucene query string
2017-08-24 0da3cfa4 Anders Johnsen Use tracked manager in task updater directly
2017-08-30 42baa6fa Anders Johnsen Fix linting
2017-08-29 a0bb423b Anders Johnsen Refactor UpdateTask from Job to Manager in tracked
2017-08-29 310f2a4c Mayank Bansal Adding task information into state recovery logging
2017-08-29 8ae36100 Justin Pinkul Deployment script changes to support WBU2
2017-08-25 1a785ddb Anders Johnsen Unify creation of INITIALIZED tasks
2017-08-28 734721a9 Mayank Bansal Exiting if resmgr recovery fails
2017-08-24 c3880571 Anders Johnsen Move handle of events to traced manager, and clean up APIs
2017-08-25 776cf3a9 Anders Johnsen Use short 1s job state refresh, in development setup
2017-08-28 5bc08ba6 Casper Kejlberg-Rasmussen Removed extra names in Mesos attributes and bump integration test resource
2017-08-23 e653de38 Justin Pinkul Adding the ability to set the auto migration flag via the CLI or an environmental v
2017-08-25 2eab0f9e Mayank Bansal Optimizing Entitlement calculation and removing dead lock when resources become les
2017-08-25 e8f95776 Tengfei Mu increase offer timeout from 5mins->30mins to reduce offer churns
2017-08-23 52233007 Tengfei Mu add volume manager handler
2017-08-24 9480813e Shi Lu Adjust the log lines for sentry, by using log.withField() instead of formatting
2017-08-22 aeec0f55 Shi Lu Change the host manager task update to ack the mesos status update after event are
2017-08-23 4fb206ec Casper Kejlberg-Rasmussen Add a file descriptor resource to the integration tests. Also add Mesos at
2017-08-23 609b1f6e Tengfei Mu parse the enable_sentry_logging flag
2017-08-23 6af1ee9c Mayank Bansal Hooking up physical capacity to entitlement
2017-08-23 1bfb65de Tengfei Mu refactor volume API to use peloton.VolumeID everywhere
2017-08-22 d1c5b1b4 Tengfei Mu change ready/placing offer metrics when we reset offers, add more logging
2017-08-23 d13288b2 Anant Vyas Revert "Add validation for the root resource pool"
2017-07-13 6fde824d Anant Vyas Add validation for the root resource pool
2017-08-21 f20412e7 Tengfei Mu proto change for volume manager api
2017-08-21 f7fd82fd Mayank Bansal Fixing validation for initialized or pending state
2017-08-21 3614af70 Mayank Bansal Cleaning up sentry warnings
2017-08-21 410c74b0 Tengfei Mu disable resmgr task transition logging
2017-08-15 8e0fe777 Zhitao Li Remove checkpoint from config and force it to be always true.
2017-08-14 86f495aa Min Cai Reland: Upgrade python client to use gRPC for integration tests
2017-08-18 2d0f74c7 Anders Johnsen Split goalstate engine and tracker into seperate packages, with seperate responsibi
2017-08-21 41356596 Anders Johnsen Revert "Upgrade python client to use gRPC for integration tests"
2017-08-14 cae8c148 Min Cai Upgrade python client to use gRPC for integration tests
2017-08-19 b84913d5 Mayank Bansal Removing resources only after Initialized or Pending State
2017-08-19 7f9b9f87 Tengfei Mu add more logging to hostmgr host pruner
2017-08-18 5a433bb0 Anant Vyas Add task reconciliation in resource manager
2017-08-17 00ade6f5 Mayank Bansal JobManager should call Resmgr Kill API's
2017-08-19 7d35cfce Tengfei Mu add more logging in jobmgr
2017-08-19 1e7f64ad Shi Lu Correct the logic for per-job lastTaskUpdateTime handling
2017-08-18 c0076529 Shi Lu Adjust job runtime updater check all job frequency
2017-08-18 422c15e4 Jimmy Wu Revert "Retry immediatly if dequeue-gang times out in placement"
2017-08-18 48cd3120 Tengfei Mu add physical resource capacity
2017-08-17 8b3188a1 Tengfei Mu retry launch batch tasks
2017-08-16 f6e047c1 Shi Lu Add counters for DB execute / execute batch failure / success counts
2017-08-18 91e997cc Casper Kejlberg-Rasmussen Bump Mimir-Lib and fix unnoticed Go Vet errors on master branch.
2017-08-18 2c426271 Casper Kejlberg-Rasmussen Refactor the placement engine loop
2017-08-17 174708b6 Mayank Bansal Fixing respool ID in Job Update API
2017-08-15 244f3230 Tengfei Mu add general retry module
2017-08-17 aa4a8ce1 Mayank Bansal Removing Resource Pool ID from respool protobuf and use peloton respool id
2017-08-17 44139605 Charles Raimbert Add host pruning as a background work
2017-08-16 ff9ea1f9 Anders Johnsen Refactor goalstate engine to track jobs as well
2017-08-16 b2a75f3c Anders Johnsen Use go 1.8
2017-08-16 bc23cc80 Mayank Bansal Adding per task states
2017-08-15 ea6a2882 Mayank Bansal Adding Delete ResourcePool API in Resmgr
2017-08-15 6c32dd73 Jimmy Wu Add script for cassandra schema migrations
2017-08-14 febe1d37 Shi Lu Add AddEvent metrics for event stream handler
2017-08-15 e159920a Anders Johnsen Change grpc port for resmgr, due to conflict with kafka
2017-08-14 d99fc5ac Charles Raimbert Peloton ELK: fix dynamic keys in logs
2017-08-14 ada19cb4 Shi Lu Adjust log level line for setry
2017-08-07 0d1fa5ed Anders Johnsen Use a priority heap to schedule goalstate engine workload, based on timeouts
2017-08-11 050e752f Tengfei Mu emit task stop & lost stats
2017-08-09 aa3bb4e9 Anders Johnsen Expand integration test suite to allow start/stop of containers
2017-08-11 7b5c1a84 Anders Johnsen Fix mesos secret paths for non-docker hostmgr
2017-08-09 4b5dc8c6 Zhitao Li Support HTTP authentication on Mesos HTTP API.
2017-08-10 2d70da1f Mayank Bansal CLI changes for deleting the resource pool
2017-08-10 94e817c4 Anders Johnsen Retry immediatly if dequeue-gang times out in placement
2017-08-04 ac031904 Tengfei Mu add mesos_agent_work_dir env into deployment script
2017-08-09 299adfa5 Jimmy Wu Emit leader metrics from heartbeat
2017-08-09 e6eaae59 Zhitao Li Try to fix integration test.
2017-08-09 a512423a Mayank Bansal Making Resmgr Task idempotent
2017-08-09 44fa4f9b Zhitao Li Try to use default error code from yarpc.
2017-08-08 e31f763d Anders Johnsen Expose grpc ports in pcluster
2017-08-08 f2824f8a Tengfei Mu populate persistent disk resources after assembling task launch resources
2017-08-08 dfc72cb5 Shi Lu Add capability to specify multiple states in the query job api
2017-08-08 9a32dabd Mayank Bansal Refactor and fix event stream updates acknowledgement
2017-08-08 1b7b0bb1 Casper Kejlberg-Rasmussen Add a copy of the Mimir library and a make target to update it.
2017-07-05 28e5794f Anders Johnsen Add skeloton upgrade manager for progressing upgrades
2017-08-07 d41fa38c Anders Johnsen Add initial goalstate config
2017-08-07 4f802286 Anders Johnsen Make goalstate JMTask and Tracker private
2017-08-07 723d6dc9 Anders Johnsen Sync task state from DB at start (recover and leader change)
2017-08-07 e139ebe5 Anders Johnsen Rename goalstate keeper to engine
2017-07-31 6d8eb71f Anders Johnsen Track and evaluate state vs goal-state in jobmgr task keeper
2017-08-04 0e90bc19 Jimmy Wu Support setting local datacenter for Cassandra
2017-08-04 e4672a7a Tengfei Mu add mesos workdir env
2017-08-03 a3a69c53 Tengfei Mu query slave logs by constructing url directly
2017-08-03 371a515a Mayank Bansal Refactoring set placements
2017-08-02 4115a46b Mayank Bansal Optimizing remove tasks call and sep placement calls for multiple locks
2017-07-31 d584fb63 Min Cai Support gRPC transport for Java and Python clients
2017-08-01 c6963979 Mayank Bansal Adding kill Task support in resource manager
2017-08-01 92529849 Shi Lu Add test cases to job query paging, also add order by support
2017-07-31 f8b800c7 Tengfei Mu save mesos agent id in task runtime
2017-08-01 d7088880 Jimmy Wu Enforce no-cache for docker build
2017-07-30 dc6fc88b Anders Johnsen Fix missing PELOTON envs when env is empty
2017-07-31 bba77e9c Mayank Bansal Fixing Tests for Peloton hostmgr service
2017-07-31 f3093de2 Mayank Bansal Adding Kill Api
2017-07-28 c4ee2ddf Jimmy Wu Change peloton docker registry path from uber-usi/peloton to vendor/peloton
2017-07-28 957d3af5 Alexey Pavlenko Minor fix of `make pcluster-teardown`.
2017-07-27 137ddc5e Jimmy Wu Add support to run integration tests by tags
2017-07-28 db9c17a7 Mayank Bansal Introducing Launching state and adding timeout for launching
2017-07-26 b4951198 Tengfei Mu clean up hostmgr logs
2017-07-26 97d125a3 Charles Raimbert Add RegisterWork method as part of Background Manager interface
2017-07-26 d756b5be Charles Raimbert Refactor hostmgr/offer go pkg
2017-07-25 4abb8e86 Tengfei Mu retry schedule if hostmgr doesn't have enough resource to run task
2017-07-25 743efb3c Mayank Bansal Caping entitlement to demand and redistribute rest of the resources to other respoo
2017-07-24 b4e9113d Anant Vyas GetChildReservation should not return error when there are no children
2017-07-25 22fb0e81 Anant Vyas Pin mockgen to v1.0.0
2017-07-20 9c583114 Anant Vyas Move DB migrations to hostmgr
2017-07-22 6d4d2b67 Min Cai Support SJC1/DCA1 clusters in deployment script
2017-07-21 00f1eaeb Tengfei Mu retry task launch if when launch get invalid offer failure
2017-07-21 594f917f Tengfei Mu start stateful instance to launch on reserved resource directly
2017-07-20 981a96d0 Mayank Bansal Fixing sentry alert for noop error
2017-07-20 4e279703 Jimmy Wu Add endpoint to dump resmgr task states
2017-07-19 25addbca Anders Johnsen Upgrade YARPC to version 1.11
2017-07-19 2baf8ef3 Charles Raimbert Refactor background Manager inside common/background/work.go
2017-07-19 680392a4 Tengfei Mu hook up task goalstate keeper with status update event stream
2017-07-19 ea161de1 Jimmy Wu Update deploy config with latest build
2017-07-18 564dbb73 Min Cai Initial implementation of Peloton deployment tool
2017-07-19 5e01282d Jimmy Wu Drop infra in docker image url
2017-07-19 1e57c861 Jimmy Wu Update aurora deployment files to generate a new build
2017-07-18 27e418a3 Jimmy Wu Remove code to tag and push latest for Peloton docker image
2017-07-18 f7a06a46 Mayank Bansal Adding demand and resource hierarchy to entitlement calculation
2017-07-17 868b2849 Tengfei Mu run peloton on preprod sjc1/dca1 clusters
2017-07-13 67dd82e0 Jimmy Wu Change batch test to not use docker images
2017-07-13 dc8788d4 Jimmy Wu Fix the regression in task creation
2017-07-11 6f709f37 Tengfei Mu tag sentry event with cluster id
2017-07-10 fa2ae43c Tengfei Mu integrate sentry with peloton
2017-07-10 a0d6f588 Tengfei Mu use lowercase sirupsen/logrus pkg
2017-07-10 c0a29219 Tengfei Mu only acuiqre up to tasks num hosts
2017-07-10 8ec1bc9f Tengfei Mu add more debug logging into pe
2017-07-06 a977f6d2 Justin Pinkul Adding the java pacakage configuration for compatability with gRPC JVM clients.
2017-06-27 97b22d2e Anant Vyas Fix res manager recovery to comply with gang semantics
2017-07-05 303491fe Anders Johnsen Delay creation of Peloton specific env vars to mesos task materialization
2017-07-06 cb4f15e1 Anders Johnsen Detach tasks creation from Create handler
2017-06-29 81fb4b6c Anders Johnsen Redo job/task tables, split runtime and config, and introduce versioning
2017-07-07 4abad711 Anders Johnsen Delete old mesos containers, when doing pcluster teardown
2017-06-27 08ce56d4 Anders Johnsen Cap cassandra container to 1 GB of memory
2017-07-06 7657e146 Anders Johnsen Teardown all mesos-related containers before starting pcluster
2017-07-06 3f3c9a56 Anders Johnsen Lower task count from 1000 to 50 in integration test
2017-06-28 c57ba19f Anders Johnsen Create the upgrade in Storage on valid input
2017-07-06 3e9f563b Anders Johnsen Revert "Redo job/task tables, split runtime and config, and introduce versioning."
2017-06-29 224b305b Anders Johnsen Redo job/task tables, split runtime and config, and introduce versioning.
2017-07-06 8a36d5af Casper Kejlberg-Rasmussen Add a 1k batch integration test job.
2017-07-05 2e2c18a6 Shi Lu update the cql content after renaming an index
2017-07-05 7c7b439a Anders Johnsen Fix Timer vs Ticker in resmgr calculator
2017-06-27 9f146578 Shi Lu Add job_index table which keeps information for both job runtime and job config
2017-07-05 5de8c2c8 Tengfei Mu skeloton code for task goalstate keeper
2017-07-04 825ba5e6 Anders Johnsen Trigger entitlement calculation when the resource pool changes
2017-06-27 c8b6207e Casper Kejlberg-Rasmussen Add an endpoint to the task.Tracker to get all tasks of a given task type
2017-06-26 a91de358 Casper Kejlberg-Rasmussen Make the Scheduler.Dequeue method blocking and respect the timeout paramet
2017-06-30 8487c09d Mayank Bansal Resource counting of running tasks while recovery and fixing hostmanager event stre
2017-06-30 abd3e454 Anders Johnsen Calculate entitlement at start, and reuse Go timer
2017-06-29 007342f8 Tengfei Mu parse jobid and instanceid from mesos task id
2017-06-27 696b51ad Anant Vyas Fix tagging of resource pool metrics with respool path
2017-06-29 48ed9abf Tengfei Mu implement task start api
2017-06-29 36bace02 Jimmy Wu Not log error when there is no gang/task in pending queue
2017-06-29 8dd60a70 Tengfei Mu hostmgr write volume into db when launch task
2017-06-29 b4d2e7c9 Jimmy Wu Refactor cli job get command to show full getJobResponse
2017-06-22 4d4ae04f Min Cai Add Aurora client and thrift API files
2017-06-28 dc60d52a Anders Johnsen Implement basic storage operations for workflows
2017-06-22 88415d28 Anders Johnsen Fix QueryTasks on cassandra
2017-06-27 d654e669 Anders Johnsen Add skeleton code for Upgrade Manager in jobmgr
2017-06-21 af8839a2 Shi Lu Modify task table primary key to include intance id
2017-06-26 8df10601 Tengfei Mu add launchstateful task call for statefultask
2017-06-26 abf91fbc RAHUL MEHROTRA Fixes package version to allow naming scheme for dak
2017-06-22 744b525d Zhitao Li Create periodic refresher for hostmap.
2017-06-23 b63a8946 Casper Kejlberg-Rasmussen Add support for getting all tasks on a given set of hosts. Diff made to se
2017-06-20 4636d723 Casper Kejlberg-Rasmussen Support dequeing tasks by type
2017-06-22 22bd4c18 Anders Johnsen Implement DeleteJob in cassandra
2017-06-22 94f2ad09 Mayank Bansal Calculating usage recursively for all the childs
2017-06-22 d20f2635 Jimmy Wu Generate integration test report
2017-06-22 715212d5 Casper Kejlberg-Rasmussen Add Cassandra database migrations to add the TaskInfo field to the materia
2017-06-20 8470776b Mayank Bansal Adding timeout for Placing to Ready state in remgr
2017-06-19 75a04fa4 Min Cai Upgrade peloton to 0.2.3
2017-06-19 9de02c75 RAHUL MEHROTRA Fixing package name
2017-06-19 bce7b4db Jimmy Wu Upgrade peloton-client to 0.2.3 for integration test
2017-06-17 9754183f Anders Johnsen Roll yarpc to version 1.9.0
2017-06-18 3551639f Min Cai Deploy 0.2.2 to IRN1/SJC1/DCA1
2017-06-17 942c3d92 Min Cai Add start and completion timestamps to task runtime
2017-06-16 d4a585d1 Anant Vyas Fix recovery to only enqueue unfinished tasks
2017-06-16 f92ea3d1 Zhitao Li Fix port picking logic for non-default role.
2017-06-15 cfd0e315 Tengfei Mu retry reschedule on task lost and discard orphan task status update
2017-06-13 e2c5f6ca Min Cai Fix deb package build
2017-06-13 3e731d8d Anant Vyas Add Metrics for Resource Pool Pending Queue Size
2017-06-12 9fb0faaf Tengfei Mu add metric for available hosts
2017-06-08 62c6f839 Min Cai Remove pinned hosts and upgrade to 0.1.0-250-ge82abd4
2017-06-09 36cf36da Shi Lu Initialize the task stats for a job to consider all tasks as initialized
2017-06-09 df03290e Anant Vyas Fix integration tests
2017-06-09 00e1cc2d Anne Holler Upgrade peloton-devel01 to latest
2017-06-08 1342f2d3 Anant Vyas Add metrics for resource manager
2017-06-09 18841918 Anders Johnsen Return job runtine when doing jobs query
2017-06-09 ed73ef41 Anders Johnsen Fix job query for labels/keywords containing double quotes
2017-06-08 267ee697 Anders Johnsen Don't request empty keywords in cli when executing job query
2017-06-07 e82abd4f Anders Johnsen Report metrics through YARPC
2017-06-06 33ab1c46 Anders Johnsen Avoid cassandra dependency in resource manager tests
2017-06-07 5d1ccc00 Tengfei Mu remove task handler info logs and also iterate completed_frameworks for logs
2017-06-07 b057d3d6 Tengfei Mu download logs files then output in console
2017-06-07 84e71b25 Min Cai Temporarily set max_batch_size_rows to 1
2017-06-07 f78be931 Anne Holler Upgrade irn1 peloton-devel01
2017-06-06 36f99d43 Anders Johnsen Generalize task and job query with pagination support
2017-06-07 9c8a7e9e Zhitao Li Expose filter result in hostmgr interface.
2017-06-06 729dd9a7 Mayank Bansal Adding more tests
2017-06-06 0d4b6e34 Anne Holler Convert ready queue to gang entries; unify the pending and ready queue gang represe
2017-05-17 4a8f89e2 Anders Johnsen Move Label(s) from mesos to peloton API
2017-05-24 25184958 Anant Vyas Fix resource manager task refill for cassandra store
2017-06-05 ae54df37 Mayank Bansal Adding resource counting
2017-06-05 3fc5db05 Shi Lu Add example peloton job for using updated atc laptop image (with CUDA 8 / tf 1.1.0)
2017-06-02 51b62391 Mayank Bansal Adding Usage Api
2017-05-30 9b79bd3a Anders Johnsen Remove panic recovery from placement and remove rootCtx
2017-05-30 97b4c2db Anders Johnsen Add context to Storage API and and upstream
2017-06-01 67f2593e Anders Johnsen Add arguments to generate-protobuf so output and package can be specified
2017-06-01 f1dcbf96 Min Cai Upgrade peloton-devel01 to 0.1.0-225-gaf2dafd
2017-06-01 8c8e0c52 Tengfei Mu adding offeroperationsfactory to support reserve/create/launch batch call
2017-06-01 8bfe28f4 Jimmy Wu Deploy cassandra via aurora and change sjc1/dca1-devel01 to use cassandra
2017-06-01 80a83a9a Jimmy Wu Mount cassandra lucene jar and clean up pcluster
2017-06-01 b7634044 Anders Johnsen Return error on lookup, if respool is unitialized
2017-06-01 6a232047 Anders Johnsen Fix panic when labels are empty in mysql job query
2017-05-31 02f45822 evelynl Change mesos master's registry config to replicated_log
2017-05-31 af2dafdc Anders Johnsen Roll yarpc version and use protobuf encoding
2017-06-01 524c8dbe Povilas Versockas Make zk & mesos agent ports configurable in pcluster
2017-05-31 f4ce12a4 Mayank Bansal Fixing test after disabeling recovery
2017-05-30 5eb6c5d8 Mayank Bansal Avoiding tasks which are recovered
2017-05-30 71a7ebed Min Cai Fix enqueue gangs change
2017-05-30 9b84d00d Jimmy Wu Return all jobs if both label and respool are empty in job query request
2017-05-30 98b033bd Jimmy Wu Periodically emit heartbeat metrics from all apps
2017-05-30 5ea5cfe4 Anne Holler Change private interface to resmgr pending queue from list of tasks to list of gang
2017-05-26 f97e1926 Shi Lu Skip the default empty string for query keywords
2017-05-26 640fedf4 Tengfei Mu update volumestore whenever receiving volume resources in offers
2017-05-26 d4923441 Tengfei Mu add tasktype into resmgr task/placement and refactor jobmgr utils
2017-05-26 964f9101 Jimmy Wu Remove redundant mysql migrate code from resmgr main
2017-05-25 9a500871 Tengfei Mu refactor files in jobmgr/task dir to different subpkg
2017-05-24 967d77ed Shi Lu Add implementation of TaskStore.QueryTasks() for c* store
2017-05-25 6fc17074 Jimmy Wu Update pcluster to launch multiple instances for each app
2017-05-25 70f8aa57 Mayank Bansal Removing update status for resmgr as a workaround for T936171
2017-05-24 29f2c4f7 Tengfei Mu add SandboxFileBrowse rpc call in jobmgr
2017-05-24 55b075c7 Mayank Bansal Fixing update root res pool for cluster capacity
2017-05-24 d8d19b36 Zhitao Li Add some metrics to offer pruner.
2017-05-24 0d972bb3 Jimmy Wu Enable m3 for deployment files
2017-05-24 f5f993d0 Shi Lu Adjust c* store batch create task size in rows
2017-05-24 119bd516 Zhitao Li fix lint for script.
2017-05-23 b886e4bb Jimmy Wu [Metrics] Add support of tally multi reporter to optionally emit to M3
2017-05-22 cafa7120 Anne Holler Address go vet failures
2017-05-12 d541da37 Anant Vyas Add ability to increase instance count of a job
2017-05-18 e0a35f3f Shi Lu Add keyword search for jobs
2017-05-19 5709be79 Min Cai Fix timestamp format and add job runtime startTime
2017-05-18 0fc4aad8 Min Cai Print cuda version for GPU container
2017-05-18 293cd568 Tengfei Mu protobuf part of offeroperations api
2017-05-05 4ddf68d3 Zhitao Li Split HostSummary to its own package.
2017-05-18 6e3c8bce Jimmy Wu Upgrade tally to ^3.0.1
2017-05-18 79d7e482 Shi Lu Add fulltext index on jobconfig field.
2017-05-18 901203ec Anne Holler MVP Gang Scheduling Support for Gang Resource Pool Admission Control
2017-05-18 db063960 Min Cai Stop make on protobuf compilation errors
2017-05-17 55eeb6b6 Mayank Bansal Adding state recovery in state machine
2017-05-17 fac94ce7 Jimmy Wu Rename client to cli
2017-05-17 f86cc636 Povilas Versockas Fix protobuf geneneration and import path
2017-05-12 51ba45c8 Min Cai Upgrade sjc1 and irn1 devel01 to 0.1.0-185-g47f26b5
2017-05-16 ec68f0c6 Shi Lu Add optional flag in the db pressure tool to check if update to a row is applied.
2017-05-12 537a25b6 Shi Lu Add capability to query jobs by respool
2017-05-12 47f26b5d Min Cai Add Peloton specific environment variables
2017-05-12 5fdf1326 Jimmy Wu Deploy latest build 0.1.0-183-g605141f to IRN1
2017-05-12 605141fb Jimmy Wu Add integration test framework
2017-05-10 092d904c Shi Lu Add full text query support for jobs by labels in C* store
2017-05-11 5e40194a Mayank Bansal Fixing slaconfig nil issue T865795
2017-05-10 d3096aca Jimmy Wu Remove peloton-master related code and configs
2017-05-11 36c4eb22 Mayank Bansal Removing legacy task queue
2017-05-11 397ceaf4 Mayank Bansal Adding resmgr statemachine implementation with event stream handler
2017-05-09 6e2af68e Jimmy Wu Fix job runtime completion time calculation
2017-05-09 d3ab8534 Min Cai Add an example job file for otto TF training
2017-05-09 2388258c Tengfei Mu allow peloton client to create job with jobid passed in
2017-05-08 8390b868 Jimmy Wu Deploy latest build 0.1.0-172-g556c1a3 to sjc1/dca1/irn1
2017-05-08 e1009078 Tengfei Mu add persistent volumes table and its interface
2017-05-08 556c1a3b Jimmy Wu Deprecate yarpc svc name peloton-master in favor of peloton-jobmgr
2017-05-05 234dc824 Min Cai Upgrade irn1 peloton-devel01 to 0.1.0-170-g9d7994b
2017-05-05 9d7994b5 Min Cai Revert "Enable capabilities to be injected from config."
2017-05-05 f727404b Shi Lu add code to pressure test c*
2017-05-05 3cac2392 Jimmy Wu Deploy latest build 0.1.0-167-g123080e to irn1/dca1
2017-05-04 123080e1 Jimmy Wu Switch all irn1 clusters backend storage back to cassandra
2017-05-04 07a23c2f Tengfei Mu preserve ports and hostname information in jobmgr
2017-05-04 867c6f69 Zhitao Li Enable capabilities to be injected from config.
2017-05-04 e562e77b Jimmy Wu Temporarily switch irn1 clusters to use mysql as storage
2017-05-03 cca9c696 Anne Holler Add minimumRunningInstances
2017-05-02 f0384216 Jimmy Wu Add deployment file for irn1 prepro01 and use cassandra
2017-05-01 633639e8 Zhitao Li Test for hostmgr/mesos package.
2017-05-02 93cc26c5 Ankit Khera adding unit tests for scheduler client + refact client name to reflect intent
2017-05-01 d2330b0f Jimmy Wu Added cassandra store config override
2017-05-02 044fcfde Tengfei Mu proto change for persisting ports and hostname information
2017-04-18 aad08aba Anders Johnsen Add generic Queue and Pool to common/ and use in placement/
2017-05-01 4b55d233 Zhitao Li Fix makefile to make it working for go1.8.
2017-05-01 22e69a55 Ankit Khera refact field name, use peloton.TaskID
2017-05-01 164b66ab Ankit Khera initial photo changes for TaskEvent
2017-05-01 59114bb5 Mayank Bansal Deploying version peloton:0.1.0-152-g44cce80 to irn1 devel
2017-05-01 44cce80f Mayank Bansal Adding Entitlement Calculator
2017-04-27 da7f820e Anant Vyas Rename test resource pool config and update README
2017-04-27 ffe61188 Anant Vyas Use UUID for ResourcePoolID
2017-04-26 40ce8062 Anant Vyas Validate resource config during create
2017-04-27 d14042bf Anant Vyas Use resource pool path for job submission
2017-04-17 ed82de53 Zhitao Li Support task to host constraint.
2017-04-26 48adcd9e Tengfei Mu deep copy object then pass to another thread
2017-04-22 15caba27 Shi Lu Add logic to recover job tasks when jobmgr becomes leader
2017-04-24 2d0a066f Anant Vyas Change client to use respool path for creating resource pool
2017-04-25 71fd36a2 Jimmy Wu Deploy latest build 0.1.0-142-g82cb0de to irn1 peloton-devel01
2017-04-25 82cb0dea Jimmy Wu Use PriorityFIFO as default if SchedulingPolicy is not set in CreateRespoolRequest
2017-04-24 bb8394a7 Tengfei Mu wrap the maxtaskfailure
2017-04-24 23c243e6 Ankit Khera respool info dump
2017-04-24 3e7beaca Tengfei Mu retry scheduling failed tasks
2017-04-21 d7111873 Jimmy Wu Deploy latest build 0.1.0-133-g02407e1 to irn
2017-04-17 2d91001e Anant Vyas Add API implementation for `LookupResourcePoolID`
2017-04-21 92c3a0d8 Mayank Bansal Adding generic State machine
2017-04-20 6f682d4a Jimmy Wu Deploy latest build 0.1.0-133-g02407e1 to sjc1/dca1-devel01
2017-04-21 1dc47427 Zhitao Li Add a hostname to all pcluster containers.
2017-04-21 02407e1c Shi Lu Add null check in job runtime updater.
2017-04-21 cbc3c62e Ankit Khera query respools
2017-04-19 c923085d Tengfei Mu store offers with reserved resource in memory
2017-04-19 b998df02 Shi Lu Add jobruntime updater that periodically update the job runtime
2017-04-19 b46c0efc Ankit Khera Cluster capacity API
2017-04-18 0180b075 Shi Lu Modify job get API to return jobruntime
2017-04-18 e59a795a Jimmy Wu Fix Readme and added missing respool to all job configs
2017-04-18 369295af Anant Vyas Remove double initialization of root resource pool
2017-04-08 88aaee5e Min Cai Symlink peloton logs in container to /var/log/peloton on host
2017-04-17 bc19c6f6 Min Cai Add Aurora job file for peloton-devel01 cluster in IRN1
2017-04-18 8d3aceaa Jimmy Wu Support ENABLE_DEBUG_LOGGING override in pcluster
2017-04-18 b3108d16 Zhitao Li Add one more build tag to resmgr tests.
2017-04-18 68da40fa Anders Johnsen Add tasks query API with partial MySQL implementation
2017-04-18 560fd0bc Ankit Khera GetResourcePool - implement includeChildResPools
2017-04-14 c5bbed6f Anant Vyas Add API to lookup resource pool ID by path
2017-04-13 024b0f88 Zhitao Li Add new metrics for framework id and stream id get.
2017-04-11 52391154 Shi Lu Add job runtime into storage layer
2017-04-12 a8a7afbb Anant Vyas Refactor `restree` to use ResPool interface
2017-04-13 e09d621a Mayank Bansal Recovery from failure for resource Manager
2017-04-13 5a8fb41a Jimmy Wu Make jobmgr properly handle nil jobID in create job request
2017-04-13 bc467b9c Zhitao Li Create test stub to get accurate test coverage data.
2017-04-12 fc2e98e4 Ankit Khera added protobuf definitions for ClusterCapacity
2017-04-13 83cd4c14 Jimmy Wu Extend CLI to support zk based service discovery
2017-04-12 a1889d8b Ankit Khera fix respool creating validate policy
2017-04-10 d2ae0303 Jimmy Wu Enable use_cassandra for AA
2017-04-11 6aff966d Mayank Bansal Adding validation for resource pool
2017-04-11 97e8cf01 Mayank Bansal Adding support for Enqueue to respool and dequeue from ready queue
2017-04-03 feb9728e Anant Vyas Synchronize access to `ResPool`
2017-04-10 d37cf3e8 Jimmy Wu Enable use_cassandra for aa
2017-04-07 0490b1ed Jimmy Wu Deploy 0.1.0-103-g4d2b693 to sjc1/dca-devel01 and aa
2017-04-03 4d2b6938 Zhitao Li Properly handling Mesos disconnection in hostmgr.Server.
2017-04-07 0b27fe1e Mayank Bansal Fixing error not defined issue
2017-04-06 c31dfbfd Shi Lu Add logic to make sure that task list always returns the tasks in job instanceRange
2017-04-06 b6af6c07 Mayank Bansal Adding GetTasksByHosts Api for placement engine to query Resource Manager
2017-04-06 3ac1a83f Shi Lu Update HM/JM/RM to use C* store in pcluster
2017-04-06 8540e475 Ankit Khera upgrade devel cluster peloton 0.1.0-96-gf759f5d
2017-04-06 f003eb2f Min Cai [API] Return child resource pools in GetResourcePool method
2017-04-06 f759f5d6 Ankit Khera fix nil pointer in respool config
2017-04-06 f07778be Jimmy Wu Make task launcher logging less verbose
2017-04-05 497bec51 Jimmy Wu Hook up jobmgr with leader election
2017-04-04 fdb7ffd2 Shi Lu Change job handler to only allow jobID to be UUID
2017-04-05 2773425c Mayank Bansal Fixing respool initialization
2017-04-05 8f21aec0 Tengfei Mu fix initial recon delay
2017-04-05 40bd1c33 Tengfei Mu enable explicit reconciliation periodically
2017-04-05 4325be9a Mayank Bansal Making placement engine multi threaded
2017-04-05 15d48fd0 Mayank Bansal Adding multi goroutine changes for job mger to launch tasks parallely
2017-04-04 47fcc07b Shi Lu Make change in the peloton master config to use c* store
2017-04-03 4121310b Ankit Khera Support create/update APIs
2017-03-30 9344fe9c Zhitao Li Clean up some mhttp and mpb package code and add some metrics.
2017-03-30 6569dc6e Shi Lu Rename config names, no more "stapi"
2017-03-30 39e4ea1c Shi Lu Rename mysql JobStore struct
2017-03-30 e8cb35e4 Casper Kejlberg-Rasmussen Label constraints API take two
2017-03-30 11cbd5a0 Anant Vyas Deploy `peloton:0.1.0-77-gcef015d` to devel01-dca1
2017-03-30 b1316ee2 Tengfei Mu change primary key in tasks and jobs table
2017-03-29 d85039b3 Zhitao Li Add a shared sandbox volume to the two tasks using a producer/consumer pattern.
2017-03-29 2cc405b0 Anant Vyas Deploy `peloton:0.1.0-77-gcef015d` to devel01-sjc1
2017-03-28 cef015dc Ankit Khera adding resource checks when insert/update the tree
2017-03-28 454329cf Jimmy Wu Make jobMgr.placement_dequeue_limit overrideable via env var and add more loggings
2017-03-28 fe94be67 Shi Lu Refactor all old stapi related code into cassandra package
2017-03-27 e02f8a79 Anant Vyas Cleanup Tree interface
2017-03-27 f6525975 Shi Lu Remove stapi.git dependency from glide.yaml
2017-03-27 b2909816 Min Cai Add job states and runtime info in Peloton API
2017-03-24 bd270690 Anant Vyas Register pprof endpoints
2017-03-24 0a97c8eb Shi Lu Step 1 of adding cassandra store apis that can replace STAPI library
2017-03-23 0f68f924 Zhitao Li Upgrade packer based vagrant for peloton.
2017-03-23 59effda8 Yunpeng Liu T777720 - add aurora deployment job config for mobileci01
2017-03-23 3ac9f0dc Zhitao Li Initial support for command based health check in peloton.
2017-03-22 97ca1b88 Mayank Bansal Placement Engine to place tasks and Job Mgr to Launch Tasks
2017-03-22 bc6851ab Ankit Khera respool proto changes to abstract errors, unit tests for create respool
2017-03-22 702b6830 Casper Kejlberg-Rasmussen Fix the pcluster command for Ubuntu Trusty Tahr
2017-03-22 027e6c77 Shi Lu Hookup hostmanger event buffer size to a config value
2017-03-22 b0970e5b Jimmy Wu Release 0.1.0-60-gcc8ab25
2017-03-22 ddb29fa6 Ankit Khera GetResourcePool fixes
2017-03-21 cc8ab256 Shi Lu In taskStateManager, if the circular buffer is full, do not return error
2017-03-21 c1a7c55d Shi Lu attempt 2 on event stream metrics
2017-03-20 3ef80e4d Shi Lu Revert "Add monitoring metrics for event stream handler / client"
2017-03-17 861959ef Shi Lu Add monitoring metrics for event stream handler / client
2017-03-20 e4255c3a Zhitao Li Revert "Revert "Add metrics to offer pool for amount of resources in ready and plac
2017-03-17 c63f07aa Zhitao Li Add an endpoint to overwrite logging level for peloton apps for a duration.
2017-03-15 90880421 Anant Vyas Implement ResourceManager.GetResourcePool
2017-03-16 0bc6392f Zhitao Li Create build tag and new makefile rule for real unit_test.
2017-03-17 6b4b472c Zhitao Li Turn off debug logging.
2017-03-20 e7264e97 Tengfei Mu upgrade peloton with latest fix
2017-03-20 63b97337 Tengfei Mu remove default stop range value
2017-03-17 f1370b23 Shi Lu temp fix for incorrect purgeOffset
2017-03-17 4925123e Tengfei Mu make reconciler start/stop no-op if already started/stopped
2017-03-17 89277081 Zhitao Li Revert "Add metrics to offer pool for amount of resources in ready and placing."
2017-03-17 e4bab672 Shi Lu adjust event stream client logging
2017-03-17 ea5825e2 Zhitao Li Add metrics to offer pool for amount of resources in ready and placing.
2017-03-17 7081f9f6 Tengfei Mu skip kill call to HM if current state is already terminal state
2017-03-17 d2e03f04 Jimmy Wu Bump up test db containers wait time
2017-03-16 c0edbb1a Jimmy Wu Pin hosts for AA/sjc1-devel01/dca1-devel01
2017-03-14 f0240e3b Tengfei Mu enable periodically implicit task reconciliation
2017-03-14 598b4e06 Min Cai Implement resource manager service APIs
2017-03-10 6bd965c7 Zhitao Li Finish kill task API.
2017-03-08 e7f2f47c Shi Lu Add logic in host manager to push status update events to resource manager
2017-03-09 08834292 Zhitao Li Add some tests for hostmgr.handler.
2017-03-09 b6259ae6 Zhitao Li Fix the label so UNS bridge can parse this.
2017-03-09 ab2349ca Jimmy Wu Update deployment docker image with latest build
2017-03-09 13ab1135 Jimmy Wu Persist task error messages and show in task get/list responses
2017-03-09 1250deb8 Tengfei Mu add task reconciliation upon hm gain leadership
2017-03-02 0a21daed Zhitao Li Sync mesos protobuf files to 1.2.x.
2017-03-08 3004b2f5 Zhitao Li Support mockgen through Makefile.
2017-03-08 27fd6adb Tengfei Mu Add tasks stop logic in JM side
2017-03-08 3c55ef6b Shi Lu In the all-in-one peloton master, start the TaskStatusUpdate
2017-03-08 ee13b22e Min Cai Move util/queue.go to common/queue/queue.go
2017-03-03 57401add Shi Lu Move the persist task state change logic into job manager
2017-03-08 bad9667d Jimmy Wu Set up inbounds for Placement engine
2017-03-08 785991bd Min Cai Refactor the main and handlers of master and 4 apps
2017-03-08 6b8e729f Gabe Conradi Fix deb build creating recursive symlinks
2017-03-08 66745a43 Jimmy Wu Fix and refresh armored_armadillo deployment files
2017-03-03 496dbfa2 Zhitao Li Support reservation and port picker.
2017-03-07 52f565c4 Mayank Bansal Adding Enqueue tasks with respoollookup api's
2017-03-06 60b69c5b Jimmy Wu Turn off use_stapi for master
2017-03-06 715bd56b Jimmy Wu Make placement.max_placement_attempts configurable
2017-03-03 148eec12 Ankit Khera adding stapi support for resource pool
2017-03-03 a63b12c9 Min Cai Rename master/cli/main.go to master/main/main.go
2017-03-03 a05dcf15 Ankit Khera changing column name for respools
2017-03-03 0b41d1ea Mayank Bansal Moving api's to use resmgr task
2017-03-01 f6029176 Zhitao Li Clean up HM/PE integration interface.
2017-03-03 4312c58a Shi Lu Add task event stream to host manager and consumer in job manager
2017-03-03 1ae69e9a Zhitao Li Fix status after launch task.
2017-03-02 2b39ed14 Jimmy Wu Add 'pcluster-teardown' rule and support launching peloton via 'make pcluster'
2017-03-03 1f81f378 Mayank Bansal Adding Task Scheduler, Ready, Placing and Placed Queue
2017-03-03 dc353a8d Shi Lu Add GetEventProgress() to EventHandler interface. This value would determine the pu
2017-03-02 057ead21 Shi Lu Add client implementation for task event stream T758091
2017-03-02 f04a5e79 Justin Pinkul Adding Opus configuration
2017-03-02 9fbe3b35 Jimmy Wu Extend pcluster to create cassandra keystore
2017-03-02 16d569f4 Ankit Khera Migration - adding resource pool materialized view to filter based on owner
2017-02-27 f7fa98f3 Zhitao Li Track offers being used in placement.
2017-03-01 8d2bd5dc Jimmy Wu Set CONFIG_DIR for pcluster.peloton-master
2017-03-01 b5bbe070 Jimmy Wu Minor fix in build-docker.sh
2017-03-01 6b5724ae Gabe Conradi document new release tag
2017-03-01 ea851594 Gabe Conradi Make build work for uber build environment without go env
2017-03-01 58710609 Ankit Khera Migration - adding respool table
2017-03-01 6776c32b Jimmy Wu Fix pcluster to comply with the new docker build changes
2017-02-28 5358b626 Gabe Conradi Improve docker, deb building, and jenkins integration
2017-02-28 db4e201c Mayank Bansal Fixing resmgr for passing correct policy
2017-02-24 67e6337b Shi Lu Add RPC handler impl for task update event stream
2017-02-27 89a6bb15 Min Cai Add debug information when building Peloton executables
2017-02-27 723fdbb0 Mayank Bansal Adding Fifo queue in Resource manager for pending Fifo queue
2017-02-27 0d446bd8 Min Cai Add Resource Manager private API
2017-02-24 2f5d9716 Zhitao Li Fix placement engine stop condition and empty UUID ack.
2017-02-24 53a29fac Shi Lu Add protocol buffer for task update event stream service
2017-02-23 89c47f2e Zhitao Li Stop faking gpus in pcluster.
2017-02-23 91ac4c7e Zhitao Li Add more logging to PE/HM integration.
2017-02-23 083c5db1 Zhitao Li Fix right side index.
2017-02-23 f2fa9686 Zhitao Li Add test job for gpu on mesos containerizer.
2017-02-23 0216fe9e Zhitao Li Fix typo on aurora.
2017-02-23 a17cca52 Zhitao Li Set debug level logging and enable prometheus exporter.
2017-02-23 5cc40c3d Shi Lu Add circular buffer implemetation struct
2017-02-22 76379fe2 Zhitao Li Make placement engine use `GetHostOffers` API.
2017-02-22 735533c6 Jimmy Wu Revert "Switch to percona mysql for production deployment"
2017-02-22 0d0d0090 Yunpeng Liu T738875 - add aurora jobs for devel deployment
2017-02-22 85f5dbb4 Jimmy Wu Switch pcluster to use percona mysql
2017-02-22 16538a25 Jimmy Wu Switch to percona mysql for production deployment
2017-02-21 76a720e4 Jimmy Wu Fix broken deployment aurora job file
2017-02-14 58a18746 Zhitao Li Define constraints for GetHostOffersRequest.
2017-02-21 630422b5 Jimmy Wu Hook up master with resmgr
2017-02-17 f2d0ba56 Jimmy Wu Fix all benchmark/job_configs to use new cpulimit config name in resource config
2017-02-21 30408160 Shi Lu Fix compile error
2017-02-20 f456736d Shi Lu Add config for peloton master to optionaly use the stapi store
2017-02-20 775d207b Yunpeng Liu T708751 - update docker image
2017-02-20 255896d3 Yunpeng Liu T708751 - Add cluster specific Aurora scripts for Peloton
2017-02-20 4362e7db Jimmy Wu minor fix indentation for placement production.yaml
2017-02-20 9f361047 Jimmy Wu Make placement engine use hostmgr.LaunchTasks api to launch tasks
2017-02-20 b5f2a0c8 Mayank Bansal Adding priority queue
2017-02-20 b9db625e Min Cai Add cluster specific Aurora scripts for Peloton
2017-02-20 c86a1506 Gabe Conradi make master and resmgr use peer.SmartChooser
2017-02-17 d9774b33 Zhitao Li Use nested message for error.
2017-02-17 46f66f46 Gabe Conradi Use SmartChooser in job manager for host and resource manager peers
2017-02-16 6a054e09 Gabe Conradi Make placement engine use service discovery to find hostmgr and jobmgr
2017-02-15 757e3de6 Zhitao Li Support generate tasks if GPU is specified.
2017-02-15 79083561 Mayank Bansal Adding resource hierarchy
2017-02-13 81b36528 Zhitao Li Support new `InternalHostService.GetHostOffers` API from hostmgr.
2017-02-14 2aa6ad27 Jimmy Wu Extend pcluster to support launching Peloton
2017-02-14 547ab86f Gabe Conradi Create zk leader election and observer intefaces
2017-02-09 40ad3472 Shi Lu Add option in peloton master to use stapi store
2017-02-09 fa1d0d58 Zhitao Li Fix for incorrect parse function.
2017-02-09 a1ed20d9 Jimmy Wu Small fixes for issues caught in Benchmark
2017-02-07 3cc5eef5 Zhitao Li Implement LaunchTasks API in hostmgr.
2017-02-08 3606710f Jimmy Wu [Placement Engine] Switch to resMgr api to dequeue tasks and hostMgr api to get off
2017-02-07 1da373bc Min Cai Add an internal host service API for hostmgr
2017-02-07 66fffbca Shi Lu make sure stapi store implements all the storage store interfaces
2017-02-07 28d83573 Shi Lu Add common function to init root metric scope and server mux
2017-02-06 95b85675 Shi Lu Add config on job manager to talk to resmgr
2017-02-06 3e04f044 Mayank Bansal Moving respool handler under respool
2017-02-06 6e01ce61 Shi Lu Add config for job manager, also hookup taskService APIs to job manager
2017-02-06 6f2e9dd4 Zhitao Li [hostmgr] Store Task status update into storage.
2017-02-06 b88e4db9 Mayank Bansal Moving task queue to resmgr
2017-02-06 6a397355 Jimmy Wu Make cassandra test container listen on different port than pcluster
2017-02-06 054f7fe4 Jimmy Wu Replace 'COMPONENT' with 'APP' in docker entrypoint
2017-02-03 c9c10535 Jimmy Wu Move Placement Engine into separate binary
2017-02-02 7e99859d Shi Lu First attempt to add job mananger binary
2017-02-06 867fd333 Jimmy Wu Update all job configs following "instance specific task config" protobuf changes
2017-02-05 a7adf61d Min Cai Move offerpool/taskqueue APIs to peloton/private
2017-02-05 712d3b58 Min Cai Move public API files to protobuf/peloton/api
2017-02-03 b7fbd0b4 Min Cai Add support for instance specific task config
2017-02-01 4f90e884 Jimmy Wu Fix bug in batch task insertions
2017-02-01 c5c3cc2b Zhitao Li Make hostmgr able to connect to Mesos master as a scheduler.
2017-02-02 e04472bb Mayank Bansal Fixing the configs
2017-02-02 69d168af Zhitao Li Advertise host ip so zookeeper is consistent.
2017-02-02 b5a66a81 Mayank Bansal Adding resource manager as different binary
2017-02-01 4795f72f Zhitao Li Move config parsing to a util function.
2017-02-01 c771a091 Zhitao Li Move master's offer and mesos package into hostmgr.
2017-02-01 dab40920 Jimmy Wu Make JobManager.Query return all jobs if no label is provided
2017-02-01 2385806d Zhitao Li Rename service interface to `HostService`.
2017-02-01 4930520d Zhitao Li fix build caused by conflict migration sequence id.
2017-01-30 12968fdb Shi Lu Make mesos task id unique by adding uuid part of it
2017-01-31 becd7322 Jimmy Wu Extend build-docker script to support ATG release
2017-01-31 6f0c9629 Jimmy Wu Extend docker entrypoint to run binary by component name
2017-01-27 917713d8 Mayank Bansal Adding Resource Manager RPC server , client and sql store
2017-01-27 ff26fb35 Jimmy Wu Increase resource limits for benchmark job configs
2017-01-26 02e23e8f Shi Lu Adjust a timeout
2017-01-26 2748a339 Shi Lu Ack mesos task update in a async manner.
2017-01-23 a5f197d3 Zhitao Li Add initial skeleton of hostmgr API and an empty server.
2017-01-25 5fd4106b Gabe Conradi handle mysql deadlocks with retry
2017-01-25 49cfeaa8 Gabe Conradi fix off by one error on task creation
2017-01-24 9cfd682d Gabe Conradi Batch update/inserts for mysql and stapi
2017-01-23 506a6bc8 Jimmy Wu Add uatc related test job configs
2017-01-23 02bbf2e4 Zhitao Li oneline fixer for shebang.
2017-01-23 865b034c Zhitao Li Add a test job yaml for using unified containerizer.
2017-01-18 959b9662 Shi Lu jumbo offer p1: get offers by # of hosts, and take all offers on that host
2017-01-17 8e27f3e6 Mayank Bansal Adding Resource Manager Api's
2017-01-16 b01ebd4b Shi Lu remove offer consolidation logic
2017-01-13 289a21c4 Shi Lu set max conn life time for mysql connections
2017-01-13 1bf1c2dd Jimmy Wu Add more debug tools to docker image
2017-01-13 f91b397a Shi Lu Set mysql DB max conn according to DB concurrency
2017-01-11 b7e35cdc Shi Lu parallelizing task status update processing
2017-01-11 d0aa33a6 Mayank Bansal Adding pyyaml coomand and fixing read me
2017-01-10 a2f0a8ae Shi Lu revisit the parrellized db create tasks logic
2017-01-10 7ec17e19 Shi Lu peloton master would take more offers from leader each time
2017-01-10 3eb8a65c Anant Vyas Move packaging and pcluster scripts to tools
2017-01-10 8396372d Shi Lu add offer_dequeue_limit config to scheduler
2017-01-05 eb8e8339 Shi Lu Add task state change records tracking based on peloton C* storage
2017-01-05 023fb97a Shi Lu add a log line calculating all the time spent in creating tasks
2017-01-04 2f381df3 Jimmy Wu Fix flag --env in peloton main.go
2017-01-04 387ddee7 Shi Lu Temp work to make the create job logic async.
2017-01-04 e30b6499 Jimmy Wu Fix peloton container entrypoint script to provide --config flag
2017-01-04 4c910f98 Shi Lu Add c* store for storage backend
2017-01-04 2c6ca4f4 Gabe Conradi T670026 add prom metrics exposition endpoint
2016-12-30 cce88021 Gabe Conradi Remove some go-common dependencies and improve master CLI interface
2016-12-29 02324616 Gabe Conradi make test container script block until test containers are listening
2016-12-29 c776345d Gabe Conradi Instrument master and storage layers
2016-12-27 ce71d3d0 Gabe Conradi instrument job and task API with metrics
2016-12-22 02a287d4 Shi Lu start c* container before make test and jenkins test
2016-12-22 6cc7a6f3 Shi Lu Update build deb script to use go 1.7.3
2016-12-21 44ec9ee7 Shi Lu Update yarpc to a version that required by stapi
2016-12-20 ca3817dc Shi Lu fix a typo
2016-12-20 1da0eb86 Shi Lu add frameworkinfo GPU capability plumbing
2016-12-20 aa9543cb Gabe Conradi Refactor peloton CLI
2016-12-19 b1964f97 Anant Vyas Refactor peloton-dev container scripts
2016-12-15 bd8d78ae Jimmy Wu Refactored pcluster to support to launching Cassandra and Peloton containers
2016-12-15 acece86a Anant Vyas Make .tmp delete-able by jenkins
2016-12-15 95981b24 Anant Vyas Remvoe redundant make task
2016-12-14 3fc2f45f Gabe Conradi Update docker scripts to facilitate cross-platform testing
2016-12-05 5da579c5 Anant Vyas Refactor jenkins to run tests inside a peloton container
2016-11-22 3fae81db Shi Lu Adding protocol binary encoding option for mesos requests
2016-12-07 a2b1b5da Min Cai Update peloton.aurora to support leader election
2016-12-08 5ff00eb0 Jimmy Wu port boostrap.sh to python
2016-11-17 8e844823 Anant Vyas Add jenkins target
2016-11-23 a483c0a6 Shi Lu peloton follower should not start the mesos loop ticker
2016-11-22 658f3f85 Jimmy Wu support launching tasks in container
2016-11-21 6674d673 Shi Lu add logic to restart mesos master subscription
2016-11-18 11945afe Shi Lu add logic to reject fragmented offer from same agent
2016-11-17 2ece12da Jimmy Wu fix failed pool_test
2016-11-17 aa8543d7 Jimmy Wu Refactor offerpool APIs
2016-11-17 81db14bc Jimmy Wu Improvement on synchronization mechanism && enable offer pruner
2016-11-16 bac253c0 Jimmy Wu Added OfferPruner for OfferPool
2016-11-10 c9597123 Jimmy Wu Deprecate 'role' flag based leader election code
2016-11-09 af3c0e30 Anant Vyas Add lint and jenkins targets
2016-11-08 78ce9211 Min Cai Don't crash if task is not found
2016-11-03 5b2c8979 Min Cai Make log level configurable via env variable
2016-11-07 044b3333 Min Cai Add aurora job file for deploying Peloton in LMT cluster
2016-11-07 d9040d56 Jimmy Wu Use zkDetector to get leading mesos master addr in leader election
2016-11-04 1d4981e0 Shi Lu Use ZK election to drive initialization of peloton master components.
2016-11-04 e3d93889 Anant Vyas Add test target
2016-11-04 88a5e12a Jimmy Wu Provide ip:port info in leader election
2016-11-04 fed5ebfd Shi Lu Add functionality to switch to different msos master on a single mhttp inbound
2016-11-04 6d205da2 Jimmy Wu Hook up leader election code into main
2016-11-03 efccd12c Min Cai Add configurable scheduler.task_dequeue_limit param
2016-11-02 1fa46f33 Jimmy Wu Recover from panic properly in scheduler
2016-11-02 755be352 Min Cai Disable kafka topic in production.yaml
2016-11-02 696ad6b7 Jimmy Wu install python and fix db port for prod
2016-11-01 b7ebfba0 Jimmy Wu Some enhancements to README
2016-10-31 ce3ee08b Shi Lu Add funtionality to re-fill the taskqueue from db store state. This will be called
2016-11-01 34696603 Jimmy Wu Detect Mesos leading master location from zk
2016-11-01 286cae86 Shi Lu Add function to change the http outbound url
2016-10-31 5711013a Shi Lu Acknowledge task state updates
2016-10-31 3bd1b900 Jimmy Wu Provide Hostname and Principal in mesos inbound PrepareSubscribe() call
2016-10-24 3a468556 Anant Vyas Add leader election code using zookeeper
2016-10-21 8e822eae Min Cai Cleanup logs for demo
2016-10-21 785d66b7 Shi Lu djust some parameters for demo
2016-10-21 16f3be31 Shi Lu add recover code to cope with race condition between scheduler and peloton inbound
2016-10-20 eb76c653 Jimmy Wu Changes to not leak docker images for prod
2016-10-21 b6c46c7c Shi Lu add db config for prod
2016-10-21 f8b13ced Shi Lu Hookup task update with dispatcher
2016-10-21 9b8e9571 Jimmy Wu Add log_dir for Mesos container
2016-10-21 4bd89359 Shi Lu hookup the distributed task queue and offer manager to the scheduler
2016-10-20 6604ef4f Min Cai Add mesos http outbound in yarpc transport
2016-10-20 fc956104 Shi Lu Add impl for peloton task queue RPC implememtation
2016-10-20 928390a0 Min Cai Refactor yarpc initialization in Peloton master
2016-10-20 c8648d8e Min Cai Add offer pool to support multiple Peloton masters
2016-10-19 1adea750 Jimmy Wu Merge peloton-packaging into peloton
2016-10-18 80af3c89 Jimmy Wu Support multi mesos agents in docker dev environment
2016-10-18 5e29f591 Min Cai Add a simple task queue API
2016-10-18 06121aea Min Cai Restructure leader/follower yaml files for pleoton master
2016-10-17 46e93334 Jimmy Wu Removed role whitelisting in mesos master container
2016-10-17 d16a9c6f Shi Lu leader follower logic for peloton master
2016-10-14 5b3b5a34 Jimmy Wu Fix production.yaml
2016-10-14 1c102673 Jimmy Wu fix README format
2016-10-14 46512c33 Jimmy Wu Added script to automate local build
2016-10-07 941afc27 Jimmy Wu Changes to support launching Peloton depdency containers on both Laptop and Dev ser
2016-10-06 b4ae6f77 Jimmy Wu Added scripts to launch dependency containers for Peloton dev/integration tests env
2016-10-03 923ed4b7 Shi Lu Refactoring -- Move launch task code into taskmanager
2016-09-30 69a90f96 Shi Lu Add commandline info as part of jobconfig
2016-09-28 49c0630e Shi Lu Adding offer queue and task queue. Also adding logic to launch tasks when offer is
2016-09-26 682dbb50 Shi Lu Adding a caller class that uses the http.Outbound to invoke mesos functions. The ca
2016-09-22 305017d5 Shi Lu Add an mhttp outbound that can be used to send protocol buffer message to mesos mas
2016-09-19 09d2d160 Shi Lu pretty print the cli response
2016-09-13 8c4181ae Shi Lu temp change the "oneof" usgae in the protocol buffer to unblock the peloton CLI cre
2016-09-15 70bcb42b Jimmy Wu Remove redundant broken imports from glide.yaml
2016-09-13 5faa18f4 Min Cai Add Vagrant and packer scripts for Peloton dev environment
2016-09-12 65818181 Shi Lu fix the master port, use strconv function to change a int to string correctly
2016-09-09 fee98099 Shi Lu temporary skip first 2 args before parsing flags, due to that golang flag parsing m
2016-09-08 2e2ed018 Shi Lu Add peloton client code
2016-09-07 8599c684 Min Cai Add yarpc modules for Mesos HTTP API
2016-09-01 bac15d14 Shi Lu Hookup store interface with peloton master functions
2016-08-29 ed406ee2 Shi Lu minor change
2016-08-29 1b57c110 Shi Lu Add task supprt for the peloton POC storage
2016-08-25 e4c243ce Shi Lu initial commit for peloton storage interface using local mysql
2016-08-07 4ee55e38 Min Cai Upgrade to latest yarpc package
2016-08-07 4b4a2c96 Min Cai Add peloton and mesos/v1 to ignore packages in glide.yaml
2016-07-08 22e75a5d Zhitao Li Add protoc related package into glide.yaml.
2016-06-16 ac8610aa Min Cai Tweak makefile and remove python code 2016-06-16 22732bb3 Min Cai Remove go-build submodule 2016-06-15 a516a2d9 Min Cai Fix wrong tab format 2016-06-14 678d5bfe Min Cai Setup YARPC handlers for Peloton services 2016-06-13 b0a2f1df Min Cai Change Peloton API to use ProtoBuf 2016-06-03 5918a637 Min Cai Add mesos protobuf files 2016-05-27 d58cd4f9 Min Cai Initial draft of the Peloton API 2016-05-24 4fa0890c Min Cai Setup golang environment for Peloton master/scheduler/executor 2016-04-25 4914397f Ashwin Murthy Weaver generated go files for peloton agent and master 2016-01-19 178e1ce1 Min Cai Initialize server with clay-tornado 2016-01-15 86e903c2 Min Cai Add Thrift RPC support for master/agent 2016-01-15 86aa610a Min Cai Add custom pre_build_cmds 2016-01-14 f041075f Min Cai Add custom build command for peloton 2016-01-14 e3e7f870 Min Cai Update the new backend/frontend ports 2016-01-14 38dd52db Min Cai Add pinocchio file for peloton-master 2016-01-08 cd354750 Min Cai Address more comments 2016-01-07 489328f1 Min Cai Address Kristian's comments 2016-01-07 c1e67465 Min Cai Initial thrift API for goal state service and config bundle 2016-01-07 087db452 Min Cai Add .arcconfig file 2016-01-07 d5bfa344 Min Cai Initial commit
Commit: | e346a95 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Add HTTP health check support to Peloton
Summary: Resolves T2574589
Test Plan: Tested using mini cluster. Will add a canary test for this in a subsequent commit after API release.
Reviewers: O521 Project Peloton: Add reviewers, varung
Reviewed By: O521 Project Peloton: Add reviewers, varung
Subscribers: jenkins
Maniphest Tasks: T2574589
Differential Revision: https://code.uberinternal.com/D2551443
Commit: | aabbb22 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add instance workflow events with list job updates
Summary: Add an update limit and a flag to get instance workflow events as part of list job updates.
Reviewers: apoorvaj, zhixin, codyg, kevinxu, O521 Project Peloton: Add reviewers
Reviewed By: zhixin, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Differential Revision: https://code.uberinternal.com/D2509513
Commit: | cc206b6 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Cleanup of v1alpha APIs
Summary:
1. Move job restart related configuration to RestartSpec.
2. Rename ListJobUpdates to ListJobWorkflows, which returns both updates and restarts.
This addresses the leftover review comments from https://code.uberinternal.com/D2320399.
Reviewers: zhixin, min, O521 Project Peloton: Add reviewers
Reviewed By: zhixin, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Differential Revision: https://code.uberinternal.com/D2354195
Commit: | 310a8df | |
---|---|---|
Author: | Adam Backer | |
Committer: | Adam Backer |
Populate reason field for deadline exceeded tasks
Summary: Peloton users would like a way to differentiate between a task that was killed administratively (requested by a user) and one whose timeout has been exceeded.
Parent task: https://code.int.uberatc.com/T262050
Test Plan: Run a test job with a timeout set below its runtime, then check the job status object returned from the Peloton API for a matching `reason:`.
Reviewers: O521 Project Peloton: Add reviewers, zhixin, amitbose
Reviewed By: O521 Project Peloton: Add reviewers, amitbose
Subscribers: avyas, sachins, apoorvaj, jenkins, nick.cobb
Maniphest Tasks: T262050
Differential Revision: https://code.uberinternal.com/D2508517
Commit: | 0444d10 | |
---|---|---|
Author: | Apoorva Jindal | |
Committer: | Apoorva Jindal |
Maintain a map of entity version to task count for stateless jobs in job status
Summary: Resolves T2493281
Test Plan: Tested using minicluster.
Reviewers: zhixin, O521 Project Peloton: Add reviewers
Reviewed By: zhixin, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Maniphest Tasks: T2493281
Differential Revision: https://code.uberinternal.com/D2481169
Commit: | 59fbc9b | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Add create control flags in stateless job create API
Summary: Creating a job now behaves like an update from the user's perspective.
Test Plan:
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless create /DefaultResPool ~/Desktop/testSpec.yaml 1 --start-paused
Job e69a92d3-8a10-425b-8aed-7c2863d8973c created. Entity Version: 1-1-1
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless workflow resume e69a92d3-8a10-425b-8aed-7c2863d8973c 1-1-1
Workflow resumed. New EntityVersion: 1-1-2
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get e69a92d3-8a10-425b-8aed-7c2863d8973c --summaryonly
instance_count: 3
status:
  creation_time: 2019-01-31T23:09:36.1175994Z
  desired_state: JOB_STATE_RUNNING
  pod_stats:
    DELETED: 0
    FAILED: 0
    INITIALIZED: 0
    KILLED: 0
    KILLING: 0
    LAUNCHED: 0
    LAUNCHING: 0
    LOST: 0
    PENDING: 0
    PLACED: 0
    PLACING: 0
    PREEMPTING: 0
    READY: 0
    RUNNING: 0
    STARTING: 0
    SUCCEEDED: 0
    UNKNOWN: 0
  revision:
    created_at: "1548976176117599400"
    updated_at: "1548976176891843100"
    updated_by: ""
    version: "2"
  state: JOB_STATE_PENDING
  version:
    value: 1-1-1
  workflow_status:
    creation_time: ""
    num_instances_completed: 0
    num_instances_failed: 0
    num_instances_remaining: 3
    prev_state: WORKFLOW_STATE_INVALID
    prev_version:
      value: "0"
    state: WORKFLOW_STATE_PAUSED
    type: WORKFLOW_TYPE_UPDATE
    update_time: ""
    version:
      value: "1"
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get e69a92d3-8a10-425b-8aed-7c2863d8973c --summaryonly
instance_count: 3
respool_id:
  value: 51c6dd17-5c2a-4883-b5f7-2c2dc0400170
status:
  creation_time: 2019-01-31T23:09:36.1175994Z
  desired_state: JOB_STATE_RUNNING
  pod_stats:
    PENDING: 1
    RUNNING: 1
  revision:
    created_at: "1548976176117599400"
    updated_at: "1548976304466562900"
    updated_by: ""
    version: "5"
  state: JOB_STATE_RUNNING
  version:
    value: 1-1-2
  workflow_status:
    creation_time: ""
    instances_current:
    - 1
    num_instances_completed: 1
    num_instances_failed: 0
    num_instances_remaining: 2
    prev_state: WORKFLOW_STATE_INVALID
    prev_version:
      value: "0"
    state: WORKFLOW_STATE_ROLLING_FORWARD
    type: WORKFLOW_TYPE_UPDATE
    update_time: ""
    version:
      value: "1"
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless get e69a92d3-8a10-425b-8aed-7c2863d8973c --summaryonly
instance_count: 3
respool_id:
  value: 51c6dd17-5c2a-4883-b5f7-2c2dc0400170
status:
  creation_time: 2019-01-31T23:09:36.1175994Z
  desired_state: JOB_STATE_RUNNING
  pod_stats:
    RUNNING: 3
  revision:
    created_at: "1548976176117599400"
    updated_at: "1548976308950379700"
    updated_by: ""
    version: "7"
  state: JOB_STATE_RUNNING
  version:
    value: 1-1-2
  workflow_status:
    creation_time: ""
    num_instances_completed: 3
    num_instances_failed: 0
    num_instances_remaining: 0
    prev_state: WORKFLOW_STATE_INVALID
    prev_version:
      value: "0"
    state: WORKFLOW_STATE_SUCCEEDED
    type: WORKFLOW_TYPE_UPDATE
    update_time: ""
    version:
      value: "1"
```
Reviewers: apoorvaj, codyg, O521 Project Peloton: Add reviewers
Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Maniphest Tasks: T2164869
Differential Revision: https://code.uberinternal.com/D2467761
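The test plan for this commit shows a create started with `--start-paused` moving through `WORKFLOW_STATE_PAUSED`, `WORKFLOW_STATE_ROLLING_FORWARD`, and `WORKFLOW_STATE_SUCCEEDED` while `num_instances_completed` and `num_instances_remaining` change. A minimal sketch of that progression, assuming a hypothetical `workflow` type and a fixed batch size (neither comes from Peloton's job manager):

```go
package main

import "fmt"

// Workflow state names taken from the CLI output in the test plan.
const (
	statePaused         = "WORKFLOW_STATE_PAUSED"
	stateRollingForward = "WORKFLOW_STATE_ROLLING_FORWARD"
	stateSucceeded      = "WORKFLOW_STATE_SUCCEEDED"
)

// workflow is a hypothetical model of a job create that runs as an update
// workflow; it is not Peloton's job manager code.
type workflow struct {
	state     string
	completed int
	remaining int
}

// newCreateWorkflow starts a create for n instances, paused if requested
// (the equivalent of --start-paused).
func newCreateWorkflow(n int, startPaused bool) *workflow {
	w := &workflow{state: stateRollingForward, remaining: n}
	if startPaused {
		w.state = statePaused
	}
	return w
}

// resume lets a paused workflow continue rolling forward.
func (w *workflow) resume() {
	if w.state == statePaused {
		w.state = stateRollingForward
	}
}

// step brings up to batchSize more instances up; paused or finished
// workflows make no progress. batchSize is an assumed knob.
func (w *workflow) step(batchSize int) {
	if w.state != stateRollingForward {
		return
	}
	n := batchSize
	if n > w.remaining {
		n = w.remaining
	}
	w.completed += n
	w.remaining -= n
	if w.remaining == 0 {
		w.state = stateSucceeded
	}
}

func main() {
	w := newCreateWorkflow(3, true)
	w.step(1) // no effect while paused
	fmt.Println(w.state, w.completed, w.remaining)

	w.resume()
	for w.state == stateRollingForward {
		w.step(1)
		fmt.Println(w.state, w.completed, w.remaining)
	}
}
```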
Commit: | cafb3d1 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Pass in-place update hint between components
Summary: This diff passes hints for in-place update between components (except for the event stream part). The main purpose of the diff is to settle the API of each component; once the API is finalized, the details in each component will be fleshed out.
Current workflow:
1. If in-place is set in UpdateConfig, UpdateRun sets `desiredHost` before the task is killed.
2. When jobmgr kills the task and `desiredHost` is set, jobmgr calls hostmgr KillAndReserveTasks instead of a normal kill.
3. When a task is started and `desiredHost` is set in the task runtime, jobmgr enqueues the task to resmgr with the task runtime.
4. ResMgr receives the enqueue gang request and converts the `DesiredHost` field in the task runtime to the rmTask.
5. PE calls dequeue gangs and creates HostFilters with `DesiredHost` as a hint.
6. The hint is passed as part of `HostFilter` when PE calls `Acquire` on hostmgr.
Test Plan: A minicluster test shows that the hint is successfully passed into hostmgr to acquire offers.
```
{"level":"info","msg":"AcquireHostOffers called.","request":{"filter":{"resourceConstraint":{"minimum":{"cpuLimit":0.1,"memLimitMb":2,"diskLimitMb":10}},"quantity":{"maxHosts":1},"hint":{"hostname":["peloton-mesos-agent1"]}}},"time":"2019-01-29T23:26:35Z"}
{"level":"info","msg":"AcquireHostOffers called.","request":{"filter":{"resourceConstraint":{"minimum":{"cpuLimit":0.1,"memLimitMb":2,"diskLimitMb":10}},"quantity":{"maxHosts":1},"hint":{"hostname":["peloton-mesos-agent1"]}}},"time":"2019-01-29T23:26:35Z"}
{"level":"info","msg":"Task kill request sent","task":{"value":"60bd7666-6f44-4126-ac6e-8090de8a1690-1-1"},"time":"2019-01-29T23:26:37Z"}
```
Reviewers: O521 Project Peloton: Add reviewers, apoorvaj, varung, avyas
Reviewed By: O521 Project Peloton: Add reviewers, apoorvaj
Subscribers: jenkins
Maniphest Tasks: T2466573
Differential Revision: https://code.uberinternal.com/D2458017
Commit: | 25654bd | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Refactor workflow info & workflow status
Summary:
- Deprecate GetJobUpdate API
- Add more update information to workflow info and workflow status, including timestamps, instances added, instances updated, instances removed, and the previous workflow state.
Reviewers: apoorvaj, O521 Project Peloton: Add reviewers, zhixin
Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers, zhixin
Subscribers: zhixin, codyg, jenkins
Differential Revision: https://code.uberinternal.com/D2427485
Commit: | 262c58d | |
---|---|---|
Author: | Amit Bose | |
Committer: | Amit Bose |
Add TerminationStatus to task/pod runtime status
Summary: Define v0 and v1alpha API changes for TerminationStatus
Test Plan: UTs
Reviewers: O521 Project Peloton: Add reviewers, #peloton, apoorvaj, avyas
Reviewed By: O521 Project Peloton: Add reviewers, #peloton, apoorvaj
Subscribers: jenkins
Tags: #peloton
Maniphest Tasks: T2373405
Differential Revision: https://code.uberinternal.com/D2399817
Commit: | 61384ae | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Implement JobService.DeleteJob v1alpha API
Test Plan:
```
$ bin/peloton job stateless create /DefaultResPool example/stateless/testspec.yaml
Job 9535afd0-954d-4eef-b6eb-b443bc3a5113 created. Entity Version: 1-1-1
$ bin/peloton job query
ID| Name| Owner| State| Creation Time| Completion Time| Total| Running| Succeeded| Failed| Killed|
9535afd0-954d-4eef-b6eb-b443bc3a5113| TestSpec| testTeam| RUNNING| 2019-01-04T00:19:21Z| --| 3| 3| 0| 0| 0|
$ bin/peloton job stateless delete 9535afd0-954d-4eef-b6eb-b443bc3a5113 1-1-1
peloton: error: code:aborted message:job is not in terminal state
$ bin/peloton task query 9535afd0-954d-4eef-b6eb-b443bc3a5113
Instance| Name| State| Healthy| Start Time| Run Time| Host| Message| Reason|
0| | RUNNING| DISABLED| 2019-01-04T00:19:25Z| 00:01:06| peloton-mesos-agent2| | |
1| | RUNNING| DISABLED| 2019-01-04T00:19:25Z| 00:01:06| peloton-mesos-agent2| | |
2| | RUNNING| DISABLED| 2019-01-04T00:19:25Z| 00:01:07| peloton-mesos-agent1| | |
$ bin/peloton job stateless delete 9535afd0-954d-4eef-b6eb-b443bc3a5113 1-1-1 -f
Job deleted
$ bin/peloton task query 9535afd0-954d-4eef-b6eb-b443bc3a5113
Job 9535afd0-954d-4eef-b6eb-b443bc3a5113 was not found: Failed to find job with id value:"9535afd0-954d-4eef-b6eb-b443bc3a5113" , err=code:not-found message:job:9535afd0-954d-4eef-b6eb-b443bc3a5113 not found
$ bin/peloton job query
No jobs found.
```
Reviewers: zhixin, O521 Project Peloton: Add reviewers, O2101 Project Peloton: Resource Manager, avyas
Reviewed By: zhixin, O521 Project Peloton: Add reviewers, O2101 Project Peloton: Resource Manager, avyas
Subscribers: avyas, jenkins
Tags: PHID-PROJ-ocgpuyhm2roke5y457li
Maniphest Tasks: T2180673
Differential Revision: https://code.uberinternal.com/D2388633
Commit: | 32eab42 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add GetJobUpdate API
Summary: This change adds a public API to get a job's update with the previous and current job spec. Resolves T2247755
Reviewers: apoorvaj, kevinxu, zhixin, O521 Project Peloton: Add reviewers
Reviewed By: zhixin, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Maniphest Tasks: T2247755
Differential Revision: https://code.uberinternal.com/D2389393
Commit: | e693c53 | |
---|---|---|
Author: | Varun Gupta | |
Committer: | Varun Gupta |
Add implementation to read/write workflow events per instance
Summary:
- Add workflow events to the storage read/write/delete API
- Add workflow event progress per instance
- Add GetWorkflow event API handler and CLI
Sample output: Storage {F99728543}, CLI {F99798527}
Test Plan:
- unit test
- pcluster
- preprod
Reviewers: apoorvaj, O521 Project Peloton: Add reviewers, zhixin
Reviewed By: O521 Project Peloton: Add reviewers, zhixin
Subscribers: kevinxu, avyas, jenkins
Maniphest Tasks: T2247835
Differential Revision: https://code.uberinternal.com/D2371335
Commit: | 168c9e3 | |
---|---|---|
Author: | Zhixin Wen | |
Committer: | Zhixin Wen |
Implement JobSVC.Stop
Summary: Add `stateVersion` and `desiredStateVersion` to the job runtime. `desiredStateVersion` is also used as part of the entity version. When a job is stopped, `desiredStateVersion` is incremented. The goal state engine checks whether `desiredStateVersion` is equal to `stateVersion`. If not, it checks the job runtime goal state and kills all the tasks if the goal state is KILLED. (Job start follows a similar process, with the goal state set to RUNNING.) After all of the tasks have been sent to be killed, `stateVersion` is set to `desiredStateVersion`, so the goal state engine knows that no action is needed if the job is enqueued again.
Test Plan:
```
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless cache d62d9108-f3d2-4fe3-b95a-57a2bcce7db3
spec:
  description: ""
  instance_count: 10
  name: ""
  owner: ""
  owning_team: ""
  respool_id:
    value: 2caf05db-14c8-4fcf-8287-d748b05e2b49
  revision:
    created_at: "1545172526714228362"
    updated_at: "1545172526714228362"
    updated_by: ""
    version: "1"
status:
  creation_time: 2018-12-18T22:35:26.726381081Z
  desired_state: JOB_STATE_RUNNING
  pod_stats:
    DELETED: 0
    FAILED: 0
    INITIALIZED: 0
    KILLED: 0
    KILLING: 0
    LAUNCHED: 0
    LAUNCHING: 0
    LOST: 0
    PENDING: 0
    PLACED: 0
    PLACING: 0
    PREEMPTING: 0
    READY: 0
    RUNNING: 10
    STARTING: 0
    SUCCEEDED: 0
    UNKNOWN: 0
  revision:
    created_at: "1545172526726381081"
    updated_at: "1545172532294071997"
    updated_by: ""
    version: "6"
  state: JOB_STATE_RUNNING
  version:
    value: 1-1-1
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless stop d62d9108-f3d2-4fe3-b95a-57a2bcce7db3
peloton: error: required argument 'entityVersion' not provided, try --help
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless stop d62d9108-f3d2-4fe3-b95a-57a2bcce7db3 1-1-1
Job stopped. New EntityVersion: 1-2-1
(env) zhixin-C02SG079G8WM:peloton zhixin$ ./bin/peloton job stateless cache d62d9108-f3d2-4fe3-b95a-57a2bcce7db3
spec:
  description: ""
  instance_count: 10
  name: ""
  owner: ""
  owning_team: ""
  respool_id:
    value: 2caf05db-14c8-4fcf-8287-d748b05e2b49
  revision:
    created_at: "1545172526714228362"
    updated_at: "1545172526714228362"
    updated_by: ""
    version: "1"
status:
  creation_time: 2018-12-18T22:35:26.726381081Z
  desired_state: JOB_STATE_KILLED
  pod_stats:
    DELETED: 0
    FAILED: 0
    INITIALIZED: 0
    KILLED: 10
    KILLING: 0
    LAUNCHED: 0
    LAUNCHING: 0
    LOST: 0
    PENDING: 0
    PLACED: 0
    PLACING: 0
    PREEMPTING: 0
    READY: 0
    RUNNING: 0
    STARTING: 0
    SUCCEEDED: 0
    UNKNOWN: 0
  revision:
    created_at: "1545172526726381081"
    updated_at: "1545172552138362356"
    updated_by: ""
    version: "10"
  state: JOB_STATE_KILLED
  version:
    value: 1-2-1
```
Reviewers: apoorvaj, O521 Project Peloton: Add reviewers, sachins
Reviewed By: apoorvaj, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Maniphest Tasks: T2346645
Differential Revision: https://code.uberinternal.com/D2372373
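The stop flow in this commit is a versioned goal-state handshake: stop bumps `desiredStateVersion`, and the goal state engine acts only while `desiredStateVersion` differs from `stateVersion`, then sets `stateVersion` to match. A minimal sketch follows. Only the middle entity-version component is confirmed by the test plan (`1-1-1` becomes `1-2-1` on stop); treating the first and third components as config and workflow versions, and rejecting a stale entity version, are assumptions. `jobRuntime`, `stop`, and `reconcile` are hypothetical.

```go
package main

import "fmt"

// jobRuntime is a hypothetical model built from the commit message; the
// field names follow the commit, but this is not Peloton's goal state code.
type jobRuntime struct {
	configVersion       uint64 // assumed first entity-version component
	desiredStateVersion uint64 // middle component (1-1-1 -> 1-2-1 on stop)
	stateVersion        uint64
	workflowVersion     uint64 // assumed third entity-version component
	goalState           string
	tasks               map[int]string // instance ID -> task state
}

func (r *jobRuntime) entityVersion() string {
	return fmt.Sprintf("%d-%d-%d", r.configVersion, r.desiredStateVersion, r.workflowVersion)
}

// stop checks the caller's entity version (an assumed optimistic
// concurrency check), records the new goal state, and bumps
// desiredStateVersion.
func (r *jobRuntime) stop(expected string) (string, error) {
	if expected != r.entityVersion() {
		return "", fmt.Errorf("entity version mismatch: have %s, got %s", r.entityVersion(), expected)
	}
	r.goalState = "KILLED"
	r.desiredStateVersion++
	return r.entityVersion(), nil
}

// reconcile is one goal state engine pass; it reports whether it acted.
func (r *jobRuntime) reconcile() bool {
	if r.desiredStateVersion == r.stateVersion {
		return false // already converged: re-enqueueing the job is a no-op
	}
	if r.goalState == "KILLED" {
		for id := range r.tasks {
			r.tasks[id] = "KILLING" // kill request sent for every task
		}
	}
	r.stateVersion = r.desiredStateVersion
	return true
}

func main() {
	r := &jobRuntime{
		configVersion: 1, desiredStateVersion: 1, stateVersion: 1, workflowVersion: 1,
		goalState: "RUNNING",
		tasks:     map[int]string{0: "RUNNING", 1: "RUNNING"},
	}
	v, err := r.stop("1-1-1")
	fmt.Println(v, err)        // 1-2-1 <nil>
	fmt.Println(r.reconcile()) // true: tasks are sent kills
	fmt.Println(r.reconcile()) // false: nothing left to do
}
```

Comparing two counters instead of re-deriving the work from task states is what makes a repeated enqueue cheap: a second pass sees equal versions and returns immediately.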
Commit: | 04ca9f6 | |
---|---|---|
Author: | Sachin Sharma | |
Committer: | Sachin Sharma |
Implement JobService.CreateJob API
Test Plan:
```
$ bin/peloton job stateless create /DefaultResPool example/stateless/testspec.yaml
Job 9e49f45f-da82-4fe9-b66c-8d031e675f4d created. Entity Version: 1-1
```
```
$ bin/peloton task query 9e49f45f-da82-4fe9-b66c-8d031e675f4d
Instance| Name| State| Healthy| Start Time| Run Time| Host| Message| Reason|
0| | RUNNING| DISABLED| 2018-12-19T21:10:46Z| 00:00:09| peloton-mesos-agent1| | |
1| | RUNNING| DISABLED| 2018-12-19T21:10:46Z| 00:00:09| peloton-mesos-agent1| | |
2| | RUNNING| DISABLED| 2018-12-19T21:10:46Z| 00:00:09| peloton-mesos-agent2| | |
```
Reviewers: zhixin, O521 Project Peloton: Add reviewers
Reviewed By: zhixin, O521 Project Peloton: Add reviewers
Subscribers: adityacb, jenkins, apoorvaj
Tags: #peloton
Maniphest Tasks: T2180551
Differential Revision: https://code.uberinternal.com/D2364019
Commit: | 6738697 | |
---|---|---|
Author: | Kevin Xu | |
Committer: | Kevin Xu |
Support Thermos Executor in Peloton
Summary:
* Populate the `ExecutorInfo` section when launching a task on Mesos.
* When a custom executor is used, the `ContainerInfo` and `CommandInfo` sections are placed inside `ExecutorInfo` when launching the Mesos task.
Reviewers: min, apoorvaj, varung, O521 Project Peloton: Add reviewers, codyg
Reviewed By: apoorvaj, varung, O521 Project Peloton: Add reviewers
Subscribers: jenkins
Differential Revision: https://code.uberinternal.com/D2364363
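The placement rule in this commit can be sketched with simplified stand-ins for the Mesos `TaskInfo` and `ExecutorInfo` messages. Mesos expects a task to carry either a command or an executor, not both, so with a custom executor the command and container move into `ExecutorInfo`. The structs below model only the fields involved and are not the Mesos protobuf bindings; `buildTaskInfo` and the executor ID `thermos-executor` are hypothetical.

```go
package main

import "fmt"

// Minimal stand-ins for the Mesos TaskInfo and ExecutorInfo messages; only
// the fields this commit touches are modeled, and the names are simplified.
type CommandInfo struct{ Value string }
type ContainerInfo struct{ Image string }

type ExecutorInfo struct {
	ExecutorID string
	Command    *CommandInfo
	Container  *ContainerInfo
}

type TaskInfo struct {
	Command   *CommandInfo
	Container *ContainerInfo
	Executor  *ExecutorInfo
}

// buildTaskInfo keeps command and container on the task for the default
// command executor, and nests them in ExecutorInfo when a custom executor
// (such as Thermos) is requested via a non-empty executorID.
func buildTaskInfo(cmd *CommandInfo, ctr *ContainerInfo, executorID string) *TaskInfo {
	if executorID == "" {
		return &TaskInfo{Command: cmd, Container: ctr}
	}
	return &TaskInfo{Executor: &ExecutorInfo{ExecutorID: executorID, Command: cmd, Container: ctr}}
}

func main() {
	cmd := &CommandInfo{Value: "./run.sh"}
	ctr := &ContainerInfo{Image: "example/app:latest"}

	plain := buildTaskInfo(cmd, ctr, "")
	fmt.Println(plain.Command != nil, plain.Executor == nil) // true true

	custom := buildTaskInfo(cmd, ctr, "thermos-executor") // hypothetical executor ID
	fmt.Println(custom.Command == nil, custom.Executor.Container.Image) // true example/app:latest
}
```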