This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

PB-159: remove weights from gRPC messages #298

Merged
merged 12 commits into from
Feb 20, 2020

Conversation

little-dude
Contributor

@little-dude little-dude commented Feb 14, 2020

References:

https://xainag.atlassian.net/browse/PB-159

Needs to be merged along with:

  • https://github.com/xainag/xain-proto/pull/25
  • https://github.com/xainag/xain-sdk/pull/88

Summary:

Remove the weights from the gRPC messages. From now on, weights will
be exchanged via S3 buckets.

The sequence diagram below illustrates this new behavior.

At the beginning of a round (1), the selected participants send a
StartTrainingRoundRequest, and the coordinator responds with the
same StartTrainingRoundResponse as before, except that it no longer
contains the global weights.

Instead, the participant fetches these weights from the store (2). S3
buckets are key-value stores, and the key for the global weights is
derived from the round number (round/global in the diagram below).

Then, the participant trains. Once done, it uploads its local weights
to the S3 bucket (3). The key is <round_number>/<participant_id>.
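
A minimal sketch of this key layout, assuming keys are plain strings. The function names `local_weights_key` and `global_weights_key` are hypothetical and not taken from the code base; the `global` suffix follows the sequence diagram in this description.

```python
def local_weights_key(round_number: int, participant_id: str) -> str:
    """Key under which a participant uploads its local weights for a round."""
    return f"{round_number}/{participant_id}"


def global_weights_key(round_number: int) -> str:
    """Key under which the global weights for a round are stored.

    Assumption: the "global" suffix mirrors the key shown in the diagram.
    """
    return f"{round_number}/global"
```

For example, `local_weights_key(3, "abc")` gives `3/abc`, so all objects for one round share the same prefix.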

Finally (4), the participant sends its EndTrainingRoundRequest. Before
answering, the coordinator retrieves the local weights the participant
has uploaded.

Important note: At the moment, the participants don't know their
ID, because the coordinator does not send it to them. Thus, they
currently generate a random ID when they start, and send it to the
coordinator so that it can retrieve the participant's weights. This is
why the EndTrainingRoundRequest currently has a participant_id
field.
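
As an illustration of the note above, here is a sketch of a participant that picks its own random ID at start-up. It assumes a UUID4 string is an acceptable ID; the `Participant` class and the dictionary standing in for the request message are hypothetical, not the actual gRPC types.

```python
import uuid


class Participant:
    """Illustrative participant that generates its own ID on start-up."""

    def __init__(self) -> None:
        # Assumption: a random UUID4 is unique enough to serve as an ID,
        # since the coordinator does not assign one.
        self.participant_id = str(uuid.uuid4())

    def end_training_round_payload(self) -> dict:
        # Stand-in for EndTrainingRoundRequest: the ID is included so that
        # the coordinator can locate the uploaded local weights.
        return {"participant_id": self.participant_id}
```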

    P                                C                      Store
1.  |   StartTrainingRoundRequest    |                        |
    | -----------------------------> |                        |
    |   StartTrainingRoundResponse   |                        |
    | <----------------------------- |                        |
    |                                |                        |
    |                Get global weights (key="round/global")  |
2.  | ------------------------------------------------------> |
    |                         Global weights                  |
    | <------------------------------------------------------ |
    |                                |                        |
    | [train...]                     |                        |
    |                                |                        |
3.  |       Set local weights (key="round/participant")       |
    | ------------------------------------------------------> |
    |                               Ok                        |
    | <------------------------------------------------------ |
    |                                |                        |
4.  |   EndTrainingRoundRequest      |                        |
    | -----------------------------> | Get local weights (key="round/participant")
    |                                | ---------------------> |
    |                                | Local weights          |
    |  EndTrainingRoundResponse      | <--------------------- |
    | <----------------------------- |                        |
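
The participant side of steps 2 to 4 can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for the S3 bucket, `train` is any function mapping global weights to local weights, and the returned dictionary stands in for the EndTrainingRoundRequest.

```python
from typing import Callable, Dict, List

Weights = List[float]


class InMemoryStore:
    """Stand-in for the S3 bucket: a plain key-value mapping (assumption)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Weights] = {}

    def put(self, key: str, weights: Weights) -> None:
        self._objects[key] = list(weights)

    def get(self, key: str) -> Weights:
        return list(self._objects[key])


def run_participant_round(
    store: InMemoryStore,
    round_number: int,
    participant_id: str,
    train: Callable[[Weights], Weights],
) -> Dict[str, str]:
    # Step 2: fetch the global weights for this round.
    global_weights = store.get(f"{round_number}/global")
    # Train locally, starting from the global weights.
    local_weights = train(global_weights)
    # Step 3: upload the local weights under "round/participant".
    store.put(f"{round_number}/{participant_id}", local_weights)
    # Step 4: the request only carries the ID, not the weights.
    return {"participant_id": participant_id}
```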

At the end of the round, the coordinator writes the new global weights
to the S3 bucket, using the next round number as key (see the sequence
diagram below).

P                                C                      Store
|   EndTrainingRoundRequest      |                        |
| -----------------------------> | Get local weights (key="round/participant")
|                                | ---------------------> |
|                                | Local weights          |
|  EndTrainingRoundResponse      | <--------------------- |
| <----------------------------- |                        |
|                                |                        |
|                                | Set global weights (key="round+1/global")
|                                | ---------------------> |
|                                | Ok                     |
|                                | <--------------------- |
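
A sketch of the coordinator side at the end of a round. Several things here are assumptions: the store is modelled as a plain dictionary, the global key follows the `round/global` pattern from the first diagram, and plain element-wise averaging is used only as a placeholder for whatever aggregation the coordinator actually performs. The name `end_round` is hypothetical.

```python
from typing import Dict, List

Weights = List[float]


def end_round(
    store: Dict[str, Weights],
    round_number: int,
    participant_ids: List[str],
) -> str:
    # Retrieve the local weights each participant uploaded for this round.
    local_weights = [store[f"{round_number}/{pid}"] for pid in participant_ids]
    # Placeholder aggregation (assumption): element-wise mean.
    aggregated = [sum(values) / len(values) for values in zip(*local_weights)]
    # Write the result under the next round number.
    next_key = f"{round_number + 1}/global"
    store[next_key] = aggregated
    return next_key
```

Writing under the next round number means that participants selected for round N+1 find their starting weights at a key they can compute from the round number alone.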

Implementation notes:

  • Initially, we thought we would use different buckets for the
    local and global weights. For now, however, we use the same
    bucket for both.

  • We currently store the global weights under different keys. It turns
    out that this brings unnecessary complexity, so we'll probably
    simplify this in the future.

  • For now, the coordinator doesn't send any storage information to the
    participants. Thus, the participants need to be configured with the
    storage information. In the future, the StartTrainingRoundResponse
    could contain the endpoint URL, bucket name, etc.
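
To illustrate the last note, a participant-side storage configuration might look like the sketch below. The `StorageConfig` class, its fields and the environment variable names are all hypothetical; they only show the kind of information a participant currently has to be given out of band.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StorageConfig:
    """Storage settings a participant needs to reach the bucket."""

    endpoint_url: str
    bucket: str
    access_key_id: str
    secret_access_key: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "StorageConfig":
        # Assumption: these variable names are made up for this example.
        return cls(
            endpoint_url=env["STORAGE_ENDPOINT_URL"],
            bucket=env["STORAGE_BUCKET"],
            access_key_id=env["STORAGE_ACCESS_KEY_ID"],
            secret_access_key=env["STORAGE_SECRET_ACCESS_KEY"],
        )
```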


Reviewer checklist

Reviewer agreement:

  • Reviewers assign themselves at the start of the review.
  • Reviewers do not commit or merge the merge request.
  • Reviewers have to check and mark items in the checklist.

Merge request checklist

  • Conforms to the merge request title naming XP-XXX <a description in imperative form>.
  • Each commit conforms to the naming convention XP-XXX <a description in imperative form>.
  • Linked the ticket in the merge request title or the references section.
  • Added an informative merge request summary.

Code checklist

  • Conforms to the branch naming XP-XXX-<a_small_stub>.
  • Passed scope checks.
  • Added or updated tests if needed.
  • Added or updated code documentation if needed.
  • Conforms to Google docstring style.
  • Conforms to XAIN structlog style.

little-dude added a commit that referenced this pull request Feb 14, 2020
References:

https://xainag.atlassian.net/browse/PB-159

Needs to be merged along with:

- https://github.com/xainag/xain-proto/pull/25
- https://github.com/xainag/xain-sdk/pull/88
- #298

@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from 218ecc2 to 4f17e1f Compare February 14, 2020 14:53
little-dude added a commit that referenced this pull request Feb 14, 2020
@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from 4f17e1f to 14304a4 Compare February 14, 2020 15:14
little-dude added a commit that referenced this pull request Feb 17, 2020
@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from 752a2f7 to 573dc12 Compare February 17, 2020 10:19
little-dude added a commit that referenced this pull request Feb 17, 2020

handle review comments
@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from 7dabc27 to a114952 Compare February 17, 2020 13:45
little-dude added a commit that referenced this pull request Feb 17, 2020
@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from a114952 to fddbf56 Compare February 17, 2020 13:48
@atymoshchuk
Contributor

Please rebase this branch

@little-dude
Contributor Author

> Please rebase this branch

I'll probably wait a little because there are other PRs in the pipe that will create more conflicts. The PR is reviewable as it is anyway.

little-dude added a commit that referenced this pull request Feb 18, 2020
References:

https://xainag.atlassian.net/browse/PB-159

Needs to be merged along with:

- https://github.com/xainag/xain-proto/pull/25
- https://github.com/xainag/xain-sdk/pull/88
- #298

Summary:

Remove the weights from the gRPC messages. From now on, weights will
be exchanged via s3 buckets.

The sequence diagram below illustrate this new behavior.

At the beginning of a round (1) the selected participants send a
`StartTrainingRound` request, and the coordinator response with the
same `StartTrainingRoundResponse` that does not contain the global
weights anymore.

Instead, the participant fetches these weights from the store (2). S3
buckets are key-value stores, and the key for global weights is the
round number.

Then, the participant trains. Once done, it uploads its local weights
to the S3 bucket (3). The key is `<round_number>/<participant_id>`.

Finally (4), the participant sends it's `EndTrainingRequest`. Before
answering, the coordinator retrieves the local weights the participant
has uploaded.

_**Important note**: At the moment, the participants don't know their
ID, because the coordinator does not send it to them. Thus, they
currently generate a random ID when they start, and send it to the
coordinator so that it can retrieve the participant's weights. This is
why the `EndTrainingRoundRequest` currently has a `participant_id`
field._

```
    P                                C                      Store
1.  |   StartTrainingRoundRequest    |                        |
    | -----------------------------> |                        |
    |   StartTrainingRoundResponse   |                        |
    | <----------------------------- |                        |
    |                                |                        |
    |                Get global weights (key="round/global")  |
2.  | ------------------------------------------------------> |
    |                         Global weights                  |
    | <------------------------------------------------------ |
    |                                |                        |
    | [train...]                     |                        |
    |                                |                        |
3.  |       Set local weights (key="round/participant")       |
    | ------------------------------------------------------> |
    |                               Ok                        |
    | <------------------------------------------------------ |
    |                                |                        |
4.  |   EndTrainingRoundRequest      |                        |
    | -----------------------------> | Get local weights (key="round/participant")
    |                                | ---------------------> |
    |                                | Local weights          |
    |  EndTrainingRoundResponse      | <--------------------- |
    | <----------------------------- |                        |
```
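
As a rough illustration of steps 2 to 4, here is a minimal Python sketch. It is not the code from this PR: `InMemoryStore` is a hypothetical stand-in for the S3 bucket, the helper names are made up, the "training" step is a placeholder, and the key layout (`<round>/global` and `<round>/<participant_id>`) is an assumption taken from the diagram above.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for an S3 bucket: a plain key-value store."""

    def __init__(self) -> None:
        self._objects: Dict[str, List[float]] = {}

    def put(self, key: str, weights: List[float]) -> None:
        self._objects[key] = list(weights)

    def get(self, key: str) -> List[float]:
        return list(self._objects[key])


def global_key(round_number: int) -> str:
    # Assumed key layout, following the diagram: "<round>/global"
    return f"{round_number}/global"


def local_key(round_number: int, participant_id: str) -> str:
    # Assumed key layout: "<round_number>/<participant_id>"
    return f"{round_number}/{participant_id}"


def participant_round(store: InMemoryStore, round_number: int, participant_id: str) -> str:
    # Step 2: fetch the global weights from the store
    global_weights = store.get(global_key(round_number))
    # Placeholder for real training: shift every weight by 1.0
    local_weights = [w + 1.0 for w in global_weights]
    # Step 3: upload the local weights
    key = local_key(round_number, participant_id)
    store.put(key, local_weights)
    return key


def coordinator_on_end_training(
    store: InMemoryStore, round_number: int, participant_id: str
) -> List[float]:
    # Step 4: the coordinator retrieves what the participant uploaded,
    # using the participant_id carried by the EndTrainingRoundRequest
    return store.get(local_key(round_number, participant_id))
```

The participant ID has to travel with the request in step 4 because the coordinator cannot otherwise reconstruct the key for the local weights.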

At the end of the round, the coordinator writes the new global weights
to the S3 bucket, using the next round number as key (see the sequence
diagram below).

```
P                                C                      Store
|   EndTrainingRoundRequest      |                        |
| -----------------------------> | Get local weights (key="round/participant")
|                                | ---------------------> |
|                                | Local weights          |
|  EndTrainingRoundResponse      | <--------------------- |
| <----------------------------- |                        |
|                                |                        |
|                                | Set global weights (key="round+1/global")
|                                | ---------------------> |
|                                | Ok                     |
|                                | <--------------------- |
```
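
The end-of-round write could look roughly like the sketch below. This is an illustration under assumptions, not the coordinator's actual code: a plain `dict` stands in for the bucket, `aggregate` is a placeholder element-wise mean (the real aggregation may differ), and the `<round>/global` key layout is assumed from the first diagram.

```python
from typing import Dict, List


def aggregate(local_weights: List[List[float]]) -> List[float]:
    """Placeholder aggregation: element-wise mean of all local weights."""
    count = len(local_weights)
    return [sum(column) / count for column in zip(*local_weights)]


def finish_round(
    store: Dict[str, List[float]], round_number: int, participant_ids: List[str]
) -> str:
    # Read every participant's local weights for the current round
    updates = [store[f"{round_number}/{pid}"] for pid in participant_ids]
    # Write the new global weights under the key of the next round,
    # so participants selected for that round can fetch them
    next_key = f"{round_number + 1}/global"
    store[next_key] = aggregate(updates)
    return next_key
```

Writing under the next round number means participants only ever need the current round number to find the weights they should start from.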

Implementation notes:

- Initially, we thought we would be using different buckets for the
  local and global weights. For now, we use the same bucket for
  both.

- We currently store the global weights under different keys. It turns
  out that this brings unnecessary complexity, so we'll probably
  simplify this in the future.

- For now, the coordinator doesn't send any storage information to the
  participants. Thus, the participants need to be configured with the
  storage information. In the future, the `StartTrainingRoundResponse`
  could contain the endpoint URL, bucket name, etc.
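
To make the last note concrete, here is one way participant-side storage settings could be modelled. Everything in it is hypothetical: the `StorageConfig` class, its fields, and the `STORAGE_ENDPOINT_URL` / `STORAGE_BUCKET` variable names are illustrative and are not taken from xain-fl or xain-sdk.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical storage settings a participant needs to reach the bucket."""

    endpoint_url: str
    bucket: str


def storage_config_from_env(env: Mapping[str, str] = os.environ) -> StorageConfig:
    # Variable names are illustrative; a missing variable raises KeyError
    return StorageConfig(
        endpoint_url=env["STORAGE_ENDPOINT_URL"],
        bucket=env["STORAGE_BUCKET"],
    )
```

If the coordinator later sends this information in the `StartTrainingRoundResponse`, the same structure could be filled from the response instead of from the local configuration.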
@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from fddbf56 to 1770099 Compare February 18, 2020 08:53
atymoshchuk
atymoshchuk previously approved these changes Feb 18, 2020
janpetschexain
janpetschexain previously approved these changes Feb 19, 2020
finiteprods
finiteprods previously approved these changes Feb 19, 2020
@finiteprods finiteprods left a comment

looks good, just a few minor issues / questions. on a more general note, it does not look to me too difficult to have direct weights transfer (the previous behaviour) as just a "special case" of the present one. But this can be a separate issue.

@little-dude little-dude force-pushed the PB-159-use-s3-for-transfering-weights branch from 952a0ee to bb61423 Compare February 20, 2020 08:16
@little-dude
I addressed the reviews and rebased.

@finiteprods finiteprods left a comment

all good, thanks.

@rsaffi
rsaffi commented Feb 20, 2020

Regarding the changes in setup.py: after all the PRs related to removing weights from gRPC messages on xain-fl, xain-sdk and xain-proto are merged, the setup.py on all should actually be updated to point to development branch, correct?

@little-dude little-dude merged commit 05c150b into development Feb 20, 2020
@little-dude little-dude deleted the PB-159-use-s3-for-transfering-weights branch February 20, 2020 08:51
@rsaffi
rsaffi commented Feb 20, 2020

Regarding the changes in setup.py: after all the PRs related to removing weights from gRPC messages on xain-fl, xain-sdk and xain-proto are merged, the setup.py on all should actually be updated to point to development branch, correct?

Ah, I just saw your message on Slack already covering this. 🥇

little-dude added a commit that referenced this pull request Feb 20, 2020
little-dude added a commit that referenced this pull request Feb 20, 2020
* PB-159: update xain-{proto,sdk} dependencies to the right branch

Follow-up of #298

* PB-159: fix broken tests

This test broke due to a change in xain-sdk.

Ref: https://github.com/xainag/xain-sdk/pull/91/
little-dude added a commit that referenced this pull request Feb 25, 2020
* PB-159: remove weights from gRPC messages

References:

https://xainag.atlassian.net/browse/PB-159

Needs to be merged along with:

- https://github.com/xainag/xain-proto/pull/25
- https://github.com/xainag/xain-sdk/pull/88
- #298
little-dude added a commit that referenced this pull request Feb 25, 2020