
[Merged by Bors] - fix: release batches lock #2840

Closed

Conversation

@davidbeesley (Contributor) commented Dec 1, 2022

Scopes the mutex guard in `PartitionProducer` so that the mutex is released immediately instead of being held while batches are sent.

This means that `TopicProducer::send()` returns quickly instead of blocking while batches are sent.

This change gives almost a 10x improvement in producer throughput when producing to the cloud. Note that the latency regression is because all messages now get "sent" right away, whereas previously messages blocked on `send` while waiting for other batches to go out.
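For context, the fix boils down to the classic scoped-guard pattern. A minimal sketch (using a synchronous `std::sync::Mutex` for brevity, with illustrative names like `pending` and `send_batches` rather than the actual Fluvio internals, which are async):

```rust
use std::sync::Mutex;

struct Batch; // placeholder for a real record batch

struct PartitionProducer {
    pending: Mutex<Vec<Batch>>, // queue of batches waiting to be sent
}

impl PartitionProducer {
    fn flush(&self) {
        // Inner scope: the guard is dropped as soon as the batches
        // have been moved out of the shared queue...
        let ready = {
            let mut guard = self.pending.lock().unwrap();
            std::mem::take(&mut *guard)
        }; // <- guard dropped here, lock released

        // ...so the (potentially slow) network send happens without
        // holding the lock, and `send()` callers are not blocked.
        send_batches(ready);
    }
}

fn send_batches(_batches: Vec<Batch>) {
    // network I/O would happen here, outside the critical section
}
```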

**Config**

```yaml
matrix_name: ExampleMatrix
topic_name: benchmarking-baingmcvadyojtn
current_profile: cloud
timestamp: "2022-12-01T00:37:15.840379872Z"
worker_timeout:
  secs: 10000
  nanos: 0
num_samples: 100
duration_between_samples:
  secs: 0
  nanos: 300000000
num_records_per_producer_worker_per_batch: 1000
producer_batch_size: 16000
producer_queue_size: 100
producer_linger:
  secs: 0
  nanos: 10000000
producer_server_timeout:
  secs: 5
  nanos: 0
producer_compression: none
consumer_max_bytes: 64000
num_concurrent_producer_workers: 1
num_concurrent_consumers_per_partition: 1
num_partitions: 1
record_size: 1000
record_key_allocation_strategy: NoKey
```

**Per Record E2E Latency**

| Variable | p0.00 | p0.50 | p0.95 | p0.99 | p1.00 |
|----------|-------|-------|-------|-------|-------|
| Latency | 119.808ms | 622.591ms | 924.671ms | 1.005567s | 1.060863s |

**Throughput (Total Produced Bytes / Time)**

| Variable | Min | Median | Max | Description |
|----------|-----|--------|-----|-------------|
| Producer Throughput | 2.165mb/s | 2.658mb/s | 3.052mb/s | First Produced Message <-> Last Produced Message |
| Consumer Throughput | 1.154mb/s | 1.389mb/s | 1.563mb/s | First Consumed Message (First Time Consumed) <-> Last Consumed Message (First Time Consumed) |
| Combined Throughput | 0.942mb/s | 0.989mb/s | 1.137mb/s | First Produced Message <-> Last Consumed Message (First Time Consumed) |

**Comparison with previous results: cloud @ 2022-12-01 00:24:33.995062094 UTC**

| Variable | Change | Previous | Current | P-Value |
|----------|--------|----------|---------|---------|
| Latency | Worse | 405.427ms | 1.003788s | 0.00000 |
| Producer Throughput | Better | 0.288mb/s | 2.626mb/s | 0.00000 |
| Consumer Throughput | Better | 0.249mb/s | 1.384mb/s | 0.00000 |
| Combined Throughput | Better | 0.240mb/s | 0.996mb/s | 0.00000 |

@sehz (Contributor) left a comment

Is there a unit test?

@davidbeesley (Contributor, Author) commented Dec 1, 2022

> Is there a unit test?

I don't know if there was one already, but I didn't add a new one. The diff looks much more complicated than the change is; I just added `{ }` around the code that actually uses the guard.

I don't think this change benefits from a unit test, because the compiler guarantees there are no direct changes: the guard was never used outside of the new scope. If there is a weird side effect on code running in another thread that was previously being prevented by holding the mutex longer, a unit test won't catch it. Regardless, this code path is covered by our end-to-end tests.

@davidbeesley (Contributor, Author) commented

To clarify, batches were already copied out of the mutex-protected queue into a local `Vec` before sending. It's just that the lock on the queue was also held while the batches were sent.
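In other words, only the point where the guard is dropped moved. A hedged before/after sketch (again with synchronous `std` types and made-up names, not the real async producer code):

```rust
use std::sync::Mutex;

type Batch = Vec<u8>; // stand-in for a real batch type

// Before: the guard lived until the end of the function, so the
// queue stayed locked for the whole duration of the network send.
fn flush_before(queue: &Mutex<Vec<Batch>>) {
    let mut guard = queue.lock().unwrap();
    let local: Vec<Batch> = guard.drain(..).collect(); // already copied out
    send(local); // ...but the lock is still held here
} // guard only dropped here

// After: braces end the guard's lifetime right after the copy.
fn flush_after(queue: &Mutex<Vec<Batch>>) {
    let local: Vec<Batch> = {
        let mut guard = queue.lock().unwrap();
        guard.drain(..).collect()
    }; // guard dropped; producers can enqueue again immediately
    send(local);
}

fn send(_batches: Vec<Batch>) { /* network I/O */ }
```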

@morenol (Contributor) left a comment

LGTM

@davidbeesley (Contributor, Author) commented

bors r+

bors bot pushed a commit that referenced this pull request Dec 1, 2022

bors bot commented Dec 1, 2022

Pull request successfully merged into master.

Build succeeded.

bors bot changed the title fix: release batches lock → [Merged by Bors] - fix: release batches lock Dec 1, 2022
bors bot closed this Dec 1, 2022