Optimize the performance of GetTS #1847

Open
nolouch opened this issue Oct 25, 2019 · 11 comments
Labels
difficulty/hard Hard task. type/enhancement The issue or PR belongs to an enhancement.

Comments

@nolouch
Contributor

nolouch commented Oct 25, 2019

Description

Obtaining a TSO is a key step in a transaction, so improving the performance of GetTS can improve the performance of the entire cluster. One of the more obvious problems is that when the client is busy, getting a TSO has a significant long-tail problem. We have a benchmark client, bench-tso, in the PD project, and you can run it with 10000 concurrency. Example output:

count:947492, max:50, min:8, >1ms:0, >2ms:0, >5ms:478043, >10ms:459063, >30ms:10386
count:982164, max:13, min:8, >1ms:0, >2ms:0, >5ms:487404, >10ms:494760, >30ms:0
count:973954, max:13, min:8, >1ms:0, >2ms:0, >5ms:385329, >10ms:588625, >30ms:0
count:972821, max:14, min:8, >1ms:0, >2ms:0, >5ms:377676, >10ms:595145, >30ms:0
count:975413, max:14, min:8, >1ms:0, >2ms:0, >5ms:412231, >10ms:563182, >30ms:0
count:941196, max:21, min:8, >1ms:0, >2ms:0, >5ms:303737, >10ms:637459, >30ms:0
count:961796, max:14, min:8, >1ms:0, >2ms:0, >5ms:369033, >10ms:592763, >30ms:0
count:973593, max:14, min:8, >1ms:0, >2ms:0, >5ms:418633, >10ms:554960, >30ms:0
count:982489, max:13, min:8, >1ms:0, >2ms:0, >5ms:472591, >10ms:509898, >30ms:0
count:931930, max:61, min:8, >1ms:0, >2ms:0, >5ms:448291, >10ms:471571, >30ms:12068
count:979887, max:13, min:8, >1ms:0, >2ms:0, >5ms:419038, >10ms:560849, >30ms:0
count:973117, max:25, min:8, >1ms:0, >2ms:0, >5ms:478479, >10ms:494638, >30ms:0
count:984385, max:13, min:8, >1ms:0, >2ms:0, >5ms:483516, >10ms:500869, >30ms:0
count:939342, max:19, min:8, >1ms:0, >2ms:0, >5ms:280159, >10ms:659183, >30ms:0
count:935988, max:32, min:8, >1ms:0, >2ms:0, >5ms:302658, >10ms:633283, >30ms:47
count:991155, max:14, min:8, >1ms:0, >2ms:0, >5ms:593176, >10ms:397979, >30ms:0
count:984702, max:13, min:8, >1ms:0, >2ms:0, >5ms:422315, >10ms:562387, >30ms:0
count:957891, max:31, min:8, >1ms:0, >2ms:0, >5ms:394925, >10ms:562376, >30ms:590
count:976569, max:13, min:8, >1ms:0, >2ms:0, >5ms:441844, >10ms:534725, >30ms:0
count:974835, max:24, min:8, >1ms:0, >2ms:0, >5ms:459893, >10ms:514942, >30ms:0
count:986326, max:14, min:8, >1ms:0, >2ms:0, >5ms:477202, >10ms:509124, >30ms:0
count:978978, max:13, min:8, >1ms:0, >2ms:0, >5ms:444974, >10ms:534004, >30ms:0
count:975067, max:13, min:8, >1ms:0, >2ms:0, >5ms:404418, >10ms:570649, >30ms:0
count:971453, max:21, min:8, >1ms:0, >2ms:0, >5ms:448108, >10ms:523345, >30ms:0
count:961049, max:23, min:8, >1ms:0, >2ms:0, >5ms:410778, >10ms:550271, >30ms:0
count:983659, max:13, min:8, >1ms:0, >2ms:0, >5ms:423110, >10ms:560549, >30ms:0
count:951750, max:30, min:8, >1ms:0, >2ms:0, >5ms:397103, >10ms:554362, >30ms:285

Requests that take longer than 30ms occur periodically.
Goals:

  • Improve the long-tail problem in TiDB
  • Improve the ability to acquire TS (more TPS)

Difficulty

  • Hard

Score

  • 20000

Mentor(s)

Recommended Skills

  • Go
@nolouch nolouch added type/enhancement The issue or PR belongs to an enhancement. difficulty/hard Hard task. PCP-S1 labels Oct 25, 2019
@breezewish breezewish changed the title PCP: Optimize the performance of GetTS PCP-26: Optimize the performance of GetTS Oct 28, 2019
@you06 you06 changed the title PCP-26: Optimize the performance of GetTS Optimize the performance of GetTS Feb 29, 2020
@mgkanani
Contributor

@nolouch
I tried to reproduce this locally, but couldn't.
Below are the steps I followed:

<pd-root-dir-location>$ make
<pd-root-dir-location>$ make pd-tso-bench
<pd-root-dir-location>$ bin/pd-server

on different terminal:

<pd-root-dir-location>$ bin/pd-tso-bench -C=10000

I tried with two different Go versions, i.e. 1.13.3 and 1.12.1.

Can you specify the RAM and number of CPUs used for the setup? I will try to reproduce it in a container with resource limits.

@nolouch
Contributor Author

nolouch commented Mar 12, 2020

Hi @mgkanani, thank you for your attention.
I ran it on my 8-vCPU MacBook Pro. Maybe you can increase the concurrency; try 20000, 30000...
This problem is likely related to Go's goroutine scheduling. You can try to reproduce it locally first.

@mgkanani
Contributor

mgkanani commented Mar 14, 2020

@nolouch,
Thank you for the info. Earlier I tried those options but somehow could not reproduce it.
Right now, I am able to reproduce it with 35000 concurrency on my system.

I tried a few options. The issue stopped occurring when GC was turned off and invoked only once per second. The reasoning:

  • In client.go, tsoReqPool caches request objects to avoid allocation cost.
  • As per pool.go, if GC is triggered enough times, it will shrink the pool.

Consider the following sequence of steps:

  • tsoReqPool.Put(x) triggered 10k times --> free pool size = 10000, victimSize = 0
  • GC() --> free pool cache moved to the victim cache --> localSize (free pool) = 0, victimSize = 10k
  • Pool.Get() called 5k times --> localSize = 0, victimSize = 5k // 5k allocations avoided
  • GC() --> localSize = 0, victimSize = 0
  • Pool.Put(x) called 5k times --> localSize = 5k, victimSize = 0
  • GC() --> localSize = 0, victimSize = 5k
  • GC() --> localSize = 0, victimSize = 0
  • Pool.Get() called 5k times --> localSize = 0, victimSize = 0 // 5k new allocations required

So it takes two GC() invocations to clear the pool's unused memory.
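
For reference, here is a minimal, standalone sketch of this two-GC behavior (Go 1.13+ sync.Pool semantics assumed; the pooled object and the counts are arbitrary, not the client's actual types):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	allocations := 0
	pool := sync.Pool{New: func() interface{} {
		allocations++
		return new([128]byte)
	}}

	// Warm the pool's local free list with a few objects.
	objs := make([]interface{}, 0, 4)
	for i := 0; i < 4; i++ {
		objs = append(objs, pool.Get())
	}
	for _, o := range objs {
		pool.Put(o)
	}
	fmt.Println("allocations after warm-up:", allocations) // 4

	runtime.GC() // the local free list is moved to the victim cache
	pool.Get()
	fmt.Println("allocations after one GC + Get:", allocations) // still 4: served from the victim cache

	runtime.GC() // the victim cache is dropped; the (now empty) local list becomes the new victim
	pool.Get()
	fmt.Println("allocations after two GCs + Get:", allocations) // 5: a fresh allocation was needed
}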

Changes I made to disable GC and trigger it only once per second:

diff --git a/tools/pd-tso-bench/main.go b/tools/pd-tso-bench/main.go
index 0753b3ad..aaa5cfac 100644
--- a/tools/pd-tso-bench/main.go
+++ b/tools/pd-tso-bench/main.go
@@ -19,6 +19,7 @@ import (
        "fmt"
        "os"
        "os/signal"
+        "runtime"
        "sync"
        "syscall"
        "time"
@@ -101,6 +102,7 @@ func showStats(ctx context.Context, durCh chan time.Duration) {
        for {
                select {
                case <-ticker.C:
+                       runtime.GC()
                        println(s.String())
                        total.merge(s)
                        s = newStats()

Commands to run:

make pd-tso-bench
export GOGC=off; bin/pd-tso-bench -C=35000

PS:
If the Go version is 1.12, then just one GC() flushes the Pool's entire free memory.
This has been improved from 1.13 onwards (with the victim cache).

@nolouch
Contributor Author

nolouch commented Mar 14, 2020

Amazing! @mgkanani, that's awesome work and your point is fantastic. PTAL @disksing @tiancaiamao.
I tested it on my workstation; it does improve things, but the same problem remains:

❯ export GOGC=off; bin/pd-tso-bench -C=10000
count:606386, max:7, min:0, >1ms:23556, >2ms:14991, >5ms:7497, >10ms:0, >30ms:0
count:610780, max:7, min:0, >1ms:18153, >2ms:6176, >5ms:969, >10ms:0, >30ms:0
count:617561, max:13, min:0, >1ms:30014, >2ms:17428, >5ms:15079, >10ms:8808, >30ms:0
count:615149, max:7, min:0, >1ms:23845, >2ms:10466, >5ms:5286, >10ms:0, >30ms:0
count:619419, max:5, min:0, >1ms:22575, >2ms:3694, >5ms:172, >10ms:0, >30ms:0
count:615440, max:5, min:0, >1ms:12348, >2ms:4285, >5ms:3, >10ms:0, >30ms:0
count:612421, max:10, min:0, >1ms:18230, >2ms:1427, >5ms:232, >10ms:161, >30ms:0
count:615669, max:15, min:0, >1ms:13467, >2ms:10074, >5ms:3700, >10ms:2501, >30ms:0
count:582374, max:8, min:0, >1ms:19808, >2ms:729, >5ms:962, >10ms:0, >30ms:0
count:580256, max:11, min:0, >1ms:23707, >2ms:6213, >5ms:4486, >10ms:428, >30ms:0
count:605998, max:10, min:0, >1ms:20625, >2ms:236, >5ms:295, >10ms:109, >30ms:0
count:609601, max:16, min:0, >1ms:22940, >2ms:2588, >5ms:8761, >10ms:1803, >30ms:0
count:593323, max:5, min:0, >1ms:27557, >2ms:2178, >5ms:27, >10ms:0, >30ms:0
count:604243, max:5, min:0, >1ms:18285, >2ms:1836, >5ms:4, >10ms:0, >30ms:0
count:599705, max:10, min:0, >1ms:13896, >2ms:134, >5ms:1077, >10ms:170, >30ms:0
count:611364, max:5, min:0, >1ms:19566, >2ms:6184, >5ms:65, >10ms:0, >30ms:0
count:613677, max:5, min:0, >1ms:19995, >2ms:1883, >5ms:35, >10ms:0, >30ms:0
count:609633, max:5, min:0, >1ms:19990, >2ms:1127, >5ms:31, >10ms:0, >30ms:0
count:606090, max:6, min:0, >1ms:23818, >2ms:3037, >5ms:1196, >10ms:0, >30ms:0
count:566737, max:15, min:0, >1ms:31681, >2ms:24356, >5ms:8482, >10ms:7351, >30ms:0
count:601414, max:7, min:0, >1ms:27836, >2ms:14706, >5ms:2242, >10ms:0, >30ms:0
count:593375, max:9, min:0, >1ms:23476, >2ms:15011, >5ms:793, >10ms:0, >30ms:0
count:599079, max:6, min:0, >1ms:26653, >2ms:10718, >5ms:404, >10ms:0, >30ms:0
count:1343571, max:18, min:0, >1ms:4489, >2ms:15553, >5ms:1238196, >10ms:31353, >30ms:0
count:1440656, max:26, min:5, >1ms:0, >2ms:0, >5ms:1413910, >10ms:26746, >30ms:0
count:818916, max:142, min:0, >1ms:16072, >2ms:22754, >5ms:507016, >10ms:233779, >30ms:28853
count:840317, max:45, min:0, >1ms:18455, >2ms:12826, >5ms:463418, >10ms:317002, >30ms:10068
count:998703, max:51, min:5, >1ms:0, >2ms:0, >5ms:735523, >10ms:260104, >30ms:3076
count:1471386, max:30, min:4, >1ms:0, >2ms:194, >5ms:1423303, >10ms:47869, >30ms:20

The numbers in the last few lines of this output expose the problem.

The main problem we want to solve: when TiDB is very busy, entries like the following occasionally appear in the log:

[2020/03/15 00:30:48.006 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=61.689168ms]
[2020/03/15 00:30:48.010 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=65.400363ms]
[2020/03/15 00:30:48.010 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=63.914003ms]
[2020/03/15 00:32:21.983 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=39.321432ms]
[2020/03/15 00:32:21.983 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=38.332206ms]
[2020/03/15 00:32:21.983 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=38.294493ms]
[2020/03/15 00:32:42.749 +08:00] [INFO] [coprocessor.go:855] ["[TIME_COP_WAIT] resp_time:776.512795ms txnStartTS:415289458539888643 region_id:22 store_addr:172.16.5.38:20160 kv_wait_ms:764"]
[2020/03/15 00:32:46.029 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=88.071116ms]
[2020/03/15 00:32:46.044 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=99.70434ms]
[2020/03/15 00:33:03.409 +08:00] [INFO] [coprocessor.go:855] ["[TIME_COP_PROCESS] resp_time:436.692862ms txnStartTS:415289464044912643 region_id:22 store_addr:172.16.5.38:20160"]
[2020/03/15 00:34:47.085 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=39.095808ms]
[2020/03/15 00:34:47.092 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=41.958004ms]
[2020/03/15 00:34:48.371 +08:00] [INFO] [coprocessor.go:855] ["[TIME_COP_PROCESS] resp_time:397.802191ms txnStartTS:415289491570032646 region_id:38 store_addr:172.16.5.38:20160"]
[2020/03/15 00:35:09.531 +08:00] [INFO] [coprocessor.go:855] ["[TIME_COP_WAIT] resp_time:556.581142ms txnStartTS:415289497075056644 region_id:38 store_addr:172.16.5.38:20160 kv_wait_ms:546"]
[2020/03/15 00:35:30.319 +08:00] [INFO] [coprocessor.go:855] ["[TIME_COP_PROCESS] resp_time:345.113814ms txnStartTS:415289502580080645 region_id:24 store_addr:172.16.5.38:20160"]
[2020/03/15 00:36:09.965 +08:00] [WARN] [pd.go:109] ["get timestamp too slow"] ["cost time"=36.675042ms]

It's very easy to reproduce:

  • Start 1 TiDB, 1 TiKV, and 1 PD; remember to run TiDB with a config like:

    max-server-connections = 40960
    token-limit = 100000
    
  • Use sysbench to benchmark TiDB:

     #client 1
     sysbench --report-interval=10 --time=1000 --rand-type=uniform --threads=4000 --mysql-host=127.0.0.1 --mysql-port=4001 --mysql-user=root --mysql-db=sbtest1 --db-driver=mysql --mysql-ignore-errors=8002 oltp_point_select --thread-stack-size=1k --tables=1 --table-size=200000 run
    
     #client 2
     sysbench --report-interval=10 --time=1000 --rand-type=uniform --threads=4000 --mysql-host=127.0.0.1 --mysql-port=4001 --mysql-user=root --mysql-db=sbtest1 --db-driver=mysql --mysql-ignore-errors=8002 oltp_point_select --thread-stack-size=1k --tables=1 --table-size=200000 run
    
     ....
    

Do you have any ideas?

@tiancaiamao
Contributor

  1. You can't disable GC in a real scenario.
  2. If the sync.Pool were replaced by a custom one, could we alleviate the GetTS latency? (One possible shape is sketched after this list.)
  3. The problem behind the scenes is that the TSO goroutine is too crucial: its performance affects all other goroutines, since they depend on it. I wrote a blog post about it (in Chinese) a while back.
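
For illustration of point 2 only, a custom pool could be as simple as a buffered channel used as a free list; its contents survive GC cycles, at the cost of a fixed capacity and some channel overhead. The tsoRequest fields below are placeholders, not the client's actual struct, and whether this actually helps would have to be measured against sync.Pool under the same benchmark:

package pool

// tsoRequest loosely mirrors the pooled request object in the PD client;
// the fields here are illustrative only.
type tsoRequest struct {
	done     chan error
	physical int64
	logical  int64
}

// chanPool is a fixed-capacity free list backed by a buffered channel.
// Unlike sync.Pool, its contents are not dropped by the garbage collector.
type chanPool struct {
	free chan *tsoRequest
}

func newChanPool(capacity int) *chanPool {
	return &chanPool{free: make(chan *tsoRequest, capacity)}
}

func (p *chanPool) Get() *tsoRequest {
	select {
	case req := <-p.free:
		return req
	default:
		return &tsoRequest{done: make(chan error, 1)}
	}
}

func (p *chanPool) Put(req *tsoRequest) {
	select {
	case p.free <- req:
	default:
		// The pool is full: drop the object and let the GC reclaim it.
	}
}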

@mgkanani

@mgkanani
Contributor

@nolouch
Can you try with these changes?

@tiancaiamao,
Yes, we should not disable GC in production systems.
Goroutines are designed as co-routines, and in Go there is no priority among them. If there are enough CPU-intensive goroutines, latency problems surface.
There has been some work to ease this with preemption; hopefully it reduces the latency problem if such goroutines exist in pd-clients.

In my changes, I chose the default array size based on metrics data.
These changes are temporary; we should not use a summary if the distribution is known, since using a summary in the pd-client increases memory requirements and quantile calculation is also CPU-intensive.
Basically, the array used to start small and double on overflow; the updated code avoids the allocation and extra copying cost as long as the size is < 256 (a minimal sketch of the idea follows).
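
For illustration, the pre-sizing idea looks roughly like this (the 256 default is taken from the paragraph above; the type and field names are placeholders, not the client's actual code):

package batch

// request stands in for the client's pooled TSO request; the real type differs.
type request struct {
	done chan error
}

// defaultBatchCapacity is the typical batch size observed from metrics.
const defaultBatchCapacity = 256

type batcher struct {
	pending []*request
}

func newBatcher() *batcher {
	// Pre-allocating to the typical batch size avoids the repeated
	// grow-and-copy that append performs when starting from a nil slice.
	return &batcher{pending: make([]*request, 0, defaultBatchCapacity)}
}

func (b *batcher) add(r *request) {
	b.pending = append(b.pending, r) // no reallocation while len < 256
}

func (b *batcher) reset() {
	b.pending = b.pending[:0] // keep the backing array for the next batch
}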

During my investigation, I found that the gRPC server handling TSO requests never reported latency concerns (partially due to batching on the client side before requests are sent to the PD server).
One way to handle the latency issue is through capacity planning. We should come up with data like:

  • With X cores and Y GB of memory, TiDB can handle Z requests to the PD cluster

Based on that data, we can scale/shard the PD cluster or the TiDB cluster, whichever is the bottleneck.

@nolouch
Contributor Author

nolouch commented Mar 25, 2020

Thanks @mgkanani.
I hit a problem with commit 54c1468; it looks like it is caused by currentSize.

@mgkanani
Contributor

mgkanani commented Jun 11, 2020

  1. The problem behind the scenes is that the TSO goroutine is too crucial: its performance affects all other goroutines, since they depend on it. I wrote a blog post about it (in Chinese) a while back.

@mgkanani

TL;DR
@tiancaiamao
As part of #2300, I tried various options to fix the issue: runtime.Gosched(), and atomics plus sleeps in different variations. Later I realized that channels are used not just for message passing but also for waking the blocked goroutine.
I used pprof and tracing while running the TSO benchmark (a sketch of how such profiles can be collected follows).
Below are my findings and a proposal to address them.
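
For context, one common way to collect such profiles from a running benchmark is to expose net/http/pprof; the listen address here is an arbitrary choice, and this is not necessarily how the numbers below were gathered:

package profiling

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers, including the trace endpoint
)

// StartProfiling exposes pprof over HTTP so CPU profiles and execution traces
// can be pulled while the benchmark is running, for example:
//
//	go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
//	curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5" && go tool trace trace.out
func StartProfiling() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}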

As per the current implementation:

Long-running routines:

  • tsCancelLoop —> watches for timeouts while a request is out on the network
  • tsLoop —> responsible for batching requests; contacts the PD stream server and returns a response to each request by calling finishTsoLoop
  • GetTSAsync —> origin of a new request before batching; pushes the newly created struct onto a channel for batching (tsLoop)
  • Wait —> selectively waits for either context-done or request-done

Bottlenecks:

  • The channel used for batching, i.e. Client.tsoRequests: tsLoop is the consumer and GetTSAsync is the producer.
  • There is one tsLoop goroutine, and many GetTSAsync goroutines try to access the same channel at different times (a micro-benchmark sketch of this contention pattern follows this list).
    • Profile data observations (pprof):
      • GetTSAsync —> the channel's lock call contributed 9.07%, of which 5.62% was runtime.usleep (invoked via runtime.lock) —> a clear indication of contention on lock acquisition
      • tsLoop, receiving from the channel —> lock while receiving from the channel: 0.64% (0.13% spent in runtime.usleep while acquiring the channel lock), and 0.12% in select
    • Trace data observation (the trace tool's synchronization blocking profile): ~24.82% of GetTSAsync time was blocked on runtime.chansend1
  • tsLoop and tsCancelLoop communicate via a channel with buffer size 1. If tsLoop gets enough CPU time and tsCancelLoop doesn't, then due to the buffer size of 1, tsLoop will be blocked. I couldn't find this in the profile/trace data, but it can potentially occur.
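
As a rough way to observe this contention in isolation, a micro-benchmark along the following lines (many producers, one consumer, a single buffered channel; the buffer size is an assumption, not the client's actual constant) can later be compared against any replacement data structure:

package contention

import "testing"

// BenchmarkSingleChannelManyProducers has every parallel worker push into the
// same buffered channel while a single goroutine drains it, roughly the
// GetTSAsync -> tsLoop shape described above.
func BenchmarkSingleChannelManyProducers(b *testing.B) {
	ch := make(chan int, 10000)
	done := make(chan struct{})

	go func() { // single consumer, standing in for tsLoop
		for range ch {
		}
		close(done)
	}()

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			ch <- 1 // every producer contends on the same channel lock
		}
	})

	close(ch)
	<-done
}

Running it with increasing -cpu values should show the per-operation cost rising as more producers fight over the same channel lock.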

Proposal:

tsLoop, being the consumer, should never have to wait for a lock as long as data is available.
We can change the design as follows:

  1. Use a lock-free ring buffer for batching (a hedged sketch follows below).
    • The pattern here is multi-producer, single-consumer.
    • Channels also handle blocking/resuming a goroutine, so we have to come up with our own mechanism to wake a goroutine during finishTsoLoop.
  2. Offload the finishTsoLoop call from the critical path; tsCancelLoop can handle it. However, this can be taken care of later.
  3. I have figured out a way to block and unblock a goroutine in an event-driven fashion, similar to https://github.com/mgkanani/mgkanani.github.io/tree/master/golang/runtime.
    • This API can be used with the ring-buffer implementation to block/wake goroutines under various conditions.
    • In fact, based on the data structures, we can control which goroutines get woken, and hence indirectly define a priority.

I am thinking of trying the above approach and comparing it with channels via benchmarks.
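
A hedged sketch of the ring-buffer part of point 1, in the style of Dmitry Vyukov's bounded MPMC queue specialized to a single consumer; it covers only the lock-free enqueue/dequeue fast path, and the block/wake mechanism from point 3 would still need to be layered on top:

package ringbuf

import (
	"runtime"
	"sync/atomic"
)

// cell holds one slot of the ring; seq implements the handshake between
// producers and the consumer (Vyukov-style sequence numbers).
type cell struct {
	seq  uint64
	data interface{}
}

// RingBuffer is a bounded, lock-free multi-producer / single-consumer queue.
type RingBuffer struct {
	cells   []cell
	mask    uint64
	enqueue uint64 // shared by producers, advanced with CAS
	dequeue uint64 // owned by the single consumer
}

// New creates a ring buffer; size must be a power of two.
func New(size uint64) *RingBuffer {
	rb := &RingBuffer{cells: make([]cell, size), mask: size - 1}
	for i := range rb.cells {
		rb.cells[i].seq = uint64(i)
	}
	return rb
}

// Push is safe to call from many producer goroutines; it spins (yielding)
// while the buffer is full.
func (rb *RingBuffer) Push(v interface{}) {
	for {
		pos := atomic.LoadUint64(&rb.enqueue)
		c := &rb.cells[pos&rb.mask]
		seq := atomic.LoadUint64(&c.seq)
		switch {
		case seq == pos: // the slot is free for this lap: try to claim it
			if atomic.CompareAndSwapUint64(&rb.enqueue, pos, pos+1) {
				c.data = v
				atomic.StoreUint64(&c.seq, pos+1) // publish to the consumer
				return
			}
		case seq < pos: // the consumer has not freed this slot yet: buffer full
			runtime.Gosched()
		}
		// seq > pos: another producer raced ahead; reload and retry.
	}
}

// Pop must only be called from the single consumer goroutine; it returns
// false when the buffer is empty.
func (rb *RingBuffer) Pop() (interface{}, bool) {
	pos := rb.dequeue
	c := &rb.cells[pos&rb.mask]
	if atomic.LoadUint64(&c.seq) != pos+1 {
		return nil, false // nothing published in this slot yet
	}
	v := c.data
	c.data = nil
	atomic.StoreUint64(&c.seq, pos+rb.mask+1) // hand the slot back to producers
	rb.dequeue = pos + 1
	return v, true
}

An empty Pop still has to park the consumer somehow, which is exactly where the custom block/wake API mentioned above would come in.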

Tradeoffs:

  • The downside of this approach is that we have to patch Go's runtime source until the language offers a similar mechanism.

@tiancaiamao
Contributor

A kind reminder: the benchmark program may not reflect the real case.

In the benchmark, the batch size can be quite large: producers call GetAsync at a much higher rate than the consumer handles the data, and the channel buffer could be a bottleneck.
In TiDB, there are not so many concurrent GetAsync calls.

Another factor is the goroutine count on the scheduler.
You may add 100 / 200 / 500 / 1000 other goroutines which just do nothing, and let them run together with the benchmark program to see their effect (a sketch follows below). In TiDB there may be 1000~3000 unrelated goroutines.
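
A minimal sketch of this suggestion; the flag name, tick interval, and goroutine body are arbitrary choices rather than anything from pd-tso-bench:

package main

import (
	"context"
	"flag"
	"time"
)

// startBackgroundLoad spawns n goroutines that wake up periodically, so the
// benchmark competes for the scheduler roughly the way it would inside a real
// TiDB process with thousands of unrelated goroutines.
func startBackgroundLoad(ctx context.Context, n int) {
	for i := 0; i < n; i++ {
		go func() {
			ticker := time.NewTicker(10 * time.Millisecond)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
					// Waking up periodically is enough to keep these goroutines
					// cycling through the scheduler's run queues without doing
					// any real work.
				}
			}
		}()
	}
}

func main() {
	extra := flag.Int("extra-goroutines", 1000, "number of unrelated background goroutines")
	flag.Parse()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	startBackgroundLoad(ctx, *extra)

	// ... run the TSO benchmark workload here, as in pd-tso-bench ...
	time.Sleep(time.Minute)
}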

@mgkanani

@mgkanani
Contributor

A kind reminder: the benchmark program may not reflect the real case.

In the benchmark, the batch size can be quite large: producers call GetAsync at a much higher rate than the consumer handles the data, and the channel buffer could be a bottleneck.
In TiDB, there are not so many concurrent GetAsync calls.

Another factor is the goroutine count on the scheduler.
You may add 100 / 200 / 500 / 1000 other goroutines which just do nothing, and let them run together with the benchmark program to see their effect. In TiDB there may be 1000~3000 unrelated goroutines.

@mgkanani

I agree with your remarks. With sysbench, I didn't see data indicating that the tsoRequests channel is the bottleneck. In the tracing data, I saw many goroutines: to be specific, for 4000 clients there were ~24000 each of github.com/pingcap/tidb/store/tikv.(*copIteratorWorker).run and github.com/pingcap/tidb/store/tikv.(*copIteratorTaskSender).run goroutines, apart from 4000 github.com/pingcap/tidb/server.(*Server).onConn. This kept the number of runnable goroutines in the hundreds to thousands at all times.

Go follows cooperative and fair scheduling principles. Even if we could schedule tsLoop whenever we wanted, it would not solve the latency problem: finishTsoLoop unblocks the (*client).Wait goroutines, and those goroutines are not always scheduled immediately because of the high number of runnable goroutines.

It is better to create a separate issue in the TiDB repo and close this one. In TiDB, we should look at reducing unnecessary CPU-intensive goroutines.

@tiancaiamao

@nolouch
Contributor Author

nolouch commented Jun 28, 2020

Hi @mgkanani,
As the issue says, the goal is to improve the long-tail problem in TiDB. So this issue still needs investigation.
