
Server: Unix Socket Support #6413

Closed
adrianliechti wants to merge 6 commits


adrianliechti

The idea of this pull request is to ease integration of the llama.cpp server by using Unix sockets instead of TCP.
cpp-httplib has support for unix sockets built in: yhirose/cpp-httplib#1346

my idea was to not add an additional parameter, but use a --host prefix: unix:// (similar to docker's client/server pattern).
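To make the prefix convention concrete, here is a minimal Python sketch of how a `--host` value could be split into either a TCP host or a Unix socket path. The names `ListenAddress` and `parse_host` are hypothetical and only illustrate the idea; the actual change in this PR lives in the C++ server code and may differ.

```python
from dataclasses import dataclass
from typing import Optional

UNIX_PREFIX = "unix://"


@dataclass(frozen=True)
class ListenAddress:
    """Where the server should listen: a TCP host or a Unix socket path."""

    host: Optional[str] = None         # set for TCP
    socket_path: Optional[str] = None  # set for Unix sockets

    @property
    def is_unix(self) -> bool:
        return self.socket_path is not None


def parse_host(value: str) -> ListenAddress:
    """Split a --host value into a TCP host or a Unix socket path."""
    if value.startswith(UNIX_PREFIX):
        path = value[len(UNIX_PREFIX):]
        if not path:
            raise ValueError("unix:// prefix given without a socket path")
        return ListenAddress(socket_path=path)
    # Anything without the prefix is treated as a plain TCP host.
    return ListenAddress(host=value)
```

With this convention, `unix:///tmp/llama.sock` yields the path `/tmp/llama.sock`, and any other value is passed through unchanged as a TCP host, so existing command lines keep working.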

a very first attempt is here, mainly to understand if this is something you could imagine in the code.

(the file should not exist before)
./server --host unix:///tmp/llama.sock --model ~/Projects/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf

connect using socat:

socat TCP-LISTEN:1234,fork UNIX-CONNECT:/tmp/llama.sock
curl http://localhost:1234/v1/models

connect using curl:

curl --unix-socket /tmp/llama.sock http://localhost/v1/models
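For clients that cannot shell out to curl, the same request can be made from Python's standard library. This is a sketch under assumptions: `UnixHTTPConnection` and `get` are hypothetical helper names, and the socket path and URL path are simply the ones used in the commands above.

```python
import http.client
import socket
from typing import Tuple


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 10.0):
        # "localhost" only fills the Host header; routing uses the socket path.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def get(socket_path: str, url_path: str) -> Tuple[int, bytes]:
    """Send a GET request over the socket and return (status, body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", url_path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Calling `get("/tmp/llama.sock", "/v1/models")` would mirror the curl command, assuming the server is running and listening on that socket.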

open points:

  • make path absolute?
  • some error handling?
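One possible answer to both open points, sketched in Python rather than in the server's C++: resolve the path to an absolute one, then decide what to do if a file is already there. The policy shown (remove a stale socket, refuse anything else) is an assumption for illustration and not what this PR implements; `prepare_socket_path` is a hypothetical name.

```python
import os
import socket
import stat


def prepare_socket_path(path: str) -> str:
    """Return an absolute socket path, removing a stale socket file if present."""
    path = os.path.abspath(path)
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return path  # nothing in the way
    if not stat.S_ISSOCK(mode):
        # Refuse to delete something that is not a socket.
        raise FileExistsError(f"{path} exists and is not a socket")
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        os.unlink(path)  # stale socket: no process is listening
        return path
    finally:
        probe.close()
    # The connect succeeded, so another process owns this socket.
    raise OSError(f"{path} is in use by another process")
```

Probing with a connect before unlinking avoids pulling the socket out from under a server that is still running.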

Contributor

github-actions bot commented Mar 31, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 534 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8746.54ms p(90)=25799.57ms fails=0, finish reason: stop=534 truncated=0
  • Prompt processing (pp): avg=235.76tk/s p(90)=696.9tk/s total=206.45tk/s
  • Token generation (tg): avg=100.17tk/s p(90)=269.19tk/s total=131.19tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=0b70ac0f6606fd1583afeed5a0bacec035d34444
Time series charts (duration=10m, 534 iterations, x-axis 1711960154 --> 1711960782): llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.

@FSSRepo
Collaborator

FSSRepo commented Apr 1, 2024

This should be tested on all platforms.

Collaborator

@phymbert phymbert left a comment

Cool, please add a simple test scenario with unix socket family.

@adrianliechti
Author

@FSSRepo
you are right. i tested on linux and macos, and added an #ifndef for windows, similar to httplib's implementation. do you have more platforms in mind?

@phymbert
might you have some guidance here? shall i add a sample shell script or extend the python test suite?
i mainly ask because i don't want to slow down every test cycle for such a niche feature...

@phymbert
Collaborator

phymbert commented Apr 1, 2024

might you have some guidance here? shall i add a sample shell script or extend the python test suite?

I suggest adding a simple dedicated scenario in a new feature using unix://. I hope no additional changes are required since we already checked the sock family in the python glue.
Regarding the overhead of the new scenario: we are using a very small model, so an extra scenario adds only a few seconds. It's OK.
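As a rough illustration of what choosing the socket family in test glue could look like, here is a minimal sketch. It is not the actual helper from the server test suite: the function name `is_server_listening` and its signature are assumptions, and the `unix://` handling follows the prefix proposed in this PR.

```python
import socket

UNIX_PREFIX = "unix://"


def is_server_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts connections at host (TCP or unix://)."""
    if host.startswith(UNIX_PREFIX):
        # The port is ignored for Unix sockets; the path is the whole address.
        family, address = socket.AF_UNIX, host[len(UNIX_PREFIX):]
    else:
        family, address = socket.AF_INET, (host, port)
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex(address) == 0
```

A dedicated scenario could start the server with `--host unix:///tmp/llama.sock`, wait until a check like this succeeds, and then issue a single request over the socket.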

@ggerganov
Owner

Regarding server tests, @phymbert has provided quite good documentation over here: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests

One way to improve this even further and help new contributors to implement tests, is to reference a very small PR that introduces a basic server test, without any extra changes. I'm not sure if we have one yet - if not, we can create, and we can point people to that PR as a starting point for implementing new tests.

@phymbert
Collaborator

phymbert commented Apr 1, 2024

One way to improve this even further and help new contributors to implement tests, is to reference a very small PR that introduces a basic server test, without any extra changes. I'm not sure if we have one yet - if not, we can create, and we can point people to that PR as a starting point for implementing new tests.

Yes, a good example is:

Contributor

github-actions bot commented May 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 529 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8867.72ms p(95)=21960.19ms fails=, finish reason: stop=471 truncated=58
  • Prompt processing (pp): avg=100.77tk/s p(95)=417.74tk/s
  • Token generation (tg): avg=45.8tk/s p(95)=47.8tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=d7a7a780c95de47c96dcc16585099412d89e24be

Time series charts (duration=10m, 529 iterations, x-axis 1717783052 --> 1717783678): llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.

# Conflicts:
#	examples/server/server.cpp
@adrianliechti adrianliechti closed this by deleting the head repository Jun 30, 2024
4 participants