llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)
* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
parent 78a1ba0a4f
commit a19b5cef16
6 changed files with 59 additions and 6 deletions
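For context, here is a minimal sketch of the graph-side part of the fix, under stated assumptions. Embedding-only models can run without a KV cache, so K and V reach the attention op as freshly computed F32 tensors rather than F16 cache views; the fix casts them to F16 before building the flash-attention op (while ggml's FA op itself gains F32 V support). The helper name `build_attn_no_cache`, the tensor comments, and the scalar arguments below are illustrative assumptions, not the actual llama.cpp code; `ggml_cast` and `ggml_flash_attn_ext` are the real ggml API calls.

```c
#include "ggml.h"

// Sketch only: cast F32 K/V to F16 when there is no KV cache to read from,
// then build the flash-attention op. Names and shapes are illustrative.
static struct ggml_tensor * build_attn_no_cache(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,        // query                      (F32)
        struct ggml_tensor  * k,        // key,   no cache backing it (F32)
        struct ggml_tensor  * v,        // value, no cache backing it (F32)
        struct ggml_tensor  * kq_mask,  // attention mask             (F16)
        float                 kq_scale) {
    // in the normal decode path K/V are F16 views into the KV cache; without
    // a cache they come straight out of the graph in F32, so cast them down
    if (k->type == GGML_TYPE_F32) {
        k = ggml_cast(ctx, k, GGML_TYPE_F16);
    }
    if (v->type == GGML_TYPE_F32) {
        v = ggml_cast(ctx, v, GGML_TYPE_F16);
    }

    // kq_scale is typically 1/sqrt(n_embd_head);
    // max_bias = 0.0f (no ALiBi), logit_softcap = 0.0f (disabled)
    return ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, 0.0f, 0.0f);
}
```

On the server side, the diff below updates a Python script that issues concurrent /embedding requests, replacing a 1024-character string of zeros with 1022 repetitions of "a ":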
```diff
@@ -15,7 +15,7 @@ async def main():
     model_url = "http://127.0.0.1:6900"
     responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
         url= f"{model_url}/embedding",
-        json= {"content": str(0)*1024}
+        json= {"content": "a "*1022}
     ) for i in range(n)])
 
     for response in responses:
```