Embedding API Latency and Caching
An empirical comparison of embedding API latency from OpenAI, Cohere, Google Vertex AI, and Jina, examining temporal and environmental effects and emphasizing caching benefits.
Embeddings are a key ingredient in many modern ML technologies, such as RAG and semantic search. But every time you hit the OpenAI API to embed your search query, are you paying too high a latency toll?
We measured embedding API latency across OpenAI, Cohere, Google Vertex AI, and Jina, and now we know whether the time of day (or even the weather!) affects it. And yes, after seeing the numbers you will always cache embeddings!
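The caching takeaway follows from the fact that embedding calls are deterministic: the same input text always yields the same vector, so a repeated query never needs a second network round trip. A minimal in-memory memoizer in Scala sketches the idea (the `fetchEmbedding` stub stands in for a real OpenAI/Cohere/Vertex/Jina call and is purely illustrative, not the project's actual client):

```scala
import scala.collection.concurrent.TrieMap

object EmbeddingCache {
  // text -> embedding vector; TrieMap is thread-safe for concurrent lookups
  private val cache = TrieMap.empty[String, Vector[Double]]
  var apiCalls = 0 // counts real (uncached) requests, for demonstration

  // Stand-in for a remote embedding API call. In practice this is where
  // the 100ms+ latency toll measured in the talk would be paid.
  private def fetchEmbedding(text: String): Vector[Double] = {
    apiCalls += 1
    text.map(_.toDouble).toVector // dummy vector, not a real embedding
  }

  // Return the cached vector if present; otherwise fetch once and store.
  def embed(text: String): Vector[Double] =
    cache.getOrElseUpdate(text, fetchEmbedding(text))
}
```

With this shape, embedding the same search query twice costs one API call instead of two; for a production system a persistent store (e.g. Redis keyed by a hash of the input) would replace the in-memory map.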
This project is a load tester for embedding APIs.
- Scala: a statically typed, multi-paradigm JVM language designed by Martin Odersky that blends object-oriented and functional programming, with full interoperability with Java libraries. Its concise, type-safe style and strong concurrency support make it a good fit for building fast, distributed tools like this load tester.
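The core of such a load tester boils down to timing repeated requests and summarizing the samples with percentiles. A minimal Scala sketch of that loop (the helper names and the nearest-rank percentile method are illustrative assumptions, not the project's actual implementation):

```scala
// Run `request` n times sequentially, returning per-call latency in ms.
def sampleLatencies(n: Int)(request: () => Unit): Vector[Long] =
  Vector.fill(n) {
    val t0 = System.nanoTime()
    request()
    (System.nanoTime() - t0) / 1000000 // nanoseconds -> milliseconds
  }

// Nearest-rank percentile over the collected samples (p in [0, 100]).
def percentile(samples: Vector[Long], p: Double): Long = {
  val sorted = samples.sorted
  sorted(((p / 100.0) * (sorted.length - 1)).round.toInt)
}
```

Reporting p50 and p95/p99 rather than a mean is what makes temporal effects (time of day, region, provider load) visible in the tail, which is where the comparison across OpenAI, Cohere, Vertex AI, and Jina gets interesting.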
Related projects
How we build the next generation embeddings and rerank model
Berlin
This talk explains the development of a state-of-the-art embeddings and rerank model that surpasses OpenAI's text-embedding-v3, enhancing AI…
Beyond Text: Building a fast Visual Search Engine
Berlin
Learn how to build a production‑grade visual search engine using 1280‑dimensional embeddings, multi‑tenant ingestion, GPU inference, and sub‑second…
High-throughput embedding generation for Vector DB corpus fill
San Francisco
This talk demonstrates an optimized TensorRT-LLM embedding runtime achieving up to twice the performance of alternatives, with code,…
AI Computer
Berlin
Learn how to build a desktop PC with an RTX 3090 for local AI workloads, covering hardware assembly, software…
Embedding Models in Action: From Category Mapping to Visual Search
Hamburg
Learn how embedding models automate product category mapping across marketplaces and power a visual search engine that detects…
The fastest cold starts in the world - a new type of docker registry and kubernetes written in rust
London
Learn how a Rust‑based Docker registry and rebuilt containerd reduce AI model cold start times by 3‑6×, with…