Benchmark Methodology
Our goal is to produce fair, reproducible, and meaningful benchmark results. This page documents our methodology in detail so anyone can understand, critique, and reproduce our tests.
Principles
- Reproducibility First - Every benchmark can be reproduced using our open-source tooling and documented configurations. We publish all scripts, configurations, and raw data.
- Two Configuration Modes - Each database is tested with default configurations first, and optimized configurations second, so users can see both out-of-box and tuned performance.
- Apples-to-Apples Comparison - We test on identical hardware configurations and measure the same metrics across all systems.
- Statistical Rigor - We run multiple iterations, report percentiles (p50, p95, p99), and document variance to ensure results are statistically meaningful.
Configuration Modes
Defaults
The "Defaults" configuration tests databases with minimal changes from their out-of-box settings:
- Install the database using standard package manager or official Docker image
- Only change settings required for the benchmark to run (connection limits, authentication)
- No performance tuning applied
- Durability settings left at defaults
This answers: "What do I get without any tuning?"
Optimized (Coming Soon)
Optimized configurations are coming soon. We're working with database communities to develop fair, well-tuned configurations.
The "Optimized" configuration will test databases with tuned settings:
- Memory allocation tuned for hardware (buffer pools, caches)
- I/O settings optimized for NVMe storage
- Connection handling tuned for benchmark concurrency
- Configurations reviewed by database maintainers or community experts
This will answer: "What's the performance ceiling with proper tuning?"
Database Benchmarks
Database benchmarks run against self-hosted database engines to measure the raw performance of the engine itself.
Current: GitHub-Hosted Runners
Benchmarks currently run on GitHub-hosted runners (ubuntu-latest):
| Component | Specification |
|---|---|
| Machine | GitHub-hosted runner |
| CPU | 4 cores |
| Memory | 16 GB |
| Storage | SSD |
GitHub-hosted runners have limited resources. Results are useful for relative comparisons but not indicative of production performance.
Coming Soon: Dedicated Infrastructure
We're setting up a self-hosted GitHub Actions runner on dedicated AWS infrastructure in us-east-1 for more realistic benchmarks.
Future benchmarks will run on:
| Component | Specification |
|---|---|
| Machine | AWS EC2 (dedicated instance) |
| CPU | 16 cores |
| Memory | 128 GB |
| Storage | 1 TB NVMe SSD (direct-attached) |
| Region | us-east-1 |
Test Procedure
1. Environment Setup
```bash
# Fresh environment via GitHub Actions
# Install database with documented version
# Apply configuration (defaults or optimized)
# Verify configuration is loaded correctly
```
2. Data Loading
- Load the TPC-C dataset with configured warehouse count
- Wait for any background processes to complete (compaction, vacuuming)
- Verify data integrity
- Record load time
3. Warm-up Phase
- Run workload for 5 minutes to warm caches
- Discard warm-up results
- Verify system is in steady state
4. Measurement Phase
- Run workload for 30 minutes
- Record all latency samples
- Monitor system resources (CPU, memory, I/O)
5. Cool-down and Verification
- Allow system to quiesce
- Verify data integrity
- Export raw results
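The warm-up and measurement phases above can be sketched as a small harness. This is an illustrative sketch only, not our actual tooling; `run_transaction` is a hypothetical callable standing in for a single benchmark transaction:

```python
import time

def run_phase(run_transaction, duration_s):
    """Run transactions for duration_s seconds and collect latency samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        run_transaction()
        samples.append(time.monotonic() - start)
    return samples

def benchmark(run_transaction, warmup_s=300, measure_s=1800):
    # Warm-up phase: run the workload, then discard the results
    run_phase(run_transaction, warmup_s)
    # Measurement phase: only these samples are reported
    return run_phase(run_transaction, measure_s)
```

The key point is that warm-up samples never mix into the reported distribution, so cold-cache latencies cannot skew the percentiles.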
Current Workload: TPC-C
We currently run TPC-C-style benchmarks using BenchBase (formerly OLTPBench).
Transaction Mix
| Transaction | Percentage | Description |
|---|---|---|
| New Order | 45% | Create new customer orders |
| Payment | 43% | Process customer payments |
| Order Status | 4% | Query order status |
| Delivery | 4% | Process batch deliveries |
| Stock Level | 4% | Check warehouse inventory |
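The mix above amounts to a weighted random choice over transaction types. This sketch illustrates the idea; it is not BenchBase's actual scheduler:

```python
import random

# TPC-C transaction mix used in our runs (percent of all transactions)
TPCC_MIX = {
    "new_order": 45,
    "payment": 43,
    "order_status": 4,
    "delivery": 4,
    "stock_level": 4,
}

def pick_transaction(rng=random):
    """Pick the next transaction type according to the weighted mix."""
    names = list(TPCC_MIX)
    weights = list(TPCC_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]
```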
Scale Factor
The dataset size scales with warehouse count:
| Warehouses | Approximate Data Size |
|---|---|
| 10 | ~1 GB |
| 100 | ~10 GB |
| 1000 | ~100 GB |
Default benchmarks use 100 warehouses (~10 GB) to ensure the working set exceeds typical cache sizes while keeping run times reasonable.
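The table above implies roughly 100 MB of initial data per warehouse, so dataset size can be estimated ahead of a run. A back-of-envelope sketch (illustrative only; actual on-disk size varies by engine and storage format):

```python
MB_PER_WAREHOUSE = 100  # approximate initial TPC-C data size per warehouse

def approx_dataset_gb(warehouses):
    """Rough initial dataset size in GB for a given warehouse count."""
    return warehouses * MB_PER_WAREHOUSE / 1000
```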
Concurrency Levels
We test at multiple concurrency levels to understand scaling behavior:
- 16 concurrent connections
- 64 concurrent connections
- 256 concurrent connections
- 512 concurrent connections
Metrics Collected
Primary Metrics
| Metric | Description |
|---|---|
| Throughput | Transactions per second (TPS) |
| Latency p50 | Median response time |
| Latency p95 | 95th percentile response time |
| Latency p99 | 99th percentile response time |
| Latency p99.9 | 99.9th percentile response time (tail latency) |
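The percentile metrics above can be computed from raw latency samples with the nearest-rank method. A minimal sketch of that computation (our published numbers come from the full BenchBase sample set, not this code):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the distribution."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Latency summary in the shape we report for each run."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "p99.9": percentile(latencies_ms, 99.9),
    }
```

Note that p99.9 needs at least ~1000 samples to be meaningful, which is one reason the measurement phase runs for a full 30 minutes.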
Secondary Metrics
- CPU utilization (user, system, iowait)
- Memory usage (used, cached, buffers)
- Disk I/O (IOPS, throughput, latency)
Configuration Transparency
Every tested database has a published configuration file in our repository.
Defaults Example (PostgreSQL)
```ini
# Only essential changes from defaults
max_connections = 600
listen_addresses = '*'
# All other settings at PostgreSQL defaults
```
Optimized Example (Coming Soon)
```ini
# Tuned for 128 GB RAM, NVMe storage
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 256MB
maintenance_work_mem = 2GB
max_connections = 600
max_parallel_workers = 8
wal_level = replica
synchronous_commit = on
```
We always test with durability enabled (synchronous_commit, fsync, etc.). Benchmarks with durability disabled would be noted separately.
Automation
Benchmarks run automatically via GitHub Actions. This ensures:
- Consistency: Same environment for every run
- Reproducibility: Anyone can trigger the same workflow
- Transparency: All logs and artifacts are public
Benchmark Schedule
- Triggered manually for new database versions
- Scheduled weekly for regression detection
- All results automatically published to this site
Reproducing Results
All benchmarks can be reproduced using our open-source tooling:
```bash
# Clone the benchmark repository
git clone https://github.com/supabase/oltp-benchmark

# Run TPC-C benchmark for PostgreSQL with defaults
./run-benchmark.sh postgresql --config defaults --warehouses 100 --duration 1800

# Run with optimized config (when available)
./run-benchmark.sh postgresql --config optimized --warehouses 100 --duration 1800
```
View the source code on GitHub.
Service Provider Benchmarks (Coming Soon)
Service provider benchmarks will have a separate methodology document when released.
Service provider benchmarks will differ from database benchmarks:
| Aspect | Database Benchmark | Service Provider Benchmark |
|---|---|---|
| Infrastructure | Standardized local hardware | Provider's infrastructure |
| Configuration | Defaults + Optimized | Limited to provider options |
| What's measured | Raw database performance | Complete service performance |
| Network | Local, minimal latency | Cloud networking included |
Future Benchmarks
We're considering adding:
- YCSB: Key-value style workloads (read-heavy, write-heavy, mixed)
- Point Lookups: Pure primary key read performance
- Secondary Index: Query performance on non-primary indexes
- High Concurrency: 1000+ connection scaling tests
Limitations and Caveats
Benchmarks are not a substitute for testing your specific workload. Results may vary significantly with your query patterns, data distribution and access patterns, network topology, and operational requirements (backups, replication, etc.).
What We Don't Test
- Multi-region replication latency
- Failover and recovery time
- Operational complexity
- Cost per transaction
- Specific SQL feature performance