Benchmark Methodology
Our goal is to produce fair, reproducible, and meaningful benchmark results. This page documents our methodology in detail so anyone can understand, critique, and reproduce our tests.
Principles
- Reproducibility First - Every benchmark can be reproduced using our open-source tooling and documented configurations. We publish all scripts, configurations, and raw data.
- Two Configuration Modes - Each database is tested with default configurations first, and optimized configurations second, so users can see both out-of-box and tuned performance.
- Apples-to-Apples Comparison - We test on identical hardware configurations and measure the same metrics across all systems.
- Statistical Rigor - We run multiple iterations, report percentiles (p50, p95, p99), and document variance to ensure results are statistically meaningful.
Configuration Modes
Defaults
The "Defaults" configuration tests databases with minimal changes from their out-of-box settings:
- Install the database using standard package manager or official Docker image
- Only change settings required for the benchmark to run (connection limits, authentication)
- No performance tuning applied
- Durability settings left at defaults
This answers: "What do I get without any tuning?"
Optimized (Coming Soon)
Optimized configurations are coming soon. We're working with database communities to develop fair, well-tuned configurations.
The "Optimized" configuration will test databases with tuned settings:
- Memory allocation tuned for hardware (buffer pools, caches)
- I/O settings optimized for NVMe storage
- Connection handling tuned for benchmark concurrency
- Configurations reviewed by database maintainers or community experts
This will answer: "What's the performance ceiling with proper tuning?"
Database Benchmarks
Database benchmarks run against self-hosted database engines to measure the raw performance of the engine itself.
Current: GitHub-Hosted Runners
Benchmarks currently run on GitHub-hosted runners (ubuntu-latest):
| Component | Specification |
|---|---|
| Machine | GitHub-hosted runner |
| CPU | 4 cores |
| Memory | 16 GB |
| Storage | SSD |
GitHub-hosted runners have limited resources. Results are useful for relative comparisons but not indicative of production performance.
Coming Soon: Dedicated Infrastructure
We're setting up a self-hosted GitHub Actions runner on dedicated AWS infrastructure in us-east-1 for more realistic benchmarks.
Future benchmarks will run on:
| Component | Specification |
|---|---|
| Machine | AWS EC2 (dedicated instance) |
| CPU | 16 cores |
| Memory | 128 GB |
| Storage | 1 TB NVMe SSD (direct-attached) |
| Region | us-east-1 |
Test Procedure
1. Environment Setup
```bash
# Fresh environment via GitHub Actions
# Install database with documented version
# Apply configuration (defaults or optimized)
# Verify configuration is loaded correctly
```
2. Data Loading
- Load the TPC-C dataset with configured warehouse count
- Wait for any background processes to complete (compaction, vacuuming)
- Verify data integrity
- Record load time
3. Warm-up Phase
- Run workload for 5 minutes to warm caches
- Discard warm-up results
- Verify system is in steady state
4. Measurement Phase
- Run workload for 30 minutes
- Record all latency samples
- Monitor system resources (CPU, memory, I/O)
5. Cool-down and Verification
- Allow system to quiesce
- Verify data integrity
- Export raw results
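The warm-up and measurement phases above can be sketched as a small harness. This is an illustrative sketch only, not our actual tooling; `run_transaction` is a hypothetical callable standing in for a single benchmark transaction:

```python
import time

def run_phase(run_transaction, duration_s):
    """Run transactions for duration_s seconds and collect latency samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        run_transaction()
        samples.append(time.monotonic() - start)
    return samples

def benchmark(run_transaction, warmup_s=300, measure_s=1800):
    # Warm-up phase: run the workload, then discard the results
    run_phase(run_transaction, warmup_s)
    # Measurement phase: only these samples are reported
    return run_phase(run_transaction, measure_s)
```

The key point is that warm-up samples never mix into the reported distribution, so cold-cache latencies cannot skew the percentiles.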
Current Workload: TPC-C
We currently run TPC-C-style benchmarks using BenchBase (formerly OLTPBench).
Transaction Mix
| Transaction | Percentage | Description |
|---|---|---|
| New Order | 45% | Create new customer orders |
| Payment | 43% | Process customer payments |
| Order Status | 4% | Query order status |
| Delivery | 4% | Process batch deliveries |
| Stock Level | 4% | Check warehouse inventory |
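The mix above amounts to a weighted random choice over transaction types. This sketch illustrates the idea; it is not BenchBase's actual scheduler:

```python
import random

# TPC-C transaction mix used in our runs (percent of all transactions)
TPCC_MIX = {
    "new_order": 45,
    "payment": 43,
    "order_status": 4,
    "delivery": 4,
    "stock_level": 4,
}

def pick_transaction(rng=random):
    """Pick the next transaction type according to the weighted mix."""
    names = list(TPCC_MIX)
    weights = list(TPCC_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]
```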
Scale Factor
The dataset size scales with warehouse count:
| Warehouses | Approximate Data Size |
|---|---|
| 10 | ~1 GB |
| 100 | ~10 GB |
| 1000 | ~100 GB |
Default benchmarks use 100 warehouses (~10 GB) to ensure the working set exceeds typical cache sizes while keeping run times reasonable.
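The table above implies roughly 100 MB of initial data per warehouse, so dataset size can be estimated ahead of a run. A back-of-envelope sketch (illustrative only; actual on-disk size varies by engine and storage format):

```python
MB_PER_WAREHOUSE = 100  # approximate initial TPC-C data size per warehouse

def approx_dataset_gb(warehouses):
    """Rough initial dataset size in GB for a given warehouse count."""
    return warehouses * MB_PER_WAREHOUSE / 1000
```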
Concurrency Levels
We test at multiple concurrency levels to understand scaling behavior:
- 16 concurrent connections
- 64 concurrent connections
- 256 concurrent connections
- 512 concurrent connections
Metrics Collected
Primary Metrics
| Metric | Description |
|---|---|
| Throughput | Transactions per second (TPS) |
| Latency p50 | Median response time |
| Latency p95 | 95th percentile response time |
| Latency p99 | 99th percentile response time |
| Latency p99.9 | 99.9th percentile response time (tail latency) |
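The percentile metrics above can be computed from raw latency samples with the nearest-rank method. A minimal sketch of that computation (our published numbers come from the full BenchBase sample set, not this code):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the distribution."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Latency summary in the shape we report for each run."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "p99.9": percentile(latencies_ms, 99.9),
    }
```

Note that p99.9 needs at least ~1000 samples to be meaningful, which is one reason the measurement phase runs for a full 30 minutes.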
Secondary Metrics
- CPU utilization (user, system, iowait)
- Memory usage (used, cached, buffers)
- Disk I/O (IOPS, throughput, latency)
Configuration Transparency
Every tested database has a published configuration file in our repository.
Defaults Example (PostgreSQL)
```ini
# Only essential changes from defaults
max_connections = 600
listen_addresses = '*'
# All other settings at PostgreSQL defaults
```
Optimized Example (Coming Soon)
```ini
# Tuned for 128 GB RAM, NVMe storage
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 256MB
maintenance_work_mem = 2GB
max_connections = 600
max_parallel_workers = 8
wal_level = replica
synchronous_commit = on
```
We always test with durability enabled (synchronous_commit, fsync, etc.). Benchmarks with durability disabled would be noted separately.
Automation
Benchmarks run automatically via GitHub Actions. This ensures:
- Consistency: Same environment for every run
- Reproducibility: Anyone can trigger the same workflow
- Transparency: All logs and artifacts are public
Benchmark Schedule
- Triggered manually for new database versions
- Scheduled weekly for regression detection
- All results automatically published to this site
Reproducing Results
All benchmarks can be reproduced using our open-source tooling:
```bash
# Clone the benchmark repository
git clone https://github.com/supabase/oltp-benchmark

# Run TPC-C benchmark for PostgreSQL with defaults
./run-benchmark.sh postgresql --config defaults --warehouses 100 --duration 1800

# Run with optimized config (when available)
./run-benchmark.sh postgresql --config optimized --warehouses 100 --duration 1800
```
View the source code on GitHub.
Service Provider Benchmarks (Coming Soon)
Service provider benchmarks will have a separate methodology document when released.
Service provider benchmarks will differ from database benchmarks:
| Aspect | Database Benchmark | Service Provider Benchmark |
|---|---|---|
| Infrastructure | Standardized local hardware | Provider's infrastructure |
| Configuration | Defaults + Optimized | Limited to provider options |
| What's measured | Raw database performance | Complete service performance |
| Network | Local, minimal latency | Cloud networking included |
Future Benchmarks
We're considering adding:
- YCSB: Key-value style workloads (read-heavy, write-heavy, mixed)
- Point Lookups: Pure primary key read performance
- Secondary Index: Query performance on non-primary indexes
- High Concurrency: 1000+ connection scaling tests
Limitations and Caveats
Benchmarks are not a substitute for testing your specific workload. Results may vary significantly with your query patterns, data distribution and access patterns, network topology, and operational requirements (backups, replication, etc.).
What We Don't Test
- Multi-region replication latency
- Failover and recovery time
- Operational complexity
- Cost per transaction
- Specific SQL feature performance