@@ -10,34 +10,89 @@ have been performed on your production system before you go live.
1010
1111## Operating System
1212
13- - Executed the OS optimization scripts if you run ArangoDB on Linux.
13+ - Executed the operating system (OS) optimization scripts if you run ArangoDB on Linux.
1414 See [Installing ArangoDB on Linux](../operations/installation/linux/_index.md) and its sub pages
1515 [Linux Operating System Configuration](../operations/installation/linux/operating-system-configuration.md) and
1616 [Linux OS Tuning Script Examples](../operations/installation/linux/linux-os-tuning-script-examples.md) for details.
1717
18- - OS monitoring is in place
19- (most common metrics, e.g. disk, CPU, RAM utilization).
18+ - Ensure your OS is compatible with your ArangoDB version
19+ and keep it up to date at all times for security and stability.
20+
21+ - OS monitoring is in place with specific alerting thresholds:
22+ - **Disk usage**: Alert when reaching 60% (red line threshold).
23+ - **CPU usage**: Alert when reaching 90% (red line threshold).
24+ - **Memory usage**: Alert when reaching 85% (red line threshold).
2025
2126- Disk space monitoring is in place. Consider setting up alerting to avoid out-of-disk situations.
2227
2328## ArangoDB
2429
25- - The user _root_ is not used to run any ArangoDB processes
30+ - **Use the latest versions**: Deploy the latest version series
31+ of ArangoDB to benefit from performance improvements and security fixes.
32+
33+ - **Testing environments**: Use QA environments and UAT (User Acceptance Testing)
34+ to test all changes, in particular queries, before going live with production deployments.
35+
36+ ### Security
37+
38+ - Create a dedicated system user and group (e.g., "arango")
39+ to run ArangoDB processes. Never use the _root_ user to run any ArangoDB processes
2640 (if you run ArangoDB on Linux).
2741
42+ - **Access control**: Restrict access to the deployment to authorized personnel only.
43+ Implement proper authentication and authorization mechanisms.
44+
45+ - **JWT authentication**: Enable JWT authentication
46+ for production deployments. See [JWT authentication](../develop/http-api/authentication.md#jwt-user-tokens) for more details.
47+
48+ - **Encryption**: Enable [Encryption at Rest](../operations/security/encryption-at-rest.md)
49+ for sensitive data. Make sure to safely store any secret keys you create for this.
50+
51+ ### Logging and Monitoring
52+
2853- The _arangod_ (server) process and the _arangodb_ (_Starter_) process
2954 (if in use) have some form of logging enabled and logs can easily be
3055 located and inspected.
31-
32- - *Memory considerations*
33- - If you run multiple processes (e.g. DB-Server and Coordinator) on a single
34- machine, adjust the [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
35- environment variable accordingly.
36- - For versions prior to 3.8, make sure to change the
37- [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit)
38- query option according to the node size and workload.
39- - Disable swap space to avoid slowdown which can result in servers being incorrectly
40- detected as failed.
56+
57+ - **Third-party monitoring**: Configure third-party metrics monitoring tools like
58+ Grafana with Prometheus to monitor ArangoDB metrics comprehensively.
59+
60+ - **Configure metrics collection**: Enable the ArangoDB metrics API for production monitoring:
61+ - Set [`--server.export-metrics-api`](../components/arangodb-server/options.md#--serverexport-metrics-api) to `true` to enable the metrics endpoints.
62+ - Enable [`--server.export-read-write-metrics`](../components/arangodb-server/options.md#--serverexport-read-write-metrics) for additional document read/write metrics.
63+ - Consider enabling [`--server.export-shard-usage-metrics`](../components/arangodb-server/options.md#--serverexport-shard-usage-metrics) for detailed shard usage tracking.
64+ - Configure your monitoring system (Prometheus/Grafana) to scrape the `/_admin/metrics/v2` endpoint.
65+ - See [HTTP interface for server metrics](../develop/http-api/monitoring/metrics.md) for detailed information.
66+
67+ - **Enable RocksDB statistics**: Consider setting [`--rocksdb.enable-statistics`](../components/arangodb-server/options.md#--rocksdbenable-statistics) to `true` for detailed RocksDB performance metrics.
68+
69+ - Monitor the ArangoDB provided metrics with alerting based on the threshold guidelines:
70+ - Disk usage: 60% (red line)
71+ - CPU usage: 90% (red line)
72+ - Memory usage: 85% (red line)
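The red-line thresholds above can be expressed as a small check. This is only a minimal sketch: the `RED_LINES` table and the `breached` function are hypothetical names, and the usage percentages are assumed to come from whatever monitoring system you already run (for example, values derived from Prometheus queries).

```python
# Red-line thresholds from the checklist, in percent.
RED_LINES = {"disk": 60.0, "cpu": 90.0, "memory": 85.0}


def breached(usage_percent: dict[str, float]) -> list[str]:
    """Return the names of resources at or above their red-line threshold.

    Resources missing from `usage_percent` are treated as 0% used.
    """
    return sorted(
        name
        for name, limit in RED_LINES.items()
        if usage_percent.get(name, 0.0) >= limit
    )


if __name__ == "__main__":
    # Hypothetical sample readings, not real measurements.
    sample = {"disk": 61.5, "cpu": 42.0, "memory": 85.0}
    for resource in breached(sample):
        print(f"ALERT: {resource} usage reached its red line")
```

In practice you would more likely encode these thresholds as alerting rules in the monitoring system itself; the sketch only illustrates the "alert when reaching" comparison.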
73+
74+ ### Memory
75+
76+ - For DB-Servers and Coordinators, set the
77+ [`ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY`](../components/arangodb-server/environment-variables.md)
78+ environment variable using this rule of thumb:
79+ - Multiply available memory by 0.9 to leave headroom for OS/Kubernetes, client connections, etc.
80+ - Use 3/4 of that value for DB-Servers.
81+ - Use 1/4 of that value for Coordinators.
82+ - Agents typically don't need much memory and can use the remaining 10% headroom.
83+
84+ - Note that if ArangoDB "sees" x GB of memory in a pod,
85+ it will try to use those x GB. Memory accounting has been vastly improved in 3.12,
86+ but overshooting in certain cases may still occur.
87+
88+ - Disable swap space to avoid slowdown which can result in servers being incorrectly
89+ detected as failed.
90+
91+ - **Query memory limits**: Configure appropriate memory limits for AQL queries:
92+ - Set [`--query.memory-limit`](../components/arangodb-server/options.md#--querymemory-limit) to limit memory usage per individual query.
93+ - Consider setting [`--query.global-memory-limit`](../components/arangodb-server/options.md#--queryglobal-memory-limit) to limit total memory used by all concurrent queries.
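The rule of thumb for `ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY` can be worked through as follows. This is a sketch under assumptions: a DB-Server and a Coordinator share the same machine or pod, the values are expressed in bytes, and `memory_overrides` is a hypothetical helper name.

```python
def memory_overrides(available_bytes: int) -> dict[str, int]:
    """Split available memory following the 0.9 / three-quarters / one-quarter rule.

    Integer arithmetic is used so the results are exact byte counts.
    """
    if available_bytes <= 0:
        raise ValueError("available_bytes must be positive")
    usable = available_bytes * 9 // 10  # keep 10% headroom for OS, Agents, etc.
    return {
        "dbserver": usable * 3 // 4,   # 3/4 of the usable memory
        "coordinator": usable // 4,    # 1/4 of the usable memory
    }


if __name__ == "__main__":
    total = 64 * 1024**3  # assumed example: a machine with 64 GiB of memory
    for role, value in memory_overrides(total).items():
        print(f"{role}: ARANGODB_OVERRIDE_DETECTED_TOTAL_MEMORY={value}")
```

Each process is then started with its own value of the environment variable. Treat the numbers as a starting point and adjust them to your workload.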
94+
95+ ### Service Management
4196
4297- Ensure ArangoDB will be automatically restarted (e.g. by using a systemd service file). Typically
4398 you would use the Kubernetes operator or use systemd to launch the _Starter_.
@@ -50,36 +105,56 @@ have been performed on your production system before you go live.
50105 update-rc.d -f arangodb3 remove
51106 ```
52107
53- - If you have deployed a Cluster, the _replication factor_ and
54- _minimal_replication_factor_ of your collections
55- are set to a value equal or higher than 2, otherwise you run the risk of
56- losing data in case of a node failure. See
57- [cluster startup options](../components/arangodb-server/options.md#cluster).
58-
59- - *Disk Performance considerations*
60- - Verify that your **storage performance** is at least 100 IOPS for each
61- volume in production mode. This is the bare minimum and it's recommended to
62- provide more for performance. It is probably only a concern if you use a
63- cloud infrastructure. Note that IOPS might be allotted based on a volume size,
64- so make sure to check your storage provider for details. Furthermore, you should
65- be careful with burst mode guarantees as ArangoDB requires a sustainable
66- high IOPS rate.
67-
68- - The considerations should be given to an IO bandwidth (especially considering
69- RocksDB write-amplification which can easily be 10x or more).
70-
71- - Whenever possible use **block storage**. Database data is based on append
72- operations, so filesystem which support this should be used for best
73- performance. We would not recommend to use NFS for performance reasons,
108+ ### Cluster Configuration
109+
110+ - **Replication configuration**: For production clusters, configure collections with:
111+ - _replication factor_ of 3 for optimal data availability and fault tolerance.
112+ - _minimal_replication_factor_ of a value equal to or higher than 2.
113+ - _writeConcern_ of 2.
114+ See [cluster startup options](../components/arangodb-server/options.md#cluster).
115+
116+ - **Shard limits**: Keep the total number of shards below 10,000 across your cluster
117+ to maintain optimal performance and avoid resource exhaustion.
118+
119+ ### Disk Performance
120+
121+ - **Storage performance**: Verify that your storage performance is at least 100 IOPS for each
122+ volume in production mode. This is the bare minimum and it's recommended to
123+ provide more for performance. It is probably only a concern if you use a
124+ cloud infrastructure. Note that IOPS might be allotted based on a volume size,
125+ so make sure to check your storage provider for details. Furthermore, you should
126+ be careful with burst mode guarantees as ArangoDB requires a sustainable
127+ high IOPS rate.
128+
129+ - **DB-Server storage limit**: Keep individual DB-Server storage below 2 TB per server to maintain optimal performance.
130+
131+ - **I/O bandwidth**: Plan for sufficient I/O bandwidth, especially considering
132+ RocksDB write amplification, which can easily be 10x or more.
133+
134+ - **Block storage**: Whenever possible use block storage. Database data is based on append
135+ operations, so filesystems which support this should be used for best
136+ performance. ArangoDB does not recommend using NFS for performance reasons.
74137 Furthermore, we experienced some issues with hard links required for
75138 Hot Backup.
76139
77- - Verify your **Backup** and restore procedures are working.
140+ ### Backup and Recovery
141+
142+ - **Test restore procedures**: Verify your backup and restore procedures are working.
143+ **TEST YOUR RESTORE PROCEDURE** regularly to ensure you can recover from failures.
144+
145+ - **Hot Backup frequency**: Take Hot Backups with a frequency that matches your
146+ RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements.
147+
148+ - **arangodump backups**: Take backups with arangodump from time to time as an
149+ additional backup strategy alongside Hot Backups.
78150
79- - Consider enabling [Encryption at Rest](../operations/security/encryption-at-rest.md).
80- Make sure to safely store any secret keys you create for this.
151+ - **Secure backup storage**: Store backups in a secure, separate location from your
152+ production systems. Use encrypted storage and ensure backups are geographically
153+ distributed to protect against regional disasters. Implement proper access controls
154+ for backup storage locations.
81155
82- - Monitor the ArangoDB provided metrics (e.g. by using Prometheus/Grafana).
156+ - **Retry mechanisms**: Implement retries with exponential backoff and jitter in your applications
157+ when connecting to ArangoDB to handle temporary network issues and failovers gracefully.
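A retry loop with exponential backoff and jitter could look like the following. This is a minimal sketch, not an ArangoDB driver API: `retry_with_backoff` and all of its parameters are hypothetical, the "full jitter" variant (a random delay between zero and the capped exponential value) is one common choice among several, and the exception types worth retrying depend on the client library you use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 10.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on the given exceptions with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay cap doubles on every attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, cap] to spread out retries.
            sleep(random.uniform(0.0, cap))
    raise ValueError("max_attempts must be at least 1")
```

A call site would wrap the request it wants to protect, for example `retry_with_backoff(lambda: run_query(...))`, where `run_query` stands in for your own driver call. Only retry operations that are safe to repeat.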
83158
84159## Kubernetes Operator (kube-arangodb)
85160
@@ -89,4 +164,4 @@ have been performed on your production system before you go live.
89164- The [**ReclaimPolicy**](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming)
90165 of your persistent volumes should be set to `Retain` to prevent volumes from premature deletion.
91166
92- - Use native networking whenever possible to reduce delays.
167+ - Use native networking whenever possible to reduce delays.