Recommendation: Have an Incident Leader
A promising thread of investigation is often cut off before it bears fruit because someone on the call has a different opinion about the likely cause, or simply does not understand the current line of investigation and pushes their own agenda. A single incident leader keeps the investigation focused.
Recommendation: Keep an Incident Log
Recommendation: Have Architectural Documents Ready
Recommendation: Progress through well-defined Incident Stages
Recommendation: Upload all available logs to a Support Ticket as soon as possible (see the cbcollect_info sketch after this list)
Recommendation: Have Additional Servers ready to deploy in each environment
Recommendation: Properly Size your Machines & Cluster
Recommendation: Triage from the Top Down
Recommendation: Avoid Bias
Recommendation: Test Tickets
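For the log-upload recommendation above, cbcollect_info can gather a node's logs and upload them to a support ticket in one step. This is a minimal sketch: the upload host, customer name, and ticket number are placeholders, and the exact flags may vary by Server version (check cbcollect_info --help).
# Run on each node in the cluster; writes a zip locally and uploads it to the
# (hypothetical) support ticket 12345 for the (hypothetical) customer "Example Corp".
/opt/couchbase/bin/cbcollect_info /tmp/$(hostname)-collect.zip \
    --upload-host=uploads.couchbase.com \
    --customer="Example Corp" \
    --ticket=12345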
To start with, when building an internal tier-one support team, it is important that those providing support review and understand all of the following guides and best practices.
If the use case involves query:
With a strong understanding of those guides, one should be able to handle most tier-one support issues.
Sometimes, client-to-Couchbase interactions can time out. If you are having issues with timeouts, the open-tracing API can help determine the cause and where the timeout is actually happening. How to use open-tracing is language dependent. Exposing these APIs also lets you use tools like jaegertracing or other commercial products to dig into where the timeouts are happening.
Response Time Observability (RTO) is part of open-tracing and has four parts:
Data Service Tracing must first be enabled to use RTO on the data service.
curl -u Administrator:password -X POST localhost:8091/pools/default/settings/memcached/global \
--data tracing_enabled=true/false
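For example, to turn tracing on using the same endpoint and credentials as above:
# Enable data service tracing (use tracing_enabled=false to turn it back off).
curl -u Administrator:password -X POST localhost:8091/pools/default/settings/memcached/global \
  --data tracing_enabled=true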
This then produces messages like the following in memcached.log:
2018-07-17T15:30:05.308518Z INFO 37: HELO [{"a":"libcouchbase/2.9.2 (Darwin-17.6.0; x86_64; \
Clang 9.1.0.9020039)","i":"00000000b35f58aa/39cc1fd4eeaf1d67"}] \
TCP nodelay, XATTR, XERROR, Select bucket, Snappy, JSON, Tracing \
[ 127.0.0.1:55200 - 127.0.0.1:11210 (not authenticated) ]
...
2018-07-17T15:30:07.084239Z WARNING 37: Slow operation. \
{"cid":"00000000b35f58aa/39cc1fd4eeaf1d67/0","duration":"1771 ms", \
"trace":"request=329316910555375:1771975 \
get=329316911686879:35 \
bg.wait=329316911701108:4715 \
bg.load=329318677724103:1766022 \
get=329318682489143:21",
"command":"GET","peer":"127.0.0.1:55200"}
Name | Definition | Comments |
---|---|---|
request | From KV-Engine receiving request (from OS) to sending response (to OS). | Overall KV-Engine view of the request. |
bg.wait | From KV-Engine detecting background fetch needed, to starting to read from disk. | Long duration suggests contention on Reader Threads. |
bg.load | From KV-Engine starting to read from disk to loading the document. | Long duration suggests slow disk subsystem. |
get | From KV-Engine parsing a GET request to processing it. | For docs which are not resident, you'll see two instances of this span. |
get.if | From KV-Engine parsing a GET_IF request to processing it. | Similar to get, but used for certain requests (e.g. XATTR handling)
get.stats | From KV-Engine parsing a STATS request to processing it. | Different STATS keys have different costs (e.g. disk stats are slower). |
store | From KV-Engine parsing a STORE request to processing it. | Typically fast (as it only has to write into the HashTable).
set.with.meta | From KV-Engine parsing a SET_WITH_META request to processing it. | Similar to store but for XDCR / restore updates. |
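The trace field is compact, so a short shell sketch can make it readable. This assumes a default Linux install (for the log path) and that each span is encoded as name=start:duration with the duration in microseconds, as in the example above (request shows 1771975 µs, i.e. the reported 1771 ms).
# Print every span of each "Slow operation" entry with its duration in ms.
grep 'Slow operation' /opt/couchbase/var/lib/couchbase/logs/memcached.log \
  | sed -n 's/.*"trace":"\([^"]*\)".*/\1/p' \
  | tr ' ' '\n' \
  | awk -F'[=:]' 'NF == 3 { printf "%-14s %10.2f ms\n", $1, $3 / 1000 }'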
SDK Logging is on by default. Below is an example from an RTO Log.
{
  "service": "kv",
  "count": 15,
  "top": [
    {
      "operation_name": "get",
      "last_operation_id": "0x21",
      "last_local_address": "10.211.55.3:52450",
      "last_remote_address": "10.112.180.101:11210",
      "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
      "total_duration_us": 18908,
      "encode_us": 256,
      "dispatch_us": 825,
      "decode_us": 16,
      "server_duration_us": 14
    }
  ]
}
Name | Definition | Comments |
---|---|---|
operation_name | Type of operation | |
last_operation_id | A combination of type of operation and ID | Useful for troubleshooting in combination with the local_id |
last_local_address | The local socket used for this operation | |
last_remote_address | Socket used on server for this request | Useful when determining which node processed this request |
last_local_id | This ID is negotiated with the server | Can be used to correlate logging information on both sides |
total_duration_us | The total time it took to perform the full operation | |
encode_us | Time the client took to encode the request | Grows with the size and complexity of the JSON
dispatch_us | Time from when the client sent the request to when the response arrived in the client's ring buffer | Time spent traversing the network can be estimated as dispatch_us - server_duration_us
decode_us | Time the client took to decode the response | Grows with the size and complexity of the JSON
server_duration_us | Time the server took to do its work. | - |
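Because dispatch_us includes the time the server spent working, the network component of each over-threshold operation can be estimated as dispatch_us - server_duration_us. A minimal jq sketch, assuming the log entry above has been saved to a hypothetical file rto.json:
# Break each reported operation's time out into network and server components.
jq '.top[] | { operation: .operation_name,
               total_us: .total_duration_us,
               network_us: (.dispatch_us - .server_duration_us),
               server_us: .server_duration_us }' rto.json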
Orphan logging is also enabled by default in the SDK. It aggregates responses whose request has already timed out; for example, a KV get exceeds its timeout and an error is returned to the application, but the response arrives some time afterwards. The log interval is 10 seconds at the WARN level, with a per-service sample size of 10. Below is an example of an orphan log.
{
  "service": "kv",
  "count": 2,
  "top": [
    {
      "s": "kv:get",
      "i": "0x21",
      "c": "66388CF5BFCF7522/18CC8791579B567C",
      "b": "default",
      "l": "192.168.1.101:11210",
      "r": "10.112.181.101:12110",
      "d": 120
    }
  ]
}
Name | Definition |
---|---|
s | Service type |
i | Operation ID |
c | Connection ID |
b | Bucket |
l | Local endpoint and port
r | Remote endpoint and port
d | Duration (us) |
t | Timeout |
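A similar sketch for the orphan log (again assuming the entry above has been saved to a hypothetical file orphans.json) shows which remote node each orphaned response came from and how long it took:
# List orphaned responses by service, remote endpoint, and duration (µs).
jq '.top[] | { service: .s, remote: .r, duration_us: .d }' orphans.json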
Combining these three logs, a troubleshooter should be able to trace timings all the way from the client to the server and back, and identify any operations that timed out or are performing slowly. Providing this information to support will greatly help in determining the cause of timeouts.
On each Couchbase node, multiple processes run to maintain Couchbase and provide the services needed for operation. At times, and normally with the guidance of support, it may be necessary to restart some of those processes to recover from degraded operation. A complete list of processes can be found at https://docs.couchbase.com/server/current/install/server-processes.html
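Before restarting anything, it is worth confirming which of these processes are actually running on the node. A minimal sketch using the process names from the sections below (beam.smp is the Erlang VM that hosts the cluster manager):
# Show PID, elapsed run time, and command name for the Couchbase processes of interest.
ps -eo pid,etime,comm | grep -E 'beam.smp|memcached|projector|indexer|cbq-engine|goxdcr'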
What it does: The cluster manager (ns_server) provides the admin UI and RESTful administrative APIs, coordinates cluster membership, manages rebalancing, etc.
Why to restart: Connections not being released, rebalance that won’t ‘stop’
Impact of restarting:
How:
curl --data "erlang:halt()." -u <Administrator>:<password> http://<host>:8091/diag/eval
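After the cluster manager restarts, a quick way to confirm the node reports itself healthy again is the /pools/default endpoint (a sketch; jq is only used for readability):
# Each node should report status "healthy" once the cluster manager is back up.
curl -s -u <Administrator>:<password> http://<host>:8091/pools/default \
  | jq '.nodes[] | { hostname, status }'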
What it does: This is the data service. It manages all the vbuckets and the data within the vbuckets.
Why to restart:
Impact of restarting: Causes a hard failover of the data node.
How:
pkill memcached
What it does: This process runs on each Data node. It consumes DCP messages, filters and formats index entries, and sends them to the indexer.
Why to restart:
Impact of restarting:
How:
pkill projector
What it does: The indexer runs on each Index node. It manages index records in memory and on disk. This process is invoked by the query engine to select query candidates.
Why to restart:
Impact of restarting:
How:
pkill indexer
What it does: The query service runs on each Query node. Each N1QL query is assigned to one node. Calls from the query service go out to the Index, Data, and FTS services as required.
Why to restart:
Impact of restarting:
How:
pkill cbq-engine
What it does: The goxdcr process runs on each Data node. It is responsible for the data replication (XDCR) work done on both the source and target sides.
Why to restart:
Impact of restarting:
How:
pkill goxdcr