We are running a MongoDB replica set on Kubernetes. One of the MongoDB pods is in CrashLoopBackOff and shows OOMKilled as true. The pod has crashed 234 times so far.
We have one primary and two secondaries.
Here are the latest logs. The container lives for about a minute and then crashes again. I am trying to understand what the logs mean.
What does OplogStartMissing mean?
{"log":"2022-03-08T09:24:44.127+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1646656464, 1), t: 58 }. source's GTE: { ts: Timestamp(1646656801, 1), t: 60 } hashes: (2206456552855381608/8108672600344202316)\n","stream":"stdout","time":"2022-03-08T09:24:44.12744806Z"}
{"log":"2022-03-08T09:24:44.127+0000 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method because UUID support is feature compatible with featureCompatibilityVersion 3.6.\n","stream":"stdout","time":"2022-03-08T09:24:44.12747365Z"}
{"log":"2022-03-08T09:24:44.127+0000 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY\n","stream":"stdout","time":"2022-03-08T09:24:44.127477084Z"}
{"log":"2022-03-08T09:24:44.127+0000 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: mongodb-2.mongodb.maglev-system.svc.cluster.local:27017\n","stream":"stdout","time":"2022-03-08T09:24:44.127480067Z"}
{"log":"2022-03-08T09:24:44.133+0000 I ROLLBACK [rsBackgroundSync] Finding the Common Point\n","stream":"stdout","time":"2022-03-08T09:24:44.133319869Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] our last optime: Timestamp(1646656464, 1)\n","stream":"stdout","time":"2022-03-08T09:24:44.136901468Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] their last optime: Timestamp(1646731479, 1)\n","stream":"stdout","time":"2022-03-08T09:24:44.136912166Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] diff in end of log times: -75015 seconds\n","stream":"stdout","time":"2022-03-08T09:24:44.136916265Z"}
{"log":"2022-03-08T09:24:44.320+0000 I NETWORK [listener] connection accepted from 127.0.0.1:41476 #2 (1 connection now open)\n","stream":"stdout","time":"2022-03-08T09:24:44.320702224Z"}
In particular, the diff in end of log times is negative. What does a negative value signify? And what does rollbackViaRefetch mean?
Answer:
OOMKilled - The container was killed because it tried to use more memory than the limit set for it under resources.limits.memory in the pod spec.
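To confirm this from the pod status, here is a minimal sketch. It takes a pod object in the shape returned by `kubectl get pod <name> -o json` and lists the containers whose last termination reason was OOMKilled, together with their restart count and memory limit. The field names are the standard Kubernetes ones; the function name and the sample pod (container name, 2Gi limit) are hypothetical and only there for illustration.

```python
def oom_killed_containers(pod):
    """Return (name, restart_count, memory_limit) for containers last killed for exceeding memory."""
    # Memory limits live in the pod spec, keyed here by container name.
    limits = {
        c["name"]: c.get("resources", {}).get("limits", {}).get("memory")
        for c in pod["spec"]["containers"]
    }
    hits = []
    for status in pod["status"].get("containerStatuses", []):
        # lastState.terminated describes how the previous container instance ended.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status["restartCount"], limits.get(status["name"])))
    return hits


# Hypothetical sample; in practice, load the JSON printed by kubectl.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "mongodb", "resources": {"limits": {"memory": "2Gi"}}},
        ]
    },
    "status": {
        "containerStatuses": [
            {
                "name": "mongodb",
                "restartCount": 234,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    },
}

print(oom_killed_containers(sample_pod))
```

If the limit shown here is close to what mongod normally needs, raising the limit or lowering the WiredTiger cache size are the usual things to look at.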
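OplogStartMissing - The member asked its sync source for oplog entries with a timestamp greater than or equal to its own last fetched entry, and the first entry that came back was not that entry. In your log the member's last entry is from term 58, while the first entry the source returned is from term 60, so the source does not have the member's last write and the member starts a rollback. Most of the time this seems to point to your oplog being too small for how far the member has fallen behind. Try increasing it; from MongoDB 3.6 on, the replSetResizeOplog admin command can do this.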
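The numbers in your log also show how far behind the member is. The first field of an optime Timestamp is seconds since the Unix epoch, and the "diff in end of log times" line matches this member's last optime minus the sync source's last optime. A negative value therefore means this member's oplog ends earlier than the source's. A small sketch of that arithmetic, using the values from your log (the variable names are mine):

```python
from datetime import datetime, timezone

our_last_optime = 1646656464    # "our last optime" from the log
their_last_optime = 1646731479  # "their last optime" from the log

# Ours minus theirs: negative when this member is behind its sync source.
diff_seconds = our_last_optime - their_last_optime
hours_behind = round(abs(diff_seconds) / 3600, 1)

print(diff_seconds)   # -75015, the value in the log
print(hours_behind)   # 20.8

# The same two optimes as wall-clock times in UTC.
print(datetime.fromtimestamp(our_last_optime, tz=timezone.utc).isoformat())
print(datetime.fromtimestamp(their_last_optime, tz=timezone.utc).isoformat())
```

So the member's last applied write is roughly 20.8 hours older than the newest write on mongodb-2, which fits a pod that has been crash-looping instead of replicating.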
RollbackViaRefetch - From the documentation:
Nodes go into rollback if after they receive the first batch of writes from their sync source, they realize that the greater than or equal to predicate did not return the last op in their oplog. When rolling back, nodes are in the ROLLBACK state and reads are prohibited. When a node goes into rollback it drops all snapshots. The rolling-back node first finds the common point between its oplog and its sync source's oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them.
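The "find the common point, then undo back to it" step in that quote can be pictured with a toy model. This is only an illustrative sketch, not MongoDB's implementation: each oplog is a plain Python list of (timestamp, term) pairs sorted oldest first, and the function name and sample entries are hypothetical. It walks both logs backwards from the newest entry until it reaches an entry present in both, collecting the local entries that would have to be undone.

```python
def find_common_point(local_oplog, remote_oplog):
    """Return (common_entry, entries_to_undo) for two oplogs sorted oldest-first.

    Entries are (timestamp, term) tuples. entries_to_undo holds the local
    entries newer than the common point, newest first. common_entry is None
    if the two logs share no entry.
    """
    i, j = len(local_oplog) - 1, len(remote_oplog) - 1
    to_undo = []
    while i >= 0 and j >= 0:
        local, remote = local_oplog[i], remote_oplog[j]
        if local == remote:
            return local, to_undo
        if local[0] >= remote[0]:
            # Local entry is newer than (or diverges from) the remote one:
            # it is not on the sync source, so it would be rolled back.
            to_undo.append(local)
            i -= 1
        else:
            # Remote entry is newer; step back on the sync source's log.
            j -= 1
    return None, to_undo


# Hypothetical oplogs: the local member kept writing in term 58,
# while the sync source moved on to term 60.
local = [(100, 58), (110, 58), (120, 58)]
remote = [(100, 58), (115, 60), (130, 60)]

common, undo = find_common_point(local, remote)
print(common)  # (100, 58)
print(undo)    # [(120, 58), (110, 58)]
```

In this toy run the two term-58 entries after the common point are the ones the rolling-back member would need to undo before it can resume syncing.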