Here is my mongo cluster (sharding with replicaset) configuration.
replica sets:
rs0 - IP1, IP2, IP3 || port - 27017
rs1 - IP4, IP5, IP6 || port - 27017
config server replica set - IP7, IP8, IP9 || port - 26017
mongos - IP7, IP8, IP9 || port - 26000
This is a test setup, and it was configured using IP addresses (not hostnames). Unfortunately, all hosts went down during maintenance, and all host IPs had changed when we brought the nodes back up. As expected, the replica set members (mongod), the config servers (mongod) and mongos didn't come up because the configured IP addresses were unreachable.
To bring up the setup, I did the following:
- Updated replica set host IP addresses following https://www.mongodb.com/docs/v4.2/tutorial/change-hostnames-in-a-replica-set/
- Updated the config server replica set host IPs following the same MongoDB document. Started the mongod services without sharding.
- Couldn't find any proper documentation on changing the config server and mongos IP addresses/hostnames. On the config server replica set, I updated the "shards" collection in the config db.
cfg1 = db.shards.findOne( { "_id": "rs0" } )
cfg1.host = "rs0/new_IP1:27017,new_IP2:27017,new_IP3:27017"
db.shards.update({ "_id" : "rs0" } , cfg1 )
cfg2 = db.shards.findOne( { "_id": "rs1" } )
cfg2.host = "rs1/new_IP4:27017,new_IP5:27017,new_IP6:27017"
db.shards.update({ "_id" : "rs1" } , cfg2 )
- Started config server and mongos properly.
- Now I am restarting the replica set members to make use of sharding. However, the replica set mongod processes are not starting, and the log shows references to the old config server replica set IPs. This is the error I am getting in mongod.log:
2022-05-17T21:20:39.654 0530 W SHARDING [initandlisten] Error initializing sharding state, sleeping for 2 seconds and trying again :: caused by :: FailedToSatisfyReadPreference: Error loading clusterID :: caused by :: Could not find host matching read preference { mode: "nearest" } for set csrs
2022-05-17T21:20:40.154 0530 I ASIO [ReplicaSetMonitor-TaskExecutor] Connecting to x.x.x.x:26017
2022-05-17T21:20:41.655 0530 I ASIO [ReplicaSetMonitor-TaskExecutor] Connecting to y.y.y.y:26017
2022-05-17T21:20:42.660 0530 I ASIO [ReplicaSetMonitor-TaskExecutor] Failed to connect to z.z.z.z:26017 - HostUnreachable: Error connecting to 10.0.13.206:26017 :: caused by :: No route to host
I couldn't find any help on the web for recovering from this scenario. I am requesting assistance in recovering the setup without losing any data, as we have loaded TBs of data onto this cluster.
CodePudding user response:
I ran this procedure as a test on my local machine. It seemed to work, but I cannot guarantee anything.
- Stop all mongod/mongos services on all nodes
mongod Config ReplicaSet
- Start one mongod config server in maintenance mode
- Drop the local database
- Update config.shards
- Shut down mongod
- Delete the dbPath of all other config servers
- Start all mongod config servers
- Connect to the first mongod config server
- Initiate the ReplicaSet
Example (Windows style):
SET MAINTENANCE_LOG=--logpath C:\MongoDB\log\mongo_maintenance.log --logappend
SET MAINTENANCE_NET=--bind_ip localhost --port 55555
SET MAINTENANCE_MISC=--setParameter skipShardingConfigurationChecks=true --setParameter disableLogicalSessionCacheRefresh=true
start mongod --dbpath C:\MongoDB\data\mongocfg_1 %MAINTENANCE_LOG% %MAINTENANCE_MISC% %MAINTENANCE_NET%
mongo --norc localhost:55555/admin
db.getSiblingDB('local').dropDatabase()
db.getSiblingDB('config').getCollection("shards").updateOne(
{_id : "shard_01"},
{$set: {host: "shard_01/<new_IP:port>,<new_IP:port>" }}
)
db.getSiblingDB('config').getCollection("shards").updateOne(
{_id : "shard_02"},
{$set: {host: "shard_02/<new_IP:port>,<new_IP:port>" }}
)
db.getSiblingDB('config').getCollection("shards").updateOne(
{_id : "shard_03"},
{$set: {host: "shard_03/<new_IP:port>,<new_IP:port>" }}
)
db.getSiblingDB('admin').shutdownServer()
exit
rmdir /S /Q C:\MongoDB\data\mongocfg_2
mkdir C:\MongoDB\data\mongocfg_2
rmdir /S /Q C:\MongoDB\data\mongocfg_3
mkdir C:\MongoDB\data\mongocfg_3
net start MongoDB_Config_1
net start MongoDB_Config_2
net start MongoDB_Config_3
mongo "mongodb://user:password@localhost:27029/admin?authSource=admin"
rs.initiate(
{
_id: "configRepSet",
configsvr: true,
members: [
{ _id: 0, host: "<new_IP:port>", priority: 10 },
{ _id: 1, host: "<new_IP:port>", priority: 5 },
{ _id: 2, host: "<new_IP:port>", priority: 5 }
]
}
)
rs.status()
while (! db.hello().isWritablePrimary ) { sleep(1000) }
exit
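Once config.shards has been updated, it may be worth checking that no shard entry still points at a pre-maintenance address. The following is a minimal sketch, not an official tool: `findStaleShards` is a hypothetical helper, the `<old_IP…>` values are placeholders, and it assumes plain `host:port` members (no IPv6 literals).

```javascript
// Hypothetical helper (not part of MongoDB): returns the shard documents
// whose `host` string still mentions one of the old addresses.
function findStaleShards(shardDocs, oldAddresses) {
  return shardDocs.filter(function (doc) {
    // host looks like "rs0/ip1:27017,ip2:27017,ip3:27017"
    var members = String(doc.host).split("/").pop().split(",");
    return members.some(function (member) {
      var address = member.split(":")[0]; // strip the port
      return oldAddresses.indexOf(address) !== -1;
    });
  });
}

// In the mongo shell `db` is defined; under plain Node.js this part is skipped.
if (typeof db !== "undefined") {
  var docs = db.getSiblingDB("config").getCollection("shards").find().toArray();
  // Replace the placeholders with the real pre-maintenance addresses.
  printjson(findStaleShards(docs, ["<old_IP1>", "<old_IP2>"]));
}
```

An empty array means every shard entry already uses the new addresses.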
mongod Shard ReplicaSet
Repeat below for each shard
- Start one mongod shard server (preferably the former PRIMARY) in maintenance mode
- Drop the local database
- Update admin.system.version
- Shut down mongod
- Delete the dbPath of all other shard servers
- Start all mongod shard servers
- Connect to the first mongod shard server
- Initiate the ReplicaSet
Example (Windows style):
SET MAINTENANCE_LOG=--logpath C:\MongoDB\log\mongo_maintenance.log --logappend
SET MAINTENANCE_NET=--bind_ip localhost --port 55555
SET MAINTENANCE_MISC=--setParameter skipShardingConfigurationChecks=true --setParameter disableLogicalSessionCacheRefresh=true
start mongod --dbpath C:\MongoDB\data\mongoshard_1prim %MAINTENANCE_LOG% %MAINTENANCE_MISC% %MAINTENANCE_NET%
mongo --norc localhost:55555/admin
db.getSiblingDB('local').dropDatabase()
db.getSiblingDB('admin').getCollection("system.version").updateOne(
{_id : "shardIdentity"},
{$set: { configsvrConnectionString: "configRepSet/<new_IP:port>,<new_IP:port>,<new_IP:port>" }}
)
db.getSiblingDB('admin').shutdownServer()
exit
rmdir /S /Q C:\MongoDB\data\mongoshard_1sec
mkdir C:\MongoDB\data\mongoshard_1sec
rmdir /S /Q C:\MongoDB\data\mongoshard_1arb
mkdir C:\MongoDB\data\mongoshard_1arb
net start MongoDB_Shard_1prim
net start MongoDB_Shard_1sec
net start MongoDB_Shard_1arb
mongo "mongodb://user:password@localhost:37028/admin?authSource=admin"
rs.initiate(
{
_id: "shard_01",
members: [
{ _id: 0, host: "<new_IP:port>", priority: 10 },
{ _id: 1, host: "<new_IP:port>", priority: 5 },
{ _id: 2, host: "<new_IP:port>", arbiterOnly: true }
]
}
)
rs.status()
while (! db.hello().isWritablePrimary ) { sleep(1000) }
exit
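The `while (! db.hello().isWritablePrimary )` loop used in both examples never gives up if the election fails. A bounded variant could look like the sketch below. `waitForPrimary` is a hypothetical helper, and the 60 attempts and 1000 ms delay are arbitrary assumptions, not recommended values.

```javascript
// Hypothetical helper: polls isPrimary() until it returns true or the
// attempt budget is used up. The sleep function is injected so the helper
// works with the shell's built-in sleep(ms).
function waitForPrimary(isPrimary, sleepFn, maxAttempts, delayMs) {
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    if (isPrimary()) {
      return attempt; // number of polls it took
    }
    sleepFn(delayMs);
  }
  throw new Error("no writable primary after " + maxAttempts + " attempts");
}

// In the mongo shell `db` and `sleep` are defined; under Node.js this is skipped.
if (typeof db !== "undefined") {
  waitForPrimary(function () { return db.hello().isWritablePrimary; }, sleep, 60, 1000);
}
```

If the helper throws, `rs.status()` is the first place to look for the reason.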
mongos Router
This one is the simplest part.
- Edit the mongos config file and set the new sharding.configDB string
- Start mongos
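For illustration, the new sharding.configDB value has the form `<replSetName>/<host:port>,<host:port>,...`. The sketch below builds that string and prints the matching lines for a YAML mongos config file. It is plain JavaScript for Node.js; `buildConfigDB` is a hypothetical helper, and the set name and 192.0.2.x addresses are placeholders, not values from this cluster.

```javascript
// Hypothetical helper: builds "<replSetName>/<host:port>,<host:port>,..."
function buildConfigDB(replSetName, members) {
  if (!replSetName || members.length === 0) {
    throw new Error("replica set name and at least one member are required");
  }
  members.forEach(function (member) {
    // Each member must be host:port with a numeric port.
    if (!/^[^\s:\/,]+:\d+$/.test(member)) {
      throw new Error("expected host:port, got: " + member);
    }
  });
  return replSetName + "/" + members.join(",");
}

// Placeholder values only; substitute the real set name and new addresses.
var configDB = buildConfigDB("configRepSet", [
  "192.0.2.11:27019",
  "192.0.2.12:27019",
  "192.0.2.13:27019"
]);
// Prints the two lines that would go into the mongos config file.
console.log("sharding:\n  configDB: " + configDB);
```

The validation rejects leftover placeholders such as `<new_IP:port>`, which are easy to forget when copying a template.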
CodePudding user response:
The issue is solved now.
The final piece of the puzzle was finding where the config server connection info is saved in the replica set mongod instances. It's in the system.version collection under the admin db. I had to follow these steps:
- Started mongod on all replica set members with security authorization, replication and sharding disabled, after making the necessary changes in the config file.
- Under the admin db, the following two documents in system.version had the config server connection string.
db.system.version.find( {"_id" : { $in : [ "shardIdentity" , "minOpTimeRecovery" ]} })
- Updated both documents with the new config server connection string via the db.system.version.update command.
- Shut down the mongod processes and re-enabled security authorization, replication and sharding in the mongod config file.
- Successfully started replica set mongod instances.
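The update step above might look roughly like the following sketch. It assumes both documents store the string in a field named `configsvrConnectionString` (the field shown for shardIdentity in the other answer; its presence in minOpTimeRecovery is an assumption to verify with the find above). The set name `csrs` comes from the log in the question, the `<new_IP:port>` members are placeholders, and `buildConfigsvrUpdate` is a hypothetical helper.

```javascript
// Hypothetical helper: builds the filter and update documents for the
// two system.version entries that carry the config server string.
function buildConfigsvrUpdate(newConnectionString) {
  return {
    filter: { _id: { $in: ["shardIdentity", "minOpTimeRecovery"] } },
    update: { $set: { configsvrConnectionString: newConnectionString } }
  };
}

// In the mongo shell `db` is defined; under plain Node.js this part is skipped.
if (typeof db !== "undefined") {
  // Placeholder string; substitute the real new config server addresses.
  var op = buildConfigsvrUpdate("csrs/<new_IP:port>,<new_IP:port>,<new_IP:port>");
  var coll = db.getSiblingDB("admin").getCollection("system.version");
  printjson(coll.updateMany(op.filter, op.update));
  // Show the documents again to confirm the new string is in place.
  printjson(coll.find(op.filter).toArray());
}
```

This would have to be run on every replica set member while it is started standalone, as described in the first step.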
Note: I am new to MongoDB and not sure whether we should be making changes to internal system collections. Since it was a test setup, I took the risk, and the experiments paid off. This is not recommended in a production environment, as a successful resolution can't be guaranteed.