Hey! Last night I ran memsql-admin register-node -y --json --force ...
and then memsql-admin start-node -y --memsql-id <ID>
against a leaf node (running from the master aggregator host, of course). When I woke up, the leaf was still in the “Recovering” state, so I called start-node
again and waited for about 10 minutes. Here’s what memsql-admin list-nodes
shows me:
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID | Role | Host | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 967171DBBC | Master | 10.0.45.3 | 3306 | Running | True | 6.8.11 | Online | | 0.0.0.0 |
| 101B4127FD | Leaf | 10.0.39.177 | 3306 | Running | True | 6.8.11 | Recovering | 1 | 0.0.0.0 |
+------------+--------+-------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
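In case it’s useful, I can also ask the leaf itself for per-database recovery state by pointing a client at the host/port from the table above (MemSQL speaks the MySQL protocol, so the stock mysql client works; I’m assuming root auth here, adjust as needed):
mysql -h 10.0.39.177 -P 3306 -u root -p -e 'SHOW DATABASES EXTENDED'
On the leaf, that lists each partition database with a State column, which I’d expect to move from recovering to online as recovery proceeds.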
And here’s what sudo tail -n 100 /mnt/raid/memsql/tracelogs/memsql.log
shows me so far:
133270214 2019-10-31 15:02:19.696 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
139790686 2019-10-31 15:02:26.217 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`ivt`
143277894 2019-10-31 15:02:29.704 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
153284812 2019-10-31 15:02:39.711 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
159675906 2019-10-31 15:02:46.102 INFO: Recovering secondary key `deft_1`.`datoms_string`.`eavt`
162415437 2019-10-31 15:02:48.841 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`eavt`
163297285 2019-10-31 15:02:49.723 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
166546699 2019-10-31 15:02:52.973 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`avt`
173305009 2019-10-31 15:02:59.731 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
174770068 2019-10-31 15:03:01.196 INFO: Recovering secondary key `deft_1`.`datoms_string`.`avt`
180796892 2019-10-31 15:03:07.223 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`vt`
183314803 2019-10-31 15:03:09.741 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
193329751 2019-10-31 15:03:19.756 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
203011969 2019-10-31 15:03:29.438 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`tt`
203347296 2019-10-31 15:03:29.773 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
208494557 2019-10-31 15:03:34.921 INFO: Recovering secondary key `deft_1`.`datoms_string`.`vt`
208868212 2019-10-31 15:03:35.294 INFO: Recovering secondary key `deft_2`.`datoms_ref`.`p`
212574665 2019-10-31 15:03:39.001 INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_short`
212574729 2019-10-31 15:03:39.001 INFO: Recovering secondary key `deft_2`.`datoms_short`.`ieavt`
212575379 2019-10-31 15:03:39.001 INFO: Recovering secondary key `deft_2`.`datoms_short`.`iavt`
212575804 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`ivt`
212576035 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`eavt`
212576386 2019-10-31 15:03:39.002 INFO: Recovering secondary key `deft_2`.`datoms_short`.`avt`
212576803 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`vt`
212577053 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`tt`
212577469 2019-10-31 15:03:39.003 INFO: Recovering secondary key `deft_2`.`datoms_short`.`p`
212577863 2019-10-31 15:03:39.004 INFO: Preparing to recover secondary keys for table `deft_2`.`datoms_string`
212765739 2019-10-31 15:03:39.192 INFO: Recovering secondary key `deft_2`.`datoms_string`.`ieavt`
213354939 2019-10-31 15:03:39.781 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
220860837 2019-10-31 15:03:47.287 INFO: Recovering secondary key `deft_2`.`datoms_string`.`iavt`
223367782 2019-10-31 15:03:49.794 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
233382269 2019-10-31 15:03:59.808 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
243391441 2019-10-31 15:04:09.817 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
245806903 2019-10-31 15:04:12.233 INFO: Recovering secondary key `deft_2`.`datoms_string`.`ivt`
253402606 2019-10-31 15:04:19.829 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 2 out of 4
…
295026679 2019-10-31 15:05:01.453 INFO: Clearing snapshot version for database `deft_1` as it is about to go online.
295062983 2019-10-31 15:05:01.489 INFO: RecoverDatabaseThread[deft_1]: End
295483958 2019-10-31 15:05:01.910 INFO: STOP REPLICATING deft_1
295484017 2019-10-31 15:05:01.910 INFO: Thread 115083: StopSlaveThreads: Stopping slave thread for database `deft_1`.
295484070 2019-10-31 15:05:01.910 INFO: Completed STOP REPLICATING deft_1
295484093 2019-10-31 15:05:01.910 INFO: DB deft_1: has original Id 11395525453320891103.
295484104 2019-10-31 15:05:01.910 INFO: PROMOTE ALL LOCKS RELEASED deft_1
303494218 2019-10-31 15:05:09.920 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4
…
423633071 2019-10-31 15:07:10.059 ERROR: Thread 115068: Fn: Some error happened during Replication Management. Processed 3 out of 4
426466282 2019-10-31 15:07:12.892 INFO: Recovering secondary key `deft_2`.`datoms_string`.`p`
430241989 2019-10-31 15:07:16.668 INFO: [deft_2] Scanning columnstore segments: Completed, elapsed: 0 ms, disk: 0 ms
430242484 2019-10-31 15:07:16.668 INFO: Clearing snapshot version for database `deft_2` as it is about to go online.
430279079 2019-10-31 15:07:16.705 INFO: RecoverDatabaseThread[deft_2]: End
430279189 2019-10-31 15:07:16.705 INFO: Completed recovery of user databases.
430640805 2019-10-31 15:07:17.067 INFO: STOP REPLICATING deft_2
430640881 2019-10-31 15:07:17.067 INFO: Thread 115080: StopSlaveThreads: Stopping slave thread for database `deft_2`.
430640930 2019-10-31 15:07:17.067 INFO: Completed STOP REPLICATING deft_2
430640977 2019-10-31 15:07:17.067 INFO: DB deft_2: has original Id 11395525453320891103.
430640995 2019-10-31 15:07:17.067 INFO: PROMOTE ALL LOCKS RELEASED deft_2
The logs do show repeated errors, but it looks like there has been forward progress of some sort. However, memsql-admin list-nodes still shows the leaf as “Recovering”, and the log hasn’t been written to in 8 minutes.
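One more data point I can gather: since the leaf hosts partition databases named deft_1 and deft_2, I’m guessing the user database is called deft. Assuming that (and root auth again), I can ask the master aggregator where each partition lives and in what role:
mysql -h 10.0.45.3 -P 3306 -u root -p -e 'SHOW PARTITIONS ON deft'
If the leaf’s partitions show up there with a Host and Role assigned, the cluster at least considers them attached.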
My first thought is that the data is corrupted, so my outlook is pretty grim at the moment. Luckily I do have backups, though they were potentially taken slightly after the last known good state; we’ll see. If it comes to a restore, my fallback plan is roughly the sketch below.
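A minimal sketch of that fallback, run against the master aggregator and assuming the user database is deft and the backup lives at /mnt/raid/backups/deft (both placeholders for my actual names and paths). RESTORE refuses to overwrite an existing database, so the damaged one has to be dropped first:
mysql -h 10.0.45.3 -P 3306 -u root -p -e 'DROP DATABASE deft'
mysql -h 10.0.45.3 -P 3306 -u root -p -e 'RESTORE DATABASE deft FROM "/mnt/raid/backups/deft"'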
Any insight you can bring to bear on this would be extremely appreciated, since we’re dead in the water until we fix this! Thanks!