Storage Pool Crash - Advice Needed

Hello all,

My RS1221+ suffered a Storage Pool crash today, so I'm looking for advice on how to recover it, as the symptoms of this one are new to me.

Configuration:

Storage Pool 1, Volume 1 = 6 x SSDs in SHR1
Storage Pool 2, Volume 2 = 2 x HDDs in SHR (mirrored backup of Volume 1)

Event:

Overnight the grid power was going nuts for a while and so the UPS was kicking in and out repeatedly. Eventually DSM threw an error on SSD6 as it happened to be doing a 6-monthly data scrub at the time and I guess the constant commands from the UPS were just too much.

No drama, (I thought this morning) I just clicked on the 'repair' and last time I looked it was almost done. Easy.

Then the grid went wild again later in the day and after constant UPS interruptions DSM spat-out SSD2. Ouch. All services stopped, containers, backups, home assistant, the lot. At least I got the multiple warning emails.

The bit that is new to me is that when I try to access the NAS I get as far as the login screen, then the password screen, before it just spins for minutes and then dumps me to a blank white page.

I'm not so worried about the data on Storage Pool 1, but if I can still recover it from where it resides that would be ideal. Otherwise, using the backups stored in Storage Pool 2 (last 2 drives in the same device) would be reasonably quick across the backplane, if I could get into DSM. Oddly, I can still see the files on the 'bad' storage pool (Pool 1) over SMB.

The last resort would be starting afresh with both storage pools (ie losing the backups on the 2xHDDs in Storage Pool 2, located in the last 2 bays) and recovering everything from scratch from my final backup location, over a slow network connection.

Basically I want to get DSM working again on Storage Pool / Volume 1.

Any ideas from the wise?

☕


[Yes, losing Home Assistant is causing panic etc...]


 
When all is said & done, with your drives….
You really should look at UPS. No matter how many times your power intermittently drops and returns, as long as each drop lasts seconds, not minutes, your UPS should handle frequent random drops. As long as we are not talking 10 x 5-minute drops, your symptoms sound like weakened UPS batteries.
I developed a simple afternoon project: a plug-in LED flasher for the UPS, to get around the “not knowing when or if” this has happened. It doesn’t prevent the UPS from failing, but if you lose AC power at the UPS output even once, you know for certain. Unfortunately, when a UPS reboots, it keeps no memory of having just failed because its batteries are weak! How old are your UPS batteries?
 
When all is said & done, with your drives….
You really should look at UPS.

Thanks but really looking for good DSM recovery advice, not UPS. The UPS is fine and, as said, I think the NAS got confused with so many UPS commands, so nothing to do with the UPS power delivery.

At the moment my NAS is just spinning on this page:

[Screenshot: 2026-05-17 at 09.51.16.png]


☕
 
Hi Rusty,

SSH port is disabled. Usually I only enable it when needed... which suddenly feels like a bad move.

I'm presuming DSM itself is corrupted on the failed SSD array but probably intact on the HDD array. Kinda need a way to repair DSM itself or boot from a good version.

☕
 
Ok, through power cycles and the dreaded paperclip I'm back into DSM. For whatever reason DSM wanted to check volume 2 first, so that will be running a 'repair' through to tomorrow, at the very least:

[Screenshot: 2026-05-17 at 17.09.11.png]


Allegedly 2 x SSDs are 'Critical' on Storage Pool / Volume 1 but good enough for services and packages to run, which is a bit of a surprise.

Hopefully I can find a way to turn the 'Critical' SSDs back to normal in due course.

Didn't need any of this today as the wife is ill.

☕

Oh and I turned the SSH access back on - just in case!
 
Glad to hear you are back on track. Still, I fear you will need both SSDs replaced with multiple repairs in between. There is a risk of another SSD failing, so it's good that volume 2 is still going strong.

I do hope that Rivendell will have peace and serenity once again and not turn into Mordor in the coming days.
 
Bag End is reasonably peaceful right now and I am sure I can get the SSDs to work just fine; if I can find a way of convincing or neuralyzing DSM's memory of the 'critical' failures then all will be well.

☕
 
You can try deleting the directory containing the information about the specific SSD while the Storage Manager GUI is still open.

Via SSH as root, like this:

rm -R /run/synostorage/disks/"devicename"

You will find the "devicename" with:

cat /proc/mdstat
(Let us see this output, too.)
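Since the rm above is not reversible, here is a cautious sketch of the same step that copies the directory aside first. The function name forget_disk_state and the backup location are made up for illustration, and it only assumes the directory layout named above; it does not know anything else about how DSM uses those files.

```shell
#!/bin/sh
# Back up one disk's state directory, then remove it.
#   $1 = base directory (for example /run/synostorage/disks, as in the advice above)
#   $2 = device name    (for example sata3; take it from your own system)
#   $3 = backup directory (hypothetical, pick any writable location)
forget_disk_state() {
    base=$1
    dev=$2
    backup=$3

    # Refuse an empty or path-like device name, so we never remove the base itself.
    case "$dev" in
        ""|*/*) echo "invalid device name: '$dev'" >&2; return 1 ;;
    esac

    target="$base/$dev"
    [ -d "$target" ] || { echo "no such directory: $target" >&2; return 1; }

    # Keep a copy before deleting anything.
    mkdir -p "$backup" || return 1
    cp -Rp "$target" "$backup/" || return 1

    rm -R "$target"
}
```

It would be called as root along the lines of: forget_disk_state /run/synostorage/disks sata3 /root/synostorage-backup (both the device name and the backup path are examples only).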


Good Speed!

Otherwise, check the disk's SMART values with:
smartctl -x -d sat /dev/"devicename"
(use sat for SATA, scsi for SAS)
 
Thanks, as my previous technique of deleting the sqlite logs for the drives shown as critical didn't work this time.

Code:
root@Rivendell:~# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raidF1]
md2 : active raid5 sata1p5[5] sata4p5[6] sata2p5[4] sata8p5[3] sata6p5[7]
      4859657280 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]
      
md3 : active raid5 sata3p6[0] sata1p6[4] sata2p6[3] sata8p6[2] sata6p6[6]
      14651234560 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]
      
md4 : active raid1 sata5p6[2] sata7p6[1]
      3901086976 blocks super 1.2 [2/2] [UU]
      
md5 : active raid1 sata5p5[2] sata7p5[3]
      17573494400 blocks super 1.2 [2/2] [UU]
      
md1 : active raid1 sata1p2[0] sata7p2[5] sata2p2[4] sata8p2[3] sata6p2[2] sata5p2[1]
      2097088 blocks [8/6] [UUUUUU__]
      
md0 : active raid1 sata1p1[0] sata7p1[5] sata2p1[4] sata8p1[3] sata6p1[2] sata5p1[1]
      2490176 blocks [8/6] [UUUUUU__]
      
unused devices: <none>
root@Rivendell:~#

The 2 SSDs impacted are bays 2 and 6. Not entirely sure how to ID the correct "devicename" in the output above.

☕
 
Okay, then install two software packages first, if applicable.
- synocli-disk
package source is the community
- SynoSmartInfo from here:

Once you run SynoSmartInfo as a new application, you can compare each serial number with the devicename shown there.
The serial number of the problematic HDD/SSD can be looked up in Storage Manager.
 
The SynoSmartInfo application also displays the model of each device. Which models are they, and do you know for sure whether they are SATA or SAS?
 
The new command for a complete SMART listing would therefore be:
smartctl7 -x -d sat /dev/sataX
or
smartctl7 -x -d scsi /dev/sataX
X = device number
 
md2 : active raid5 sata1p5[5] sata4p5[6] sata2p5[4] sata8p5[3] sata6p5[7]
4859657280 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [U_UUUU]

md3 : active raid5 sata3p6[0] sata1p6[4] sata2p6[3] sata8p6[2] sata6p6[6]
14651234560 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]
Here, one drive is missing from each raid5 array, of volume1 and volume2.
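To see which disk has dropped out of which array, the member lists in /proc/mdstat can be compared against the disks you expect in the pool. The sketch below does that with awk. The function name missing_members is made up, and the list of expected disks is something you supply yourself. Going by the mdstat output posted in this thread, the six SSDs appear to be sata1, sata2, sata3, sata4, sata6 and sata8, with sata5 and sata7 forming the raid1 arrays, but that is an inference, not something DSM reports.

```shell
#!/bin/sh
# For each degraded md array, print the expected member disks that are absent.
# Reads /proc/mdstat-style text on stdin; expected disk names are the arguments.
missing_members() {
    awk -v expected="$*" '
        # Array header, e.g. "md2 : active raid5 sata1p5[5] sata4p5[6] ..."
        /^md[0-9]+ :/ {
            name = $1
            split("", present)                  # reset the set of present disks
            for (i = 5; i <= NF; i++) {
                dev = $i
                sub(/p[0-9]+\[.*$/, "", dev)    # "sata1p5[5]" -> "sata1"
                present[dev] = 1
            }
            next
        }
        # Status line, e.g. "... [6/5] [U_UUUU]"; "_" marks an empty slot.
        name != "" && /\[[U_]+\]$/ {
            if ($NF ~ /_/) {
                n = split(expected, want, " ")
                for (j = 1; j <= n; j++)
                    if (!(want[j] in present))
                        print name ": missing " want[j]
            }
            name = ""
        }
    '
}
```

Typical use would be: missing_members sata1 sata2 sata3 sata4 sata6 sata8 < /proc/mdstat. Note that md0 and md1 on this unit are defined with 8 slots, so they always show "_" entries and will list any expected disk that is not in the system partitions.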
 
Installed the package - thank you. 👍

Code:
Drive 2  WDC WDS400T2B0A-00SM50  1926A4800885  /dev/sata3
SMART overall-health self-assessment test result: PASSED
SMART Error Counter Log:         No Errors Logged
  5 Reassigned_NAND_block_count  0
  9 Power-On_Count               55837
187 Reported_Uncorrectable_Error 0
188 Command_Timeout_Count        95
194 Temperature                  24
199 CRC_Error_Count              0

Code:
Drive 6  WDC WDS400T2B0A-00SM50  2052F5420364  /dev/sata4
SMART overall-health self-assessment test result: PASSED
SMART Error Counter Log:         No Errors Logged
  5 Reassigned_NAND_block_count  0
  9 Power-On_Count               38317
187 Reported_Uncorrectable_Error 0
188 Command_Timeout_Count        4
194 Temperature                  24
199 CRC_Error_Count              0

So looks like sata3 and sata4

Does that make the "devicename" /dev/sata3?

☕
 
The WDS400T2B0A are SATA devices.

smartctl7 -x -d sat /dev/sata3
smartctl7 -x -d sat /dev/sata4
 
I am puzzled, since SATA4 with Partition 5 is still part of Volume1, and SATA3 with Partition 6 is still part of Volume2.
 
The output:

Code:
root@Rivendell:~# smartctl -x -d sat /dev/sata3
smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     WD Blue
Device Model:     WDC  WDS400T2B0A-00SM50
Serial Number:    1926A4800885
LU WWN Device Id: 5 001b44 8b8af4260
Firmware Version: 411030WD
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   Unknown(0x0ff0), ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May 18 14:11:42 2026 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x11) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reassigned_NAND_block_count                                      -O--CK   100   100   ---    -    0
  9 Power-On_Count                                                   -O--CK   100   100   ---    -    55837
 12 Drive_Power_Cycle_Count                                          -O--CK   100   100   ---    -    37
169 Total_Bad_Blocks                                                 -O--CK   100   100   ---    -    2976
170 Grown_Bad_Blocks                                                 -O--CK   100   100   ---    -    0
171 Program_Fail_Count                                               -O--CK   100   100   ---    -    0
172 Erase_Fail_Count                                                 -O--CK   100   100   ---    -    0
174 Unexpected_Power_Loss_Count                                      -O--CK   100   100   ---    -    10
187 Reported_Uncorrectable_Error                                     -O--CK   100   100   ---    -    0
188 Command_Timeout_Count                                            -O--CK   100   100   ---    -    95
194 Temperature                                                      -O---K   076   052   ---    -    24 (Min/Max 19/52)
199 CRC_Error_Count                                                  -O--CK   100   100   ---    -    0
230 Media_Wearout_Indicator                                          -O--CK   005   005   ---    -    5557726479630
232 Remaining_Lifetime_Perc                                          PO--CK   100   100   004    -    100
233 NAND_GB_Written_TLC                                              -O--CK   100   100   ---    -    117154
234 NAND_GB_Written_SLC                                              -O--CK   100   100   ---    -    160572
241 Total_GB_Written                                                 ----CK   253   253   ---    -    98867
242 Total_GB_Read                                                    ----CK   253   253   ---    -    479822
244 Temperature_Throttle_Status                                      -O--CK   000   100   ---    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      2  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x24       GPL     R/O   2789  Current Device Internal Status Data log
0x25       GPL     R/O   2789  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xde       GPL     VS       8  Device vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     55817         -
# 2  Short offline       Completed without error       00%     55816         -
# 3  Short offline       Completed without error       00%     54365         -
# 4  Short offline       Completed without error       00%     54197         -
# 5  Short offline       Completed without error       00%     54029         -
# 6  Short offline       Completed without error       00%     53861         -
# 7  Short offline       Completed without error       00%     53693         -
# 8  Short offline       Completed without error       00%     53525         -
# 9  Short offline       Completed without error       00%     53357         -
#10  Short offline       Completed without error       00%     52181         -
#11  Short offline       Completed without error       00%     52013         -
#12  Short offline       Completed without error       00%     51845         -
#13  Short offline       Completed without error       00%     51677         -
#14  Short offline       Completed without error       00%     51509         -
#15  Short offline       Completed without error       00%     51005         -
#16  Short offline       Completed without error       00%     50836         -
#17  Short offline       Completed without error       00%     50668         -
#18  Short offline       Completed without error       00%     50500         -
#19  Short offline       Completed without error       00%     50332         -

Selective Self-tests/Logging not supported

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              37  ---  Lifetime Power-On Resets
0x01  0x010  4           55837  ---  Power-on Hours
0x01  0x018  6    207340659546  ---  Logical Sectors Written
0x01  0x020  6      3253887961  ---  Number of Write Commands
0x01  0x028  6   1006261167601  ---  Logical Sectors Read
0x01  0x030  6    148633985860  ---  Number of Read Commands
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               2  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x000a  4            2  Device-to-host register FISes sent due to a COMRESET

root@Rivendell:~#
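The full listing is long, so a small filter helps when comparing drives. This sketch pulls selected attribute rows out of smartctl -x output; the function name smart_pick is made up. It assumes the column layout shown above (ID, name, flags, value, worst, thresh, fail, raw), so it will not read the raw value correctly from the older smartctl -A table, which has more columns.

```shell
#!/bin/sh
# Print ID, name and raw value for selected SMART attributes.
# Reads `smartctl -x` output on stdin; attribute IDs are the arguments.
smart_pick() {
    awk -v ids="$*" '
        BEGIN {
            n = split(ids, a, " ")
            for (i = 1; i <= n; i++) want[a[i]] = 1
        }
        # Attribute rows start with a numeric ID followed by a name;
        # in the -x layout the raw value is the 8th field.
        $1 ~ /^[0-9]+$/ && ($1 in want) && $2 ~ /^[A-Za-z]/ {
            print $1, $2, $8
        }
    '
}
```

For example, smartctl -x -d sat /dev/sata3 | smart_pick 174 188 199 would show only the unexpected power loss, command timeout and CRC error counters from the listing above.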
 
To be sure, one screenshot from the StorageManager HDD/SSD display, please.
 