
Tutorial Pitfalls of different SSDs used within RAID

Intro
To be precise about the headline: "different SSDs" here means SSDs that differ in their internal architecture or subsystems, such as:
  • SSD controllers
  • SSD built-in cache
  • SSD cell technologies
and not merely the same SSD model in different capacities.
This doc is divided into two parts: a theoretical part, and an illustrative example of what poor understanding or missing information leads to when a RAID is built from different SSDs (in the sense described above).



The Theoretical part

1. First – about RAID
(simplified):

RAID combines two or more drives (up to the limit of the chosen RAID level) into a Redundant Array of Independent Disks. I list only the most essential purposes of RAID, which depend on the RAID type selected:
- Capacity. A resulting volume that is larger than any single disk available on the market.
- Speed. Acceleration of writing and/or reading.
- Protection. Keeping data available when a drive fails, and limiting errors and data loss during read and write operations. In NAS operation it matters a great deal how the software RAID is tied to the drive controller and the file system. RAID is not immune to data errors.
This document focuses on the protection purpose.
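
To illustrate the remark that RAID is not immune to data errors, here is a minimal Python sketch of a two-way mirror. It is a toy model, not how Linux md or DSM is implemented; the class `Mirror` and its methods are hypothetical names chosen for this example. It shows that a plain mirror can notice that its two copies differ, but needs an extra checksum (the job of a checksumming file system) to decide which copy is the good one.

```python
import zlib


class Mirror:
    """Toy two-way mirror (RAID 1): every block is written to both members."""

    def __init__(self):
        self.members = [{}, {}]   # block number -> bytes, one dict per drive
        self.checksums = {}       # block number -> CRC32, kept by the "file system"

    def write(self, block, data):
        for member in self.members:
            member[block] = data
        self.checksums[block] = zlib.crc32(data)

    def scrub(self, block):
        """Compare both copies; use the checksum (if any) to pick the good one."""
        copies = [member[block] for member in self.members]
        if copies[0] == copies[1]:
            return "consistent"
        expected = self.checksums.get(block)
        good = [i for i, copy in enumerate(copies) if zlib.crc32(copy) == expected]
        if len(good) == 1:
            return f"mismatch, member {good[0]} is good"
        return "mismatch, cannot tell which copy is good"


mirror = Mirror()
mirror.write(0, b"family photos")
mirror.members[1][0] = b"family phot0s"   # simulate silent corruption on one drive
print(mirror.scrub(0))                    # with a checksum the good copy is known
mirror.checksums.clear()                  # now pretend there is no checksum layer
print(mirror.scrub(0))                    # the mirror alone cannot decide
```

The second call is the interesting one: without the checksum layer, the mirror only knows that the two drives disagree.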


2. Second – about SSD subsystems (main parts of SSD drive):

2.1 SSD controller


The SSD controller can be thought of as the drive's CPU (processor). Like a processor, it contains subsystems that, in a simplified view, orchestrate the flash memory components and the SSD's I/O interfaces. The controller runs firmware for this orchestration. SSD firmware is specific to a particular drive and cannot simply be replaced with firmware for another type. Firmware is itself a frequent source of drive malfunctions. A few basic elements of the SSD controller are essential for this document:

Embedded processor – a microcontroller (a single SSD vendor may use microcontrollers from different manufacturers). Advanced users look into this when choosing an SSD.

Flash interface part – this interface is based on standards that depend on the type of NAND flash used, e.g. the Open NAND Flash Interface (ONFI).

Error Correction Code (ECC) part – for example a low-density parity-check (LDPC) code. Its purpose is to detect and correct the vast majority of errors that can affect a data block while the SSD writes it to, or reads it from, the NAND flash. Each manufacturer handles this in its own way, implemented in the SSD firmware.
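
The detect-and-correct principle behind ECC can be shown with a short Python sketch. Note the swap: this is the textbook Hamming(7,4) code, not LDPC and not what any particular SSD controller runs. Real controllers apply far stronger codes to whole flash pages. The function names `encode` and `decode` are chosen for this example only.

```python
def encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # parity check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # parity check over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # parity check over positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 = no error, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
stored = encode(data)
stored[5] ^= 1                        # simulate one bit flipping in a flash cell
print(decode(stored) == data)         # the single-bit error is corrected
```

A code like this corrects one flipped bit per codeword and silently miscorrects when two bits flip, which is one reason the strength and tuning of the ECC differ so much between controllers and NAND types.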


2.2 SSD built-in cache

The SSD built-in cache is internal to the drive: temporary storage for working data, held in the internal NAND flash memory. This cache is operated by the SSD controller and not by the host. On Synology NAS units, this cache is not managed by DSM. Each SSD controller manufacturer has its own ideas about how the SSD should operate, and the SSD firmware is adapted accordingly. This means that two SSDs from the same vendor with different cache architectures, algorithms, or firmware will not behave the same. On top of that comes a third variable in this operation, the SSD's DRAM.


2.3 SSD cell technologies

3D NAND TLC is currently the most widely used. SSDs with MLC are still available but are gradually being phased out, as are SSDs with SLC, the top category. Each of these technologies has its specifics, which fundamentally change how the SSD controller has to operate. However, some SSD controller manufacturers simplify the process and omit the ECC functionality, thereby achieving faster reading directly from the NAND flash array.


Conclusion No.1:

The use of SSD in RAID is widespread and has its logic.
Using two different SSDs, with two different SSD controllers and even different cell technologies, lays the ground for future problems that undermine the very reasons RAID is used.
Therefore, avoid such a setup in your systems (not just within NAS).
You will find many “guaranteed experts and guides” on similar topics on the Internet. The problem is that some of them belong to the 🤬, which is more likely to cause your future (current) problems.



The illustrative example

This post was prompted by a source that came to my attention. I have not read such a concentration of wrong information in a long time.

I will try to analyze the problem described in that source briefly. The problem was characterized as follows (I quote):

"… after 3 years and 9 months of uninterrupted use, my old Crucial MX500 SSD disks, which I had installed on my old DS718+ NAS in RAID 1, failed me. And it is also today that I learned two new and important things that I would very much like to share with you."

First Thing Learned: …

One of my Crucial MX500 250GB disks which was at 1% lifespan (monitored through my Synology NAS) went critical, but did not die. My other Crucial MX500 250GB disk, which was at 10% lifespan, died on me unexpectedly, without giving any signs of disk failure whatsoever. What did I learn from this experience? That a disk with 1% lifespan does not necessarily die before a disk with 10% lifespan. This goes to show there is no fixed rule when it comes to disks lifespan; but if more than three years of service have passed, you can expect an SSD disk to die on you suddenly, regardless of its lifespan."

As you well know from this forum on this topic:

- Synology has an improperly implemented SMART evaluation and monitoring. More here.

- This means that if DSM signals that your drive has "1% lifespan", it does not necessarily mean that your drive really has "1% lifespan". More here.

- Drives checked outside of Synology DSM show different lifespan values. More here.

- Synology takes a vague approach to this and only makes changes under explicit pressure, and those changes are temporary rather than comprehensive.

You can track the causes of impending disk death! Just don't rely on DSM; use appropriate tools for it. One rule always applies, however: whoever makes backups (there are many guides on this forum) has no problem with the sudden death of a disk in RAID.
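
One such tool is `smartctl` from smartmontools. Below is a minimal Python sketch, assuming the drive reports the classic ATA attribute table that `smartctl -A` prints. The function names are hypothetical, and the `SAMPLE` text is made up for illustration, not captured from either drive discussed here. Attribute names and IDs are vendor-specific, so the wear attribute in the sample is an assumption; check your own drive's output.

```python
import subprocess


def parse_attributes(text):
    """Parse the ATA attribute table printed by `smartctl -A` into a dict."""
    attrs = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("ID#"):        # header row marks the start of the table
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue                      # skip blank or non-attribute lines
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),      # normalized value
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": " ".join(fields[9:]),  # raw value, kept as text
        }
    return attrs


def read_attributes(device):
    """Run smartctl on a device (needs smartmontools and usually root)."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_attributes(result.stdout)


# Illustrative sample only; the values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21000
202 Percent_Lifetime_Remain 0x0030   090   090   001    Old_age   Offline      -       10
"""

for name, attr in parse_attributes(SAMPLE).items():
    print(name, attr["value"], attr["thresh"], attr["raw"])
```

Reading the raw attributes yourself, for example with `read_attributes("/dev/sda")` on a machine where the drive is attached, lets you compare them with what the NAS reports.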

The second nonsense I discovered in the source is:
"Second Thing Learned

Have you ever wondered if you can use different brand SSDs for your RAID 1 configuration on your Synology NAS? Today I wanted to experiment with this and I have an answer: the answer to this question is yes, you can. You can use different brand SSD disks in a RAID 1 configuration on your NAS without any issues. The only real rule to stick by is to respect the disk storage space: if you had two 250GB disks in, you replace them with two 250GB disks – so same size both. You can't put in one 250GB disk and one 240GB or 120GB disk. You can only do this if you are using SHR as your RAID management option."

For this experiment, the user used RAID1 with two SSDs:
  • HP S700 250GB
  • Samsung 870 EVO 250GB

Let's look at some details from the specification of these drives:

| Drive | Controller | DRAM cache | NAND flash used |
|---|---|---|---|
| HP S700 250GB | Silicon Motion SM2258XT | None, it's a DRAMless design | 3D NAND Flash |
| Samsung 870 EVO 250GB | Samsung MKX | Samsung 512MB Low Power DDR4 SDRAM | V-NAND 3bit MLC |

Note - vulnerabilities of the DRAMless design:
  • a higher write amplification factor, which negatively affects storage performance and durability (SSD lifespan).
  • without a DRAM cache there is a serious bottleneck for handling data within the SSD. It also impacts the stability of the SSD controller, which has to operate the SLC write cache (built-in cache) as well.
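
The write amplification factor (WAF) mentioned above is commonly defined as the amount of data physically written to the NAND divided by the amount of data the host asked to write. A minimal sketch of the arithmetic follows; the function names are hypothetical and the numbers are invented for illustration, not measurements of either drive.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """WAF = data physically written to NAND / data written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def host_writes_until_worn(nand_write_budget_tb, waf):
    """Host data (TB) that fits into a given NAND write budget at this WAF."""
    return nand_write_budget_tb / waf


# Two hypothetical drives with the same NAND write budget but different WAF.
budget_tb = 300
for label, waf in (("lower WAF", 1.5), ("higher WAF", 3.0)):
    print(label, host_writes_until_worn(budget_tb, waf), "TB of host writes")
```

With the same NAND budget, doubling the WAF halves the amount of host data the drive can accept before it wears out, which is why a higher WAF shortens SSD lifespan.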

Conclusion No.2:

DRAMless SSDs are the worst choice for use in RAID (similar to SMR HDDs). Even worse, the author's recommendation (above) contains the most inappropriate mix of SSD drives for RAID 1 (based on the table above):
  • Two absolutely different controllers working with absolutely different firmware from different vendors
  • Two different approaches to cache handling
  • Two fundamentally different NAND flash technologies for cells.
My recommendation: avoid the setup recommended there. And consider whether, after reading this document, you can trust the author's conclusion:

"everything worked as expected: the new disks I put in are marked as healthy and my data is safe. The results are in and RAID 1 with data protection works perfectly, even when you use disks from different brands. Even disks of different brands get along, without any sort of errors, so your data is perfectly safe."

Be careful. Your data is never perfectly safe. Never.
And certainly not with such a setup.

What is your view?


