always happy when reliability data is shared, for what it is worth

Any reliability data is welcome, as it is not often that this kind of data is actually shared. In this case there is a bit of info on HDDs/SSDs alongside other devices/parts.

As always, be careful about sample size (not shared in this case for most devices) and possible commercial interests.

I have a story about a specific Seagate drive (model & firmware) that I could share, but I'd like to confirm you want it, as it's kinda long…
No, not for me, please. Good that you ask...
Everybody, including myself, has stories about a disk that lasted twenty years and another one that died in a week. Some people are even sure that an entire brand is rubbish because of a single failure.

I am only interested in data that is well collected and documented and has many data points (>1000), because the big numbers tell the story, not the exceptions. A professional tic from a lean Six Sigma addict.
This will, as usual, be a long post, for people who want to know more.
You will learn some interesting context for this report, and maybe when you read it again, you will see it differently.

The report is interesting for the average reader because it is not common to see this kind of data published regularly, and that's okay. When quick headlines matter most to the masses, this is a precisely targeted type of report, and we have various portals literally focused on that group, providing easy-to-digest data. Then there is a small group of people looking for something different: a carefully elaborated story, something like the books of Agatha Christie or Orson Scott Card. This group of people needs to understand why things are happening, not just know that they happened. I have this deviation too.

A quick exercise at the end of the workweek.
For me, Puget Systems (the company) is an unknown brand, which is perfectly fine for many reasons (different country, …). However, I have read about them on the internet. They seem to do an excellent job with their custom-built computers. They also do not appear to focus on the low-cost mass market, but on medium- to high-end users and small businesses. This must be taken into account when interpreting their data.

Basic segmentation of their product portfolio: they build desktops and servers. I have no idea what the exact split between these two segments is. Experts recognize that the operating conditions and workloads of servers may differ from those of desktops, which has a logical impact on the lifespan of the components in those boxes and on their failures. I miss this segmentation in the report. A pity, if not a mistake.

The sample size
As mentioned in the first post, the report contains no information about the sample from which it was created, so the reader lacks essential context. It makes a difference whether the sample contains 100 devices or 200,000 devices. I have seen many such charts, and too few people in meetings raise their hands and ask for this information.

How can we extrapolate how many PCs this company can produce per year?
They build high-quality custom hardware, so a single computer should be assembled by one person to avoid assembly issues. Custom computer assembly is not custom car production (although there are similarities), and this affects the assembly time.

PC assembly staff (my assumptions):
- 18 HW assembly staff (per the company website)
- Max utilization 8 h per day over 220 labour days (250 total minus 30 days for leave/sick days)
- One PC assembly time = 90 minutes (a speedy time, realistic rather for cheap and simple builds)
Total assembled per year = 18 × (8×60/90) × 220 = 21,120 custom-built PCs at 100% assembly staff utilization.

Even if we assume people work 250 working days (no leave, sick days, ...) and reduce the assembly time to 60 minutes(!), we get 36k computers per year at 100% staff utilization. Since we know there is no such thing in this segment, and considering the unavailability of many parts due to the silicon crisis, I decided to use 75% utilization, which is about 27k computers annually.
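The capacity arithmetic above can be sketched as a small Python helper. All inputs (staff count, hours, labour days, minutes per build) are the post's own assumptions, not Puget Systems figures:

```python
def annual_assembly_capacity(staff, hours_per_day, labour_days,
                             minutes_per_pc, utilization=1.0):
    """PCs assembled per year: total available staff minutes / minutes per PC."""
    total_minutes = staff * hours_per_day * 60 * labour_days * utilization
    return total_minutes / minutes_per_pc

# Conservative scenario: 90 min per PC, 220 labour days, 100% utilization
base = annual_assembly_capacity(18, 8, 220, 90)             # 21,120
# Optimistic scenario: 60 min per PC, no leave/sick days (250 days)
optimistic = annual_assembly_capacity(18, 8, 250, 60)       # 36,000
# Same optimistic throughput at a more realistic 75% utilization
realistic = annual_assembly_capacity(18, 8, 250, 60, 0.75)  # 27,000
```

Changing any single assumption (utilization, minutes per build) scales the result linearly, which is why the final estimate below is so sensitive to them.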

Complementary view: sales staff, for F2F consultations.

From their website: "Every project begins with an interview conducted by our technical consultants to understand your exact workstation wants and needs."
- 7 consultants
- Max utilization 8 h per day over 220 labour days
- One consultation = 30 minutes (pretty fast!)
- Total consultations per year = 7 × 2 × 8 × 220 = 24,640 custom-built PCs annually (assuming a 100% success rate for every consultation)
- At an 80% success rate = 19,712 PCs, rounding to 20k
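The consultation-side estimate follows the same pattern; again, the consultant count, consultation length, and success rate are assumptions from the post:

```python
def annual_consultations(consultants, per_hour, hours_per_day,
                         labour_days, success_rate=1.0):
    """Consultations per year that convert into a PC order."""
    return consultants * per_hour * hours_per_day * labour_days * success_rate

# 30-minute consultations => 2 per hour
all_converted = annual_consultations(7, 2, 8, 220)       # 24,640
realistic = annual_consultations(7, 2, 8, 220, 0.8)      # 19,712
```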

Custom setup (self-service, without consultations) from their web:
There is no need to count this channel for this exercise, because assembly capacity matters more than extrapolating their sales channels.

So we have a first idea of the number of devices they can produce per year: ~27k.

If the average price of one such PC is around $4k (their base-box price), then new computer sales alone should reach revenue of $108M, and on top of that comes revenue from spare parts and paid service. However, this number is significantly higher than what I found in ten web sources (up to $30M only), so actual production must be considerably lower. If their gross profit margin is around 33.3%, that corresponds to roughly $10M at $30M revenue.
They have 63 employees (their web), and based on their role segmentation, overhead costs (based on their Glassdoor data) could reach $6M and staff costs about $7.5M, which supports my gross-profit-margin theory.

However, it is necessary to reduce the initially estimated 27k computers per year to a value that matches their revenue. In this segment, new computer sales make up about 70% of total revenue, which out of a maximum $30M represents $21M. At an average of $4k per computer this means 5,250 built per year, or 7k at a $3k average price (they don't have computers that cheap).
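The revenue-constrained estimate is simple division; the $30M revenue cap, the 70% new-sales share, and the average prices are all the post's assumptions:

```python
revenue = 30_000_000        # upper bound found in web sources (assumption)
new_pc_share = 0.70         # assumed share of revenue from new computer sales

new_pc_revenue = revenue * new_pc_share  # ~$21M

units_at_4k = new_pc_revenue / 4_000     # ~5,250 PCs/year at a $4k average
units_at_3k = new_pc_revenue / 3_000     # ~7,000 PCs/year at a $3k average
```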

So let's set the final number at 6,000 computers produced per year. That represents 70% assembly-staff utilization at 220 minutes per computer, which is not a long time for high-quality work, including perfectly tidy cabling in the case. Even if the real number were 20k, it would not significantly change my evaluation. But it is very important to know at least this number, for a clearer interpretation of the presented graphs.

They wrote in the report:
Additionally, we are filtering out of this data any failures that we believe were caused accidentally by our employees or customers, as well as those related to damage in shipping. The goal here is to isolate problems from the hardware itself, rather than human error.
The split into "shop" and "field" failures makes sense to me. A great idea in this report.
My point: the disadvantage is the missing definition of the causes and types of failures, which I deal with below.

They wrote in the report:
While we will only be covering hardware that we actively sold during 2021, we will include failure data from the last three years (2019 through 2021) when available and applicable. For example, we carried Intel's 10th Gen Core processors during the first few months of 2021 - so we will include all data on these processors from 2019 through 2021 when looking at their reliability. We didn't carry the 9th Gen Core processors at all this year, though, so they are excluded.
My point: the 10th Gen Core approach is correct. What doesn't seem right is that they decided to filter other supplied elements out of the 2019-2021 data. This creates a distorted view, because no one knows how many were filtered out. It leaves room for misinterpretation, and that's a big mistake.

When did those failures happen?
Nowhere did I read how many failures were discovered during the warranty period (they have a base of 1 year plus optional 2/3-year extensions), let alone how long after delivery the failures occurred. This is an essential piece of data for determining the durability of a component; without it, the information is insufficient. Not to mention that desktop/server segmentation matters here too. Not every server needs a server-grade CPU; the purpose of use defines the choice. For example, Intel's K range is excellent for math operations in machine learning or in video rendering as a stand-alone server.

And so I come to the last point I miss in this report - the types and causes of the failures.

CPU report part
It is insufficient to divide CPUs only by generation; there are many differences within a generation. Failures can also differ depending on the components used (appropriate or not). A common mistake is to bake the CPU through overload and insufficient airflow, for example in an environment that was not chosen correctly for operating such a computer. This can happen when suitable airflow is neglected in favour of better performance at the same cost. Another point: how many of these CPUs were air-cooled and how many liquid-cooled, and in what ratio were those cooling types delivered overall? So when I look at the CPU graph, I don't know what types of failures occurred or what caused them, which is a source of misinterpretation. At least the assumption that one computer contained one CPU will mostly hold here.

RAM report part
What I miss here are the vendors and part numbers of the modules. This is a serious shortcoming of the report; it gives the impression that all RAM modules are equally good. And of course: how many modules, and which ones, were used per computer? The number of modules delivered would therefore be a better basis. For an inexperienced reader, the different scale on the Y-axis may suggest that DDR4 modules fail as often as CPUs. In fact those failures are negligibly low, 0.22% of the computers delivered in 2019-2021, and stating just that would have been enough for the average reader; I did not find it in the report.

Disk report part
The disk section has a significant drawback: it is not possible to find out which failures occurred, or, more importantly, out of how many delivered units. The number of computers produced cannot be used as the basis for the number of disks shipped from a given manufacturer, and that's serious. It is also unclear for what purpose the disks were used.
The WD Ultrastar HDDs have three lines in the chart, and it is not clear what this is about. In the case of WD Red, it is not clear at all whether SMR drives were incorrectly used for something other than largely sequential writing.
For example: with 30k delivered disks and a 0.39% failure rate, we are talking about 117 disks for the period 2019-2021. That's about 39 disks a year, below standard statistical deviation.
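For the record, the arithmetic behind that example (the 30k disk count is the post's illustrative assumption, not a figure from the report):

```python
shipped = 30_000         # assumed number of delivered disks
failure_rate = 0.0039    # 0.39% failure rate for 2019-2021

failures_total = shipped * failure_rate   # 117 disks over three years
failures_per_year = failures_total / 3    # ~39 disks a year
```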
It is also not mentioned whether users replaced failed disks after the warranty period without the company ever learning of it, so such failures would never enter the statistics. It is unclear.
Therefore, I would not take the disk information in this report into account at all.

This is my contribution to reading such reports. If I've helped any of you see more, I'm glad.

If I made a mistake somewhere, it is only because I wrote this during a coffee break.
Always good to hear a completely different point of view.
I will not position myself as a statistical expert (although I have forgotten more statistics than some people will ever learn ☺️). Could you explain "below the standard statistical deviation"?
As far as I know, confidence limits are determined by the square root of the sample size, not by the number of defects; statistical interpretation is a matter of confidence level, and there is no hard "below".
But thanks for your nice additional explanation.

Let me add: no data is perfect; you just have to be aware of the confidence level and use it accordingly.
Wikipedia offers an easy-to-digest description:

Standard deviation of average height for adult men

If the population of interest is approximately normally distributed, the standard deviation provides information on the proportion of observations above or below certain values. For example, the average height for adult men in the United States is about 70 inches (177.8 cm), with a standard deviation of around 3 inches (7.62 cm). This means that most men (about 68%, assuming a normal distribution) have a height within 3 inches (7.62 cm) of the mean (67–73 inches (170.18–185.42 cm)) – one standard deviation – and almost all men (about 95%) have a height within 6 inches (15.24 cm) of the mean (64–76 inches (162.56–193.04 cm)) – two standard deviations. If the standard deviation were zero, then all men would be exactly 70 inches (177.8 cm) tall. If the standard deviation were 20 inches (50.8 cm), then men would have much more variable heights, with a typical range of about 50–90 inches (127–228.6 cm). Three standard deviations account for 99.7% of the sample population being studied, assuming the distribution is normal or bell-shaped (see the 68-95-99.7 rule, or the empirical rule, for more information).
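The 68-95-99.7 rule quoted above can be verified directly with Python's standard library, using the same mean and standard deviation as the Wikipedia example:

```python
from statistics import NormalDist

# Adult male height in the US, per the quoted example: mean 70", sd 3"
heights = NormalDist(mu=70, sigma=3)

within_1sd = heights.cdf(73) - heights.cdf(67)  # ≈ 0.683 (67"-73")
within_2sd = heights.cdf(76) - heights.cdf(64)  # ≈ 0.954 (64"-76")
within_3sd = heights.cdf(79) - heights.cdf(61)  # ≈ 0.997 (61"-79")
```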

Translated into my wording:
"with 30k delivered disks and 0.39% failure rate, we are talking about 117 disks for the period 2019-2021. That's about 39 disks a year = below the standard statistical deviation."
When there are almost 0.0% failures (0.39% exactly), we can perceive all the disks as equally functional, literally flawless:
If the standard deviation were zero, then all men would be exactly 70 inches (177.8 cm) tall.

Another point:
"As far as I know, the confidence limits are determined by the square root of the sample size and not by the number of defects, statistic interpretation is a matter of confidence level, there is no hard 'below'."
Confidence interval and standard deviation are two different measurements for two different purposes.

If the data in the disk report were more segmented, we would certainly see other, probably higher, standard deviations.
In my humble opinion, you are in the wrong chapter of the statistics book.
Failure analysis is pass/fail, and therefore a non-parametric analysis is needed.
You cannot use a Gaussian standard deviation on a dataset containing only broken or non-broken disks.
I am happy to discuss, but not on the forum.
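As a concrete illustration of putting error bars on pass/fail data: one common approach is a binomial proportion confidence interval, e.g. the Wilson score interval. This is my sketch, not anything from the report, applied to the thread's example of 117 failures out of an assumed 30,000 disks:

```python
from math import sqrt

def wilson_interval(failures, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z=1.96 gives roughly a 95% confidence level)."""
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 117 failed disks out of an assumed 30,000 shipped (0.39% observed)
lo, hi = wilson_interval(117, 30_000)   # roughly 0.33% to 0.47%
```

Even with 30k units, the 95% interval spans about 0.33%-0.47%, which shows why comparing models whose observed rates differ by a tenth of a percent is shaky without the sample sizes.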
back to the topic
Back to the topic: I'm just curious what the real reason was for the big difference in Intel Core 11th Gen failures in their SHOP, i.e. before the CPU/computer was delivered to a customer.


There is just this explanation:
"but we did see an oddly high rate of failures with Intel's consumer-oriented 11th Gen processors... which seems odd, especially next to the very low rates shown by the preceding 10th Gen."
It is an unclear statement. Was it DOA units from the Intel distribution chain, CPUs burned during tests, a human factor, …?

