DOE: WHAT IS IT?


  1. DOE stands for Design of Experiments. It is a statistical technique widely used in science and engineering to establish a mathematical relationship between (i) the responses (Y-s), i.e. the system behavior, and (ii) the independent factors (X-s) that affect those responses. A minimum of 2 X-s is required for a DOE.

  2. Experiments are run, i.e. the system is operated at certain selected values of the X-s and the responses (Y-s) are recorded. A mathematical relationship between the Y-s and the X-s is then derived as mentioned in 1 above, and this relationship is called a Transfer Function (TF). This exercise is also called Model Building.

  3. We use this TF to (i) make predictions about the system response for any values of the X-s within the intervals considered for experimentation, (ii) find factor settings that will give us a desired value or range of Y, and (iii) find factor settings that optimize two or more Y-s at one go.

  4. The TF is the KEY to understanding system behavior, and is the ultimate objective of any DOE.

  5. Generally, we are interested in determining which factor(s) affect the mean of the response. If the experiments are replicated, as they should be, then we also try to find a relationship between the standard deviations of the Y-s and the X-s, i.e. find which factors determine the variability of the response.

  6. For a chemical reaction, the Y-s could be yield, viscosity, refractive index, molecular weight, impurity content etc., while the X-s could be temperature, pressure, flow rate, catalyst type / quantity, order of addition, quantities of raw material etc.
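
To make the idea of a Transfer Function concrete, here is a minimal sketch of the simplest possible DOE: a two-level full factorial with two factors. Everything specific in it is an assumption made for illustration only: the factor names (temperature and pressure), the coded levels -1 and +1, the yield numbers, and the choice of a model with one interaction term, Y = b0 + b1*X1 + b2*X2 + b12*X1*X2. For a balanced design like this one, the least-squares coefficients reduce to simple averages of signed responses.

```python
# Hypothetical 2x2 full factorial, made-up data for illustration only.
# Each run is (X1, X2, Y): X1 = temperature, X2 = pressure in coded units
# (-1 = low setting, +1 = high setting), Y = yield in percent.
RUNS = [
    (-1, -1, 60.0),
    (+1, -1, 72.0),
    (-1, +1, 54.0),
    (+1, +1, 68.0),
]


def fit_transfer_function(runs):
    """Fit Y = b0 + b1*X1 + b2*X2 + b12*X1*X2 to a balanced two-level design.

    With coded -1/+1 levels the design columns are orthogonal, so each
    least-squares coefficient is just the average of (column * response).
    """
    n = len(runs)
    b0 = sum(y for _, _, y in runs) / n
    b1 = sum(x1 * y for x1, _, y in runs) / n
    b2 = sum(x2 * y for _, x2, y in runs) / n
    b12 = sum(x1 * x2 * y for x1, x2, y in runs) / n
    return b0, b1, b2, b12


def predict(coeffs, x1, x2):
    """Evaluate the fitted TF at coded settings inside [-1, +1]."""
    b0, b1, b2, b12 = coeffs
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


coeffs = fit_transfer_function(RUNS)
print("b0, b1, b2, b12 =", coeffs)
print("predicted Y at the centre of the design:", predict(coeffs, 0.0, 0.0))
```

With only four runs and four coefficients the model reproduces the data exactly, so nothing is left over to estimate noise. That is one reason replication, mentioned in point 5, matters in practice.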



International Pi Day


Today is Pi Day — it’s 3/14 as per the American style of writing dates. Wiki says that 22nd July is an alternate date for Pi Day.
 
I was reminded of those lovely mnemonics involving π. Let me start with the simplest, and perhaps the most appropriate one due to one C. Heckman which goes thus: HOW I WISH I COULD CALCULATE PI.
 
The number of letters in each word gives the sequence of digits for π, so π is 3.141592 as per the above mnemonic.
The one which we knew since our high school days was of course MAY I HAVE A LARGE CONTAINER OF COFFEE, which gives the value of π to 8 digits which is 3.1415926.
 
The mnemonic by Sir James Jeans is perhaps the most scintillating: HOW I WANT A DRINK, ALCOHOLIC OF COURSE, AFTER THE HEAVY LECTURES INVOLVING QUANTUM MECHANICS which is π to 15 digits: 3.14159265358979. I first read this in the book by Posamentier and Lehmann referred to below.
 
If we add the phrase “All of thy geometry, Herr Planck, is fairly hard,” to Sir James Jeans’s mnemonic, we get a total of 24 digits in all: 3.14159265358979323846264.
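
For those who would rather not count letters by hand, the decoding rule can be sketched in a few lines of Python. The function name `pi_from_mnemonic` is my own invention, and treating a ten-letter word as the digit 0 is an assumed convention that none of the mnemonics above actually needs.

```python
import re


def pi_from_mnemonic(sentence):
    """Turn a mnemonic into digits: one digit per word, equal to its letter count."""
    # Keep only runs of letters, so commas and other punctuation are ignored.
    words = re.findall(r"[A-Za-z]+", sentence)
    # Assumed convention: a 10-letter word stands for 0 (hence the modulo).
    digits = "".join(str(len(word) % 10) for word in words)
    # The first word gives the integer part, the rest are decimals.
    return digits[0] + "." + digits[1:]


jeans = ("HOW I WANT A DRINK, ALCOHOLIC OF COURSE, AFTER THE HEAVY "
         "LECTURES INVOLVING QUANTUM MECHANICS")
planck = "All of thy geometry, Herr Planck, is fairly hard"

print(pi_from_mnemonic("HOW I WISH I COULD CALCULATE PI"))
print(pi_from_mnemonic(jeans))
print(pi_from_mnemonic(jeans + " " + planck))
```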
 
Posamentier and Lehmann, in their book “π: A Biography of the World’s Most Mysterious Number”, describe the wealthy Frenchman Georges Buffon’s mind-blowing attempt in 1777 to calculate π, which goes like this.
 
Suppose you have a piece of paper with ruled parallel lines equally spaced at a distance d between lines, and a thin needle of length l where l < d. You then toss the needle onto the paper many times.
 
Buffon claimed that the probability that the needle will touch one of the ruled lines is 2l/(πd). However, Buffon wasn’t a famous chap, so this claim languished in obscurity.
 
About 35 years later, the great French mathematician Pierre-Simon Laplace popularized it, and that’s how the world came to know of it.
 
Posamentier and Lehmann say that we can try this out for ourselves. If for the sake of simplicity, we let l = d, then the value of π = 2/p where p is the probability that the needle touches a line. The value of p is given by
 
p = (number of times the needle touches a line) / (total number of tosses).
 
An Italian mathematician named Mario Lazzarini ACTUALLY did this in 1901. He tossed the needle 3408 times, and came up with a value of π = 3.1415929.
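
If you don't have a needle and ruled paper handy, the experiment can be sketched as a simulation. This is only my own illustration, not anything from the book: the function name `estimate_pi`, the default values and the seed are all assumptions. To avoid the circularity of using π to estimate π, the needle's direction is drawn by picking a random point in a quarter disc rather than by sampling an angle from 0 to π/2.

```python
import math
import random


def estimate_pi(tosses, l=1.0, d=1.0, seed=None):
    """Estimate pi with Buffon's needle: needle length l, line spacing d, l <= d."""
    if not 0.0 < l <= d:
        raise ValueError("need 0 < l <= d")
    rng = random.Random(seed)
    hits = 0
    for _ in range(tosses):
        # Distance from the needle's centre to the nearest line.
        centre = rng.uniform(0.0, d / 2.0)
        # Random direction without using pi: take a uniform point in the
        # unit quarter disc (by rejection) and use its angle to the x-axis.
        while True:
            x, y = rng.random(), rng.random()
            r2 = x * x + y * y
            if 0.0 < r2 <= 1.0:
                break
        sin_theta = y / math.sqrt(r2)
        # The needle touches a line if its half-length, projected across
        # the lines, reaches at least as far as the nearest line.
        if centre <= (l / 2.0) * sin_theta:
            hits += 1
    if hits == 0:
        raise ValueError("no hits recorded; use more tosses")
    p = hits / tosses
    # Buffon: p = 2l/(pi*d), so pi = 2l/(p*d).
    return 2.0 * l / (p * d)


print(estimate_pi(100_000, seed=1))
```

Don't expect Lazzarini's seven digits from this. The error of such an estimate shrinks only with the square root of the number of tosses, so a few thousand tosses typically give two digits or so.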
 
BTW, the concept of Pi Day was first implemented in 1988 by Larry Shaw, an American physicist cum curator cum artist who worked in the San Francisco Science Museum.
 
UNESCO declared Pi Day as the International Day of Mathematics in Nov 2019.
 

Covid and Rev Bayes


I wrote this piece on LinkedIn in Oct 2020 when we were just recovering from the pandemic. The Hindu of Chennai was publishing daily mortality rates.
 
This is dedicated to the Most Rev Thomas Bayes (1701-1761) who was an English statistician and philosopher, not to mention a Presbyterian minister. The Reverend's theorem is used to calculate posterior probabilities given prior ones. In these dreadful Covid times, let me give a very simple application.
 
Suppose you have developed a Covid test which has a sensitivity of 99%. A sensitivity of 99% means that if you were to administer this test to a hundred KNOWN cases of Covid, the test will give a positive signal for 99 people. In other words, the test will correctly identify 99 people as having the disease, however 1 person will pass through undetected. 1% therefore is the false negative rate. Tragic, isn't it?
Another equally important concept is that of specificity. Suppose your test has a specificity of 98%. This means that if you were to administer this test to 100 people who DON'T have the disease, then it will pass 98 people. In other words, out of a hundred "clean" people, the test will correctly identify 98 as free of the disease, but will falsely identify 2 persons as having the disease when they don't! 2% is the false positive rate. Dangerous, isn't it?
 
So far so good. I congratulate you on your patience for having stayed this far!
 
Let's define some stuff now. Let us denote the disease as D, and the probability of having the disease as p(D). The Hindu online edition on 4 Oct 2020 stated that the total number of ACTIVE Covid-19 cases in India is 937856. The Worldometer gives the total Indian population to be 1380 million (MM), so this works out to a prevalence of (0.938 MM/1380 MM) = 0.068%. In other words, p(D) = 0.068% = 0.00068
 
This value of p(D) is contested by many. Some experts are of the opinion that the actual prevalence is about a hundred times higher, say in the vicinity of 5-7%, probably with good reason too given the nature of our reporting and the state of our healthcare system. Since I have no clue what the real number is, I will go with the calculated figure of 0.068%.
 
Let the event of getting a positive signal from the test you have developed be denoted as S. Since your sensitivity is 99%, it means that p(S | D) is 0.99; this is read as "the probability that the test gives a positive signal GIVEN THAT a person has the disease is 99%". This is an example of what is called conditional probability.
 
Since your test has 98% specificity, it means that p(S | D') = 1 - 0.98 = 0.02 which reads as "the probability that the test signals positive GIVEN THAT a person DOESN'T have the disease is 2%". Why? Like I explained earlier, the test correctly passes 98 people as clean out of a total of 100 clean people, which means 2 clean persons out of a 100 clean are incorrectly classified as having the disease when they don't. In other words, the test gives a positive signal S when the person doesn't have the disease.
 
Please note that any medical test of the YES/NO variety MUST indicate the sensitivity and the specificity. In our case we have assumed sensitivity to be 99% and specificity to be 98%.
 
Bayes' Theorem tells you how to calculate p(D | S) if you know p(S | D), p(D) and the specificity. For further details, look up page 55 of Applied Statistics and Probability for Engineers by DC Montgomery and George Runger, 6th edition. I will just give the result here:
 
p(D | S) = { p(S | D) * p(D) } / { p(S | D) * p(D) + p(S | D') * p(D') }
 
Plugging in the values of p(S | D) = 0.99, p(D) = 0.00068 (0.068% calculated above), p(S | D') = 0.02 and p(D') = 1 - 0.00068 = 0.99932, we have
 
p(D | S) = (0.99 * 0.00068) / (0.99 * 0.00068 + 0.02 * 0.99932) = 0.0006732 / 0.0206596 = 3.26%. This is also called the PPV or the Positive Predictive Value.
 
What this means is that if a test is 99% sensitive and 98% specific, AND if the total incidence of the disease in the population is 0.068%, then the probability that the person has the disease GIVEN THAT he has tested positive is ONLY 3.26%. This implies that under these circumstances, if the test declares 100 people as having the disease, only 3 will actually have it!!
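
For those who prefer code to a spreadsheet, the same calculation can be sketched as a small Python function. The function and argument names are my own choice; the formula is just Bayes' Theorem as written above, with p(S | D') obtained as 1 minus the specificity.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value p(D | S) from Bayes' Theorem.

    sensitivity = p(S | D), specificity = 1 - p(S | D'), prevalence = p(D).
    All three are fractions between 0 and 1, not percentages.
    """
    true_positives = sensitivity * prevalence                    # p(S | D) * p(D)
    false_positives = (1.0 - specificity) * (1.0 - prevalence)   # p(S | D') * p(D')
    return true_positives / (true_positives + false_positives)


# The case worked out above: 99% sensitive, 98% specific, 0.068% prevalence.
print(f"PPV = {ppv(0.99, 0.98, 0.00068):.2%}")
```

A handy way to sanity-check the number is to think in head counts. Out of a million people, about 680 have the disease and roughly 673 of them test positive, while about 19,986 of the 999,320 healthy people also test positive. The genuine cases are therefore only a little over 3% of all the positives.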
 
 
However if you change the value of p(D) from 0.068% to 7%, which is what many prophets say is the true prevalence rate, then the PPV works out to 78.8%. Under this circumstance, if the test declares 100 people as positive, only 79 actually have the disease!
 
The result is STRONGLY dependent on the specificity of the test, and of course on the prevalence of the disease.
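
To see this dependence at a glance, here is a small sketch that tabulates the PPV over a few specificities and prevalences, keeping the sensitivity fixed at 99%. The particular grid values and the helper name `ppv_table` are my own assumptions, picked only for illustration.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value p(D | S) from Bayes' Theorem (inputs as fractions)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


def ppv_table(sensitivity, specificities, prevalences):
    """Return {(specificity, prevalence): PPV} for every combination."""
    return {
        (spec, prev): ppv(sensitivity, spec, prev)
        for spec in specificities
        for prev in prevalences
    }


SPECIFICITIES = [0.90, 0.98, 0.999]   # assumed grid, for illustration
PREVALENCES = [0.00068, 0.01, 0.07]   # includes the two figures used above

table = ppv_table(0.99, SPECIFICITIES, PREVALENCES)
print("specificity  " + "".join(f"p(D)={prev:<9.3%}" for prev in PREVALENCES))
for spec in SPECIFICITIES:
    cells = "".join(f"{table[(spec, prev)]:<14.1%}" for prev in PREVALENCES)
    print(f"{spec:<13.1%}{cells}")
```

Reading along a row shows the effect of prevalence, and reading down a column shows the effect of specificity. At the low prevalence of 0.068%, for instance, pushing the specificity from 98% to 99.9% lifts the PPV from about 3% to around 40%.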
 
Please don't blame Rev. Bayes, or worse still ME, if you don't like the numbers. The result that we are interested in, i.e. the probability that a person HAS the disease GIVEN THAT he has tested positive, also called the Positive Predictive Value (PPV), is a function of (i) the prevalence of the disease in the general population, (ii) the sensitivity of the test used and, most importantly, (iii) the specificity of the test.
 
I have made an Excel sheet for those who want to play around; let me know if you want it.
 
Of course, it is very difficult to get an idea of p(D) at the outset.