Note: Only for Reference
In the second week of our Data Analysis and Business Intelligence course, we covered Chapters 3 and 4 in full and started Chapter 5. Topics included measures of location (mean, median, mode), measures of dispersion (range, variance, standard deviation), charts for displaying and exploring data (dot plots, stem-and-leaf displays, box plots), measures of position, skewness and the coefficient of variation, and the interpretation of scatter diagrams and contingency tables.
We also looked at some basic concepts of probability: ways of assigning probability, types of probability, mutually exclusive events, the rules of addition and multiplication, and the complement rule. This Week 2 discussion question focuses mainly on the concepts of Chapters 3 and 4.
“Mean is affected by extreme values but the median is not”.
The statement “Mean is affected by extreme values but the median is not” is true. The mean is the average of a set of values, computed by dividing the sum of all values by the number of values. The median is the value of the variate that divides the total frequency into two equal halves.
The median lies exactly at the center of the ordered data, whether the data are arranged in ascending or descending order, so it depends only on the position of the middle value and not on how large or small the end values are. The mean, on the other hand, uses every value, so a very large or very small value at either end pulls the mean toward it. Therefore, the mean is affected by extreme values but the median is not.
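As a small illustration of this point, the sketch below compares the mean and median of two short lists. The numbers are hypothetical and invented only for this example; they are not taken from the Lincolnville data.

```python
from statistics import mean, median

# Hypothetical costs, invented for illustration only (not the Lincolnville data).
typical = [3000, 3500, 4000, 4500, 5000]
# Same data, but the largest value is replaced by an extreme one.
with_extreme = [3000, 3500, 4000, 4500, 50000]

typical_mean, typical_median = mean(typical), median(typical)
extreme_mean, extreme_median = mean(with_extreme), median(with_extreme)

print("typical:     ", typical_mean, typical_median)
print("with extreme:", extreme_mean, extreme_median)
```

In this made-up case the mean moves from 4000 to 13000 while the median stays at 4000, because only the size of the end value changed and not the middle position.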
Numerical Illustration for Lincolnville School District bus data
Given Information:
 Number of observations (N) = 80
 Minimum Value = 505
 Maximum Value = 10575
Class Interval   Frequency (F)   Mid Value (M)   Cum. Frequency (C.F)   F*M      (M - X̅)   (M - X̅)^{2}   F(M - X̅)^{2}
500 – 1950       8               1225            8                      9800     -3280      10758400       86067200
1950 – 3400      19              2675            27                     50825    -1830      3348900        63629100
3400 – 4850      27              4125            54                     111375   -380       144400         3898800
4850 – 6300      11              5575            65                     61325    1070       1144900        12593900
6300 – 7750      5               7025            70                     35125    2520       6350400        31752000
7750 – 9200      5               8475            75                     42375    3970       15760900       78804500
9200 – 10650     5               9925            80                     49625    5420       29376400       146882000

N = 80
∑FM = 360450
∑F(M - X̅)^{2} = 423627500

Working Notes:
For the number of classes:
According to the “2 to the k rule”, we need the smallest k such that 2^k > N, i.e. 2^k > 80.
Trying k = 6: 2^6 = 64, and 64 < 80, so 6 classes will not be enough.
Trying the next value, k = 7: 2^7 = 128, and 128 > 80, so the suggested number of classes is 7.

For the value of the class interval:
i = (highest value - lowest value)/k
i = (10575 - 505)/7
i = 1438.57 (taken as i = 1450 approximately, since the class width should be a convenient round number such as a multiple of 10 or 100)
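The working notes above can be sketched in a few lines of Python. The helper name number_of_classes is hypothetical, and rounding the width up to the next multiple of 50 is an assumption chosen only because it reproduces the 1450 used here; other round widths would also be acceptable.

```python
import math

def number_of_classes(n):
    """Return the smallest k such that 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

n, lowest, highest = 80, 505, 10575

k = number_of_classes(n)              # 2**6 = 64 is too small, 2**7 = 128 works
raw_width = (highest - lowest) / k    # 10070 / 7

# Assumption: round up to the next multiple of 50 to get a convenient width.
width = math.ceil(raw_width / 50) * 50

print(k, round(raw_width, 2), width)
```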

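Mean (X̅):
X̅ = ∑FM/N = 360450/80 = 4505.63, taken as 4505 in the calculations

Standard Deviation (σ):
σ = √(∑F(M - X̅)^{2}/N)
= √(423627500/80)
= 2301.16

Median:
Median Class = 3400 – 4850
Median = L + [{(∑F/2) - C.F}/F]*(H - L)
= 3400 + [{(80/2) - 27}/27]*(4850 - 3400)
= 4098.148
≈ 4098
Here L and H are the lower and upper limits of the median class, F is its frequency, and C.F is the cumulative frequency of the class before it.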
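As a cross-check on the hand calculation, here is a minimal sketch that recomputes the grouped mean, standard deviation, and median from the class limits and frequencies in the frequency table. The function name grouped_median is hypothetical. Note that the code uses the unrounded mean (4505.625) for the deviations, so intermediate figures can differ slightly from a working that uses 4505.

```python
import math

# (lower limit, upper limit, frequency) for each class in the frequency table
classes = [
    (500, 1950, 8), (1950, 3400, 19), (3400, 4850, 27), (4850, 6300, 11),
    (6300, 7750, 5), (7750, 9200, 5), (9200, 10650, 5),
]

n = sum(f for _, _, f in classes)
mids = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: sum of F*M divided by N
grouped_mean = sum(f * m for f, m in zip(freqs, mids)) / n

# Population-style standard deviation for grouped data: sqrt(sum F(M - mean)^2 / N)
variance = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, mids)) / n
std_dev = math.sqrt(variance)

def grouped_median(classes):
    """Interpolate the median inside the first class whose cumulative frequency reaches N/2."""
    total = sum(f for _, _, f in classes)
    cum = 0  # cumulative frequency of the classes before the current one
    for lo, hi, f in classes:
        if cum + f >= total / 2:
            return lo + (total / 2 - cum) / f * (hi - lo)
        cum += f

print(round(grouped_mean, 2), round(std_dev, 2), round(grouped_median(classes), 2))
```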
For Range:
Range = Maximum value - Minimum value = 10575 - 505 = 10070
 Answer
Looking at the frequency table of bus maintenance costs for the Lincolnville School District, the data tend to cluster between 3400 and 4850, the class with the highest frequency (27 of the 80 observations). The mean maintenance cost was determined to be 4505, with a median cost of 4098.
Yes, one measure can be more representative of the typical cost than the others. The mean is not well suited to cost data with large fluctuations, because it uses every value and is pulled toward the extremes, so the median is more appropriate in such cases. For data with smaller fluctuations, the mean is the better option. So, depending on the situation, one measure can represent the data more accurately than the others.
 Answer
The maintenance costs range from 505 to 10575, giving a range of 10070. After evaluation, the standard deviation was determined to be 2301.16.
Empirical rule (about 95% of observations) = X̅ ± 2σ
= 4505 ± (2*2301.16)
= 4505 ± 4602.32
= 4505 - 4602.32 to 4505 + 4602.32
= -97.32 to 9107.32
According to the Empirical Rule, about 95% of the maintenance costs lie within two standard deviations of the mean, so the low and high ends of the 95% interval are -97.32 and 9107.32. Since a cost cannot be negative, the lower end is effectively 0.
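The interval arithmetic can be checked with a short sketch. The variable names are hypothetical, and the mean and standard deviation are the values reported in this answer.

```python
# Values reported in this answer
mean_cost = 4505
std_dev = 2301.16

# Empirical rule: about 95% of observations fall within mean +/- 2 standard deviations
half_width = 2 * std_dev
lower = mean_cost - half_width
upper = mean_cost + half_width

print(round(half_width, 2), round(lower, 2), round(upper, 2))
```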
 Answer

Statistic      Value    Difference (from the previous value)
Minimum        505      505
Q1             3081     2576
Q2 (Median)    4178.5   1097.5
Q3             5408     1229.5
Maximum        10575    5167
Fig: Box-and-Whisker Plot
From the box-and-whisker plot above, we can see that the distribution is positively skewed: the median is not centrally located in the box, and the distance from the first quartile to the median (1097.5) is less than the distance from the median to the third quartile (1229.5). The upper whisker (5167) is also longer than the lower whisker (2576).
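The comparison of distances can be reproduced with a few lines of Python. This is only a sketch with hypothetical variable names; it takes the maximum as 10575, the value listed under Given Information, which is the value that yields the difference of 5167 in the table.

```python
# Five-number summary; 10575 is the maximum from the Given Information.
minimum, q1, med, q3, maximum = 505, 3081, 4178.5, 5408, 10575

lower_whisker = q1 - minimum   # distance from minimum to Q1
lower_box = med - q1           # distance from Q1 to the median
upper_box = q3 - med           # distance from the median to Q3
upper_whisker = maximum - q3   # distance from Q3 to the maximum

# A longer upper box half and upper whisker point to positive (right) skew.
positively_skewed = upper_box > lower_box and upper_whisker > lower_whisker

print(lower_whisker, lower_box, upper_box, upper_whisker, positively_skewed)
```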