Note: Only for Reference
In the second week of our Data Analysis and Business Intelligence course, we completed two chapters (3 and 4) and started chapter 5. We discussed measures of location (mean, median, mode), measures of dispersion (range, variance, standard deviation), charts for displaying and exploring data (dot plots, stem-and-leaf displays, box plots), measures of position, skewness and the coefficient of variation, and the interpretation of scatter diagrams and contingency tables.
We also looked at some basic concepts of probability: ways of assigning probability, types of probability, mutually exclusive events, the rules of addition and multiplication, and the complement rule. This week 2 discussion question focuses mainly on the concepts of chapters 3 and 4.
“Mean is affected by extreme values but the median is not”.
The statement “Mean is affected by extreme values but the median is not” is true. The mean is the average of a set of values, computed by dividing the total of all values by the number of values. The median is the value of the variate that divides the total frequency into two halves.
Once the data are sorted (in ascending or descending order, the result is the same), the median is the value at the center of the distribution, so it depends only on the middle position and not on how large or small the values at the ends are. The mean, on the other hand, uses every value, so a heavy fluctuation at either end pulls the mean towards that end. Therefore, the mean is affected by extreme values but the median is not.
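The point can be checked with a small sketch. The five costs below are made-up numbers chosen only for illustration (they are not taken from the bus data); the largest value is replaced by an extreme one and both measures are recomputed.

```python
from statistics import mean, median

# Hypothetical costs, for illustration only (not from the Lincolnville data)
costs = [400, 450, 500, 550, 600]

# Same data, but the largest value is replaced by an extreme value
costs_extreme = [400, 450, 500, 550, 6000]

mean_before, median_before = mean(costs), median(costs)
mean_after, median_after = mean(costs_extreme), median(costs_extreme)

print("mean:  ", mean_before, "->", mean_after)      # the mean shifts upward
print("median:", median_before, "->", median_after)  # the median stays put
```

Here the mean moves from 500 to 1580 while the median remains 500, because only the size of an end value changed and the middle position did not.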
Numerical Illustration for Lincolnville School District bus data
Given Information:
 Number of observations (N) = 80
 Minimum Value = 505
 Maximum Value = 10575
| Class Interval | Frequency (F) | Mid Value (M) | Cum. Frequency (C.F) | F*M | (M - X̅) | (M - X̅)^2 | F(M - X̅)^2 |
| 500 – 1950 | 8 | 1225 | 8 | 9800 | -3280 | 10758400 | 86067200 |
| 1950 – 3400 | 19 | 2675 | 27 | 50825 | -1830 | 3348900 | 63629100 |
| 3400 – 4850 | 27 | 4125 | 54 | 111375 | -380 | 144400 | 3898800 |
| 4850 – 6300 | 11 | 5575 | 65 | 61325 | 1070 | 1144900 | 12593900 |
| 6300 – 7750 | 5 | 7025 | 70 | 35125 | 2520 | 6350400 | 31752000 |
| 7750 – 9200 | 5 | 8475 | 75 | 42375 | 3970 | 15760900 | 78804500 |
| 9200 – 10650 | 5 | 9925 | 80 | 49625 | 5420 | 29376400 | 146882000 |
| Total | N = 80 | | | ∑FM = 360450 | | | ∑F(M - X̅)^2 = 423627500 |

Working Notes:
For the number of classes:
According to the “2 to the k rule”, choose the smallest k such that 2^k > N, i.e. 2^k > 80.
Trying k = 6: 2^6 = 64, and 64 is less than 80, which indicates that 6 classes will not be enough.
Trying the next highest number, k = 7: 2^7 = 128, and 128 is greater than 80, which indicates that the suggested number of classes is 7.

For the value of the class interval:
i = (highest value - lowest value)/k
i = (10575 - 505)/7
i = 1438.57 (we take i = 1450, rounding up to a convenient multiple of 10 or 100)
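The two working notes can be reproduced with a minimal sketch. The function name suggest_classes is a hypothetical helper introduced here for illustration; N, the minimum and the maximum are the values from the Given Information.

```python
def suggest_classes(n):
    """Return the smallest k with 2**k > n (the '2 to the k rule')."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

n = 80                    # number of observations
low, high = 505, 10575    # minimum and maximum values

k = suggest_classes(n)            # number of classes
raw_width = (high - low) / k      # class interval before rounding up

print("classes:", k)
print("raw class interval:", round(raw_width, 2))
```

The raw interval is then rounded up by hand to a convenient value (1450 in the working above); the sketch does not automate that choice.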

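Mean (X̅):
X̅ = ∑FM/N = 360450/80 = 4505.625, taken as 4505 in the calculations here

Standard Deviation (σ):
σ = √(∑F(M - X̅)^2 / N)
= √(423627500/80)
= 2301.16

Median:
Median Class = 3400 – 4850
Median = L + [(∑F/2 - C.F)/F] * (H - L)
= 3400 + [(80/2 - 27)/27] * (4850 - 3400)
= 4098.148
= 4098 (approximately)
where L and H are the lower and upper limits of the median class, F is its frequency and C.F is the cumulative frequency of the class before it.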
For Range:
Range = Maximum value - Minimum value = 10575 - 505 = 10070
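As a cross-check, the grouped-data formulas can be sketched in a few lines. The class limits and frequencies are the ones from the frequency table; the variable names are my own. Note that this sketch uses the unrounded mean (4505.625) and the population form of the standard deviation (dividing by N), as in the working above.

```python
from math import sqrt

lower_limits = [500, 1950, 3400, 4850, 6300, 7750, 9200]
width = 1450
freq = [8, 19, 27, 11, 5, 5, 5]

n = sum(freq)
mids = [lo + width / 2 for lo in lower_limits]   # class mid values

# Grouped mean: sum(F*M) / N
grouped_mean = sum(f * m for f, m in zip(freq, mids)) / n

# Grouped (population) standard deviation: sqrt(sum(F*(M - mean)^2) / N)
grouped_sd = sqrt(sum(f * (m - grouped_mean) ** 2 for f, m in zip(freq, mids)) / n)

# Grouped median: L + ((N/2 - C.F) / F) * class width
cum = 0   # cumulative frequency before the current class
for lo, f in zip(lower_limits, freq):
    if cum + f >= n / 2:          # first class that reaches the halfway point
        grouped_median = lo + (n / 2 - cum) / f * width
        break
    cum += f

# The range uses the raw minimum and maximum, not the class limits
data_range = 10575 - 505

print(round(grouped_mean, 3), round(grouped_sd, 2), round(grouped_median, 2), data_range)
```

Using the unrounded mean changes the standard deviation only beyond the second decimal place, so it still rounds to 2301.16.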
 Answer
Looking at the frequency table of bus maintenance costs for the Lincolnville School District, the data tend to cluster between 3400 and 4850, since that class has the highest frequency (27 of the 80 observations). The mean maintenance cost was determined to be 4505, with a median cost of 4098.
Yes, one measure can be more representative of the typical cost than the others. The mean is not suitable for cost data with large fluctuations, because it takes every value into account and is pulled towards the extremes, so the median is more appropriate in such cases. For data with smaller fluctuations, the mean is the better option. Depending on the situation, therefore, one measure can represent the data more accurately than the others.
 Answer
The range of the maintenance costs is 10070, from a minimum of 505 to a maximum of 10575. After evaluation, the standard deviation was determined to be 2301.16.
Empirical rule: X̅ ± 2σ
= 4505 ± (2 * 2301.16)
= 4505 ± 4602.32
= 4505 - (2 * 2301.16) to 4505 + (2 * 2301.16)
= -97.32 to 9107.32
According to the Empirical Rule, about 95% of the maintenance costs lie within two standard deviations of the mean, so the low and high ends of the 95% interval are -97.32 and 9107.32. Because a cost cannot be negative, the lower end is effectively 0.
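The interval arithmetic can be verified with a short sketch. The mean and standard deviation are the values found earlier; clipping the lower end at zero is my own assumption, based only on the fact that a maintenance cost cannot be negative.

```python
mean_cost = 4505
sd_cost = 2301.16

# Two standard deviations either side of the mean (about 95% under the Empirical Rule)
low = mean_cost - 2 * sd_cost
high = mean_cost + 2 * sd_cost

# Assumption: costs cannot be negative, so clip the lower end at zero
practical_low = max(low, 0.0)

print(f"{low:.2f} to {high:.2f}")
print(f"practical interval: {practical_low:.2f} to {high:.2f}")
```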
 Answer


| Statistic | Value | Difference |
| Minimum | 505 | 505 |
| Q1 | 3081 | 2576 |
| Q2 (Median) | 4178.5 | 1097.5 |
| Q3 | 5408 | 1229.5 |
| Maximum | 10575 | 5167 |
Fig: Box-and-Whisker Plot
From the above box-and-whisker plot, we can see that the distribution is positively skewed: the median is not centrally located in the box, and the distance from the first quartile to the median (1097.5) is less than the distance from the median to the third quartile (1229.5).
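The quartile comparison can be put into numbers with a rough sketch. The quartiles are the values listed above and the minimum and maximum are from the Given Information. The quartile (Bowley) skewness coefficient used at the end is not part of the original answer; it is added here only as one possible way to summarise the same comparison.

```python
minimum, q1, q2, q3, maximum = 505, 3081, 4178.5, 5408, 10575

lower_box = q2 - q1            # first quartile to median
upper_box = q3 - q2            # median to third quartile
left_whisker = q1 - minimum
right_whisker = maximum - q3

# Quartile (Bowley) skewness: positive when the upper half of the box is longer
quartile_skew = (upper_box - lower_box) / (q3 - q1)

print(lower_box, upper_box)           # 1097.5 1229.5
print(left_whisker, right_whisker)    # 2576 5167
print(round(quartile_skew, 4))
```

A positive coefficient, together with the longer right whisker, points in the same direction as the reading of the plot: a tail towards the higher costs.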