# Week 2 DQ1 – Data Analysis and Business Intelligence | Westcliff University

Note: Only for Reference

In the second week of our Data Analysis and Business Intelligence course, we covered chapters 3 and 4 in full and started chapter 5. We discussed measures of location (mean, median, mode), measures of dispersion (range, variance, standard deviation), charts for displaying and exploring data (dot plots, stem-and-leaf displays, box plots), measures of position, skewness and the coefficient of variation, and the interpretation of scatter diagrams and contingency tables.

We also looked at some basic concepts of probability: ways of assigning probability, types of probability, mutually exclusive events, the rules of addition and multiplication, and the complement rule. This week 2 discussion question focuses mainly on the concepts of chapters 3 and 4.

“Mean is affected by extreme values but the median is not”.

The statement “Mean is affected by extreme values but the median is not” is true. The mean is the average of a set of values, computed by dividing the sum of all values by the number of values. The median is the value that divides the ordered data into two halves, with half of the observations below it and half above it.

The median is the value at the center of the ordered data, whether the data are arranged in ascending or descending order, so it depends on the position of the values rather than their size. The mean, on the other hand, uses every value, so very large or very small values at either end pull it toward them. Therefore, the mean is affected by extreme values, while the median is largely unaffected.
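The effect can be seen with a small sketch. The cost values below are hypothetical and chosen only for illustration; they are not taken from the bus data. Adding one extreme value moves the mean a long way, while the median shifts only slightly.

```python
from statistics import mean, median

# Hypothetical costs for illustration only (not from the bus data set).
costs = [400, 450, 500, 550, 600]
costs_with_outlier = costs + [10000]  # add a single extreme value

mean_before, median_before = mean(costs), median(costs)
mean_after, median_after = mean(costs_with_outlier), median(costs_with_outlier)

print(f"without extreme value: mean={mean_before}, median={median_before}")
print(f"with extreme value:    mean={mean_after:.2f}, median={median_after}")
```

In this sketch the mean more than quadruples, whereas the median only moves to the midpoint of the two middle values.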

Numerical Illustration for Lincolnville School District bus data

Given Information:

• Number of observations (N) = 80
• Minimum Value = 505
• Maximum Value = 10575
| Class Interval | Frequency (F) | Mid Value (M) | Cum. Frequency (C.F) | F*M | (M-X̅) | (M-X̅)² | F(M-X̅)² |
|---|---|---|---|---|---|---|---|
| 500-1950 | 8 | 1225 | 8 | 9800 | -3280 | 10758400 | 86067200 |
| 1950-3400 | 19 | 2675 | 27 | 50825 | -1830 | 3348900 | 63629100 |
| 3400-4850 | 27 | 4125 | 54 | 111375 | -380 | 144400 | 3898800 |
| 4850-6300 | 11 | 5575 | 65 | 61325 | 1070 | 1144900 | 12593900 |
| 6300-7750 | 5 | 7025 | 70 | 35125 | 2520 | 6350400 | 31752000 |
| 7750-9200 | 5 | 8475 | 75 | 42375 | 3970 | 15760900 | 78804500 |
| 9200-10650 | 5 | 9925 | 80 | 49625 | 5420 | 29376400 | 146882000 |
| Total | N = 80 | | | ∑FM = 360450 | | | ∑F(M-X̅)² = 423627500 |

Working Notes:

For the number of classes:

According to the “2 to the k rule”, choose the smallest k such that 2^k > N, i.e., 2^k > 80.

• Trying k = 6: 2^6 = 64, which is less than 80, so 6 classes are not enough.
• Trying the next value, k = 7: 2^7 = 128, which is greater than 80, so the suggested number of classes is 7.

For the value of the class interval:

i = (highest value - lowest value)/k = (10575 - 505)/7 = 1438.57

(Let’s take i = 1450, since the class width should be rounded up to a convenient value such as a multiple of 10 or 100.)
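The working above can be sketched in a few lines of Python. The helper name `number_of_classes` is hypothetical, and rounding the width up to the next multiple of 50 is an assumption made here only so that the sketch reproduces the 1450 chosen in the text.

```python
import math

def number_of_classes(n: int) -> int:
    """Return the smallest k with 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

n, lowest, highest = 80, 505, 10575

k = number_of_classes(n)              # suggested number of classes
raw_width = (highest - lowest) / k    # unrounded class interval

# Assumption: round up to the next multiple of 50 to match the text's 1450.
class_width = math.ceil(raw_width / 50) * 50

print(k, round(raw_width, 2), class_width)
```

Rounding up rather than down keeps the 7 classes wide enough to cover the whole span from the minimum to the maximum value.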

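Mean (X̅):

X̅ = ∑FM/N = 360450/80 = 4505.625, taken as 4505 in the calculations

Standard deviation (σ):

σ = √(∑F(M-X̅)²/N) = √(423627500/80) = 2301.16

Median:

Median class = 3400-4850

Median = L + [{(∑F/2) - C.F}/F]*(H - L) = 3400 + [{(80/2) - 27}/27]*(4850 - 3400) = 4098.148 ≈ 4098

Here L and H are the lower and upper limits of the median class, F is its frequency, and C.F is the cumulative frequency of the class before it.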
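As a cross-check, the grouped mean, standard deviation, and median can be recomputed from the class limits and frequencies in the table. This is a minimal sketch using the same formulas, with the population form of the standard deviation (dividing by N) as in the text. The variable names are illustrative. Note that the sketch uses the unrounded mean of 4505.625 rather than 4505, which does not change the standard deviation at two decimal places.

```python
import math

# (lower limit, upper limit, frequency) for each class in the frequency table
classes = [
    (500, 1950, 8), (1950, 3400, 19), (3400, 4850, 27), (4850, 6300, 11),
    (6300, 7750, 5), (7750, 9200, 5), (9200, 10650, 5),
]

n = sum(f for _, _, f in classes)
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Grouped mean: sum of F*M divided by N
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Grouped standard deviation, dividing by N as in the text
variance = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)) / n
std_dev = math.sqrt(variance)

# Grouped median: find the class that contains the (N/2)-th observation
cumulative = 0  # cumulative frequency of the classes before the current one
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        grouped_median = lo + ((n / 2 - cumulative) / f) * (hi - lo)
        break
    cumulative += f

print(f"mean={grouped_mean:.3f}, sd={std_dev:.2f}, median={grouped_median:.2f}")
```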

For Range:

Range = Maximum value – Minimum value = 10575-505 =10070

Looking at the frequency table of bus maintenance costs for the Lincolnville School District, the data tend to cluster between 3400 and 4850, the class with the highest frequency (27 of the 80 observations). The mean maintenance cost is 4505 and the median cost is 4098.

Yes, one measure can be more representative of the typical cost than the others. The mean is less suitable for cost data with extreme values, because it uses every value and is pulled toward the extremes; in such cases the median is more appropriate. For data without large fluctuations, the mean is usually the better choice. Here the mean (4505) exceeds the median (4098), which suggests that the median is the more representative measure for these costs. So, depending on the situation, one measure can represent the data more accurately than the others.

The maintenance costs range from 505 to 10575, giving a range of 10070. The standard deviation is 2301.16.

Empirical rule = X̅ ± 2 σ

= 4505 ± (2*2301.16)

= 4505 ± 4602.32

= 4505 – (2 * 2301.16) to 4505+ (2 *2301.16)

= -97.32 to 9107.32

According to the Empirical Rule, about 95% of the maintenance costs lie within two standard deviations of the mean, so the low and high ends of the 95% interval are -97.32 and 9107.32. Because a cost cannot be negative, the lower limit is effectively 0.
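The interval arithmetic can be sketched as follows. The function name `k_sigma_interval` is hypothetical, and clipping the lower limit at zero is an added assumption based on costs not being negative.

```python
def k_sigma_interval(mean: float, sd: float, k: float) -> tuple[float, float]:
    """Return (mean - k*sd, mean + k*sd)."""
    return mean - k * sd, mean + k * sd

mean_cost = 4505.0   # grouped mean used in the text
std_dev = 2301.16    # grouped standard deviation from the text

low, high = k_sigma_interval(mean_cost, std_dev, 2)

# Assumption: a maintenance cost cannot be negative, so clip the lower limit at 0.
clipped_low = max(low, 0.0)

print(f"{low:.2f} to {high:.2f} (lower limit clipped to {clipped_low:.2f})")
```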

Make a box and whisker plot

| | Value | Difference |
|---|---|---|
| Minimum | 505 | 505 |
| Q1 | 3081 | 2576 |
| Q2 (Median) | 4178.5 | 1097.5 |
| Q3 | 5408 | 1229.5 |
| Maximum | 10575 | 5167 |

Fig: Box-and-whisker plot
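The differences in the table can be reproduced from the five-number summary, as in the sketch below. It uses the maximum of 10575 from the given information, which is the value consistent with the last difference of 5167. The variable names are illustrative, and comparing the two halves of the box is only a rough indicator of the direction of skew.

```python
# Five-number summary from the table (maximum taken from the given information)
minimum, q1, q2, q3, maximum = 505, 3081, 4178.5, 5408, 10575

lower_whisker = q1 - minimum   # Minimum to Q1
lower_half = q2 - q1           # Q1 to median
upper_half = q3 - q2           # median to Q3
upper_whisker = maximum - q3   # Q3 to Maximum

# If the median sits closer to Q1 than to Q3, the box suggests positive skew.
if upper_half > lower_half:
    skew = "positive"
elif upper_half < lower_half:
    skew = "negative"
else:
    skew = "symmetric"

print(lower_whisker, lower_half, upper_half, upper_whisker, skew)
```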

From the box-and-whisker plot above, we can see that the distribution is positively skewed: the median is not centrally located in the box, and the distance from the first quartile to the median (1097.5) is less than the distance from the median to the third quartile (1229.5).

## Author: Smirti
