Week 2 DQ1 – Data Analysis and Business Intelligence | Westcliff University

Note: Only for Reference

In the second week of our Data Analysis and Business Intelligence course, we covered chapters 3 and 4 in full and started chapter 5. We discussed measures of location (mean, median, mode), measures of dispersion (range, variance, standard deviation), charts for displaying and exploring data (dot plots, stem-and-leaf displays, box plots), measures of position, skewness and the coefficient of variation, and the interpretation of scatter diagrams and contingency tables.

We also looked at some basic concepts of probability: ways of assigning probability, types of probability, mutually exclusive events, the rules of addition and multiplication, and the complement rule. This week 2 discussion question focuses mainly on the concepts of chapters 3 and 4.

“Mean is affected by extreme values but the median is not”.

It is a true statement that “Mean is affected by extreme values but the median is not”. The mean is the average of a set of values, computed by dividing the sum of all values by the number of values. The median is the value that divides the ordered data into two halves, with half of the observations below it and half above it.

The median is the value at the center of the data once it is sorted, whether in ascending or descending order, so it depends on the position of the values rather than on their size. The mean, on the other hand, uses every value, so a few very large or very small values at either end pull it toward them. Therefore, the mean is affected by extreme values but the median is not.
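As a small illustration of this point, the sketch below compares the mean and median of a made-up list of costs before and after one extreme value is added. The numbers are hypothetical and are not taken from the Lincolnville data.

```python
from statistics import mean, median

# Hypothetical costs, invented for illustration only.
costs = [400, 450, 500, 550, 600]
costs_with_outlier = costs + [5000]  # add one extreme value

# Without the extreme value, the mean and median agree.
print(mean(costs), median(costs))

# With the extreme value, the mean jumps while the median barely moves.
print(mean(costs_with_outlier), median(costs_with_outlier))
```

In this example the mean rises from 500 to 1250, while the median only moves from 500 to 525.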

Numerical Illustration for Lincolnville School District bus data

Given Information:

  • Number of observations (N) = 80
  • Minimum Value = 505
  • Maximum Value = 10575

Class Interval | Frequency (F) | Mid Value (M) | Cum. Frequency (C.F) | F*M    | (M-X̅) | (M-X̅)²  | F(M-X̅)²
500-1950       | 8             | 1225          | 8                    | 9800   | -3280  | 10758400 | 86067200
1950-3400      | 19            | 2675          | 27                   | 50825  | -1830  | 3348900  | 63629100
3400-4850      | 27            | 4125          | 54                   | 111375 | -380   | 144400   | 3898800
4850-6300      | 11            | 5575          | 65                   | 61325  | 1070   | 1144900  | 12593900
6300-7750      | 5             | 7025          | 70                   | 35125  | 2520   | 6350400  | 31752000
7750-9200      | 5             | 8475          | 75                   | 42375  | 3970   | 15760900 | 78804500
9200-10650     | 5             | 9925          | 80                   | 49625  | 5420   | 29376400 | 146882000

N = 80

∑FM = 360450

∑F(M-X̅)² = 423627500

Working Notes:

Number of classes

According to the “2 to the k rule”, choose the smallest k such that 2^k > N, i.e. 2^k > 80.

Trying k = 6: 2^6 = 64, which is less than 80, so 6 classes are not enough.

Trying the next value, k = 7: 2^7 = 128, which is greater than 80, so the suggested number of classes is 7.

Class interval

i = (highest value - lowest value)/k

i = (10575 - 505)/7

i = 1438.57

(Let’s take i = 1450 approximately, since the class interval is usually rounded up to a convenient multiple of 10 or 100.)
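The working notes above can be sketched in Python as follows. The function names are hypothetical, and rounding the interval up to the next multiple of 50 is an assumption chosen here because it reproduces the 1450 used in the text.

```python
import math

def number_of_classes(n):
    """Smallest k with 2**k > n (the "2 to the k" rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def class_width(lowest, highest, k):
    """Raw class interval: the range of the data divided by the number of classes."""
    return (highest - lowest) / k

def round_up_to_multiple(value, multiple):
    """Round value up to the next multiple (the choice of multiple is an assumption)."""
    return math.ceil(value / multiple) * multiple

k = number_of_classes(80)                 # 7, since 2**6 = 64 and 2**7 = 128
raw_width = class_width(505, 10575, k)    # about 1438.57
width = round_up_to_multiple(raw_width, 50)
print(k, round(raw_width, 2), width)
```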

 

 

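Mean (X̅)

X̅ = ∑FM/N = 360450/80

= 4505.625, taken as 4505 in the table and in the answers below

Standard Deviation (σ)

σ = √(∑F(M-X̅)²/N)

= √(423627500/80)

= 2301.16

Median

Median Class = 3400-4850 (the first class whose cumulative frequency reaches N/2 = 40)

Median = L + [{(∑F/2)-C.F}/F]*(H-L)

where L and H are the lower and upper limits of the median class, F is its frequency, and C.F is the cumulative frequency of the class before it.

= 3400 + [{80/2-27}/27]*(4850-3400)

= 4098.148

≈ 4098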
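To check the arithmetic, the grouped-data formulas used here can be sketched in Python. The function names are hypothetical; the class limits and frequencies are copied from the frequency table. The exact grouped mean comes out as 4505.625, which the text takes as 4505, so the standard deviation below is computed around 4505 to match the table.

```python
import math

# (lower limit, upper limit, frequency) for each class in the frequency table
classes = [
    (500, 1950, 8), (1950, 3400, 19), (3400, 4850, 27), (4850, 6300, 11),
    (6300, 7750, 5), (7750, 9200, 5), (9200, 10650, 5),
]

def grouped_mean(classes):
    """Sum of frequency * class midpoint, divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_std(classes, mean):
    """Population standard deviation of the class midpoints around a given mean."""
    n = sum(f for _, _, f in classes)
    ss = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes)
    return math.sqrt(ss / n)

def grouped_median(classes):
    """Linear interpolation inside the first class whose cumulative frequency reaches n/2."""
    n = sum(f for _, _, f in classes)
    cum = 0  # cumulative frequency of the classes before the current one
    for lo, hi, f in classes:
        if cum + f >= n / 2:
            return lo + (n / 2 - cum) / f * (hi - lo)
        cum += f

print(grouped_mean(classes))
print(round(grouped_std(classes, 4505), 2))
print(round(grouped_median(classes), 2))
```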

 

For Range:

Range = Maximum value – Minimum value = 10575-505 =10070

  1. Answer

Looking at the frequency table of bus maintenance costs for the Lincolnville School District, the data tend to cluster between 3400 and 4850, the class with the highest frequency (27 of the 80 observations). The mean maintenance cost was determined to be 4505, with a median cost of 4098.

Yes, one measure can be more representative of the typical cost than the others. The mean is less suitable for cost data with extreme values, because it uses every value and is pulled toward the extremes, so the median is more appropriate in such cases. Here the mean (4505) is higher than the median (4098), which suggests that a few high costs pull the mean upward. For data without large fluctuations, the mean is usually the better choice. So, depending on the situation, one measure can represent the data more accurately than the others.

  2. Answer

The range of the maintenance costs is 10070, from a minimum of 505 to a maximum of 10575. After evaluation, the standard deviation was determined to be 2301.16.

Empirical Rule: about 95% of the observations lie within X̅ ± 2σ

= 4505 ± (2*2301.16)

= 4505 ± 4602.32

= 4505 - (2*2301.16) to 4505 + (2*2301.16)

= -97.32 to 9107.32

According to the Empirical Rule, about 95% of the maintenance costs lie within two standard deviations of the mean, so the low and high ends of the 95% interval are -97.32 and 9107.32. Since a cost cannot be negative, the lower end is effectively 0.
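The interval above can be reproduced with a few lines of Python. This is a minimal sketch with a hypothetical helper name; it only applies the X̅ ± 2σ arithmetic to the mean and standard deviation quoted in the text.

```python
def empirical_interval(mean, std, k=2):
    """Return (low, high) for mean ± k standard deviations."""
    return mean - k * std, mean + k * std

# Mean and standard deviation as quoted in the text.
low, high = empirical_interval(4505, 2301.16)
print(round(low, 2), round(high, 2))
```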

 

  3. Answer
 
Make a box and whisker plot

Statistic   | Value  | Difference (from the row above)
Minimum     | 505    | 505
Q1          | 3081   | 2576
Q2 (Median) | 4178.5 | 1097.5
Q3          | 5408   | 1229.5
Maximum     | 10575  | 5167

Fig: Box-and-Whisker Plot

From the box-and-whisker plot above, we can see that the distribution is positively skewed: the median is not centrally located in the box, the distance from the first quartile to the median (1097.5) is less than the distance from the median to the third quartile (1229.5), and the upper whisker is much longer than the lower one.
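The comparison of distances can be checked with a short Python sketch. The function name is hypothetical, the quartile values are those listed for the box plot, and the maximum is taken as 10575 from the given information.

```python
def quartile_gaps(minimum, q1, q2, q3, maximum):
    """Distances between consecutive points of the five-number summary."""
    return {
        "lower_whisker": q1 - minimum,
        "q1_to_median": q2 - q1,
        "median_to_q3": q3 - q2,
        "upper_whisker": maximum - q3,
    }

gaps = quartile_gaps(505, 3081, 4178.5, 5408, 10575)

# The median sits closer to Q1 than to Q3, which points to positive skew.
median_closer_to_q1 = gaps["q1_to_median"] < gaps["median_to_q3"]
print(gaps, median_closer_to_q1)
```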

 

Author: Smirti
