- 1. Mathematics in the Modern World
- 2. Chapter 4: Data Management
- 3. 4.1 Descriptive Statistics
- 4. Statistics is a branch of mathematics that deals with data collection, organization, analysis, interpretation and presentation. Data collection is defined as the procedure of collecting, measuring and analyzing accurate insights for research using standard validated techniques. Data organization refers to the method of classifying and organizing data sets to make them more useful. It can be applied to physical or digital records.
- 5. Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Interpretation of data is the process of assigning meaning to the collected information and determining the conclusions, significance, and implications of the findings. Presentation of data refers to the organization of data into tables, graphs or charts, so that logical and statistical conclusions can be derived from the collected measurements.
- 6. Descriptive statistics helps describe the characteristics of a specific data set by giving short summaries about the sample and measures of the data. Basic Statistical Concepts: A population consists of the totality of the observations, and a sample is a part of the population. A variable is any characteristic, number, or quantity that can be measured or counted.
- 7. Two kinds of variables: 1. Qualitative variables, also called categorical variables, are variables that are not numerical. They describe data that fit into categories. 2. Quantitative variables are numerical. They can be ranked and have order.
- 8. Quantitative variables can be classified further into discrete variables and continuous variables. A discrete variable is a variable whose value is obtained by counting. Continuous variables can assume an infinite number of values between any two specific values. They are obtained by measuring. They often include fractions and decimals.
- 9. Examples
  - Discrete: number of students present; number of red marbles in a jar; number of heads when flipping three coins; students’ grade level
  - Continuous: height of students in class; weight of students in class; time it takes to get to school; distance traveled between classes
- 10. Types of Statistical Data 1.Numerical data. These data have meaning as a measurement such as a person’s height, weight, IQ, or blood pressure or shares of stocks a person owns. 2.Categorical data: Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female) but those numbers don’t have mathematical meaning.
- 11. Four Levels of Measurement
  1. Nominal – the lowest of the four ways to characterize data. It deals with names, categories, or labels (e.g., colors of eyes, yes or no responses to a survey, favorite breakfast cereal, and the number on the back of a football jersey).
  2. Ordinal – the data at this level can be ordered, but differences between the data are not meaningful (e.g., ten cities are ranked from one to ten, but differences between the ranks don't make much sense; letter grades can be ordered so that A is higher than B, but without any other information).
  3. Interval – deals with data that can be ordered, and in which differences between the data do make sense. But data at this level have no true starting point (e.g., Fahrenheit and Celsius scales of temperature).
  4. Ratio – the highest level of measurement. Data possess all of the features of the interval level, in addition to an absolute zero. Due to the presence of a zero, it now makes sense to compare the ratios of measurements.
- 12. 4.2 Data Collection Method
- 13. Methods of Collecting Data
  1. In-Person Interviews. Pros: in-depth, with a high degree of confidence in the data. Cons: time-consuming, expensive, and can be dismissed as anecdotal.
  2. Mail Surveys. Pros: can reach anyone and everyone, with no barrier. Cons: expensive, data collection errors, lag time.
  3. Phone Surveys. Pros: high degree of confidence in the data collected, can reach almost anyone. Cons: expensive, cannot be self-administered, need to hire an agency.
  4. Web/Online Surveys. Pros: cheap, can be self-administered, very low probability of data errors. Cons: not all your customers might have an email address or be on the internet, and customers may be wary of divulging information online.
- 14. Three Ways of Presenting Data 1.Textual – this method comprises data presentation with the help of a paragraph or a number of paragraphs. 2.Tabular – the method of presenting data using the statistical table. A systematic organization of data in columns and rows. 3.Graphical – a chart representing the quantitative variations or changes of variables in pictorial or diagrammatic form.
- 15. 4.3 Frequency Distribution
- 16. Frequency is the rate that measures how often something occurs. Example 1 Jack joins football practice every Wednesday morning, Sunday morning and Sunday afternoon. The frequency of Jack’s football practice every week is 3 (2 on Sunday and 1 on Wednesday). By counting frequencies we can make a frequency distribution table.
- 17. Example 2 Jack’s team has scored the following numbers of goals in their games: 3, 1, 2, 1, 3, 2, 4, 2, 3, 2, 5, 4, 3, 2. Jack put the numbers in order, then counted how often each value occurs:

  | Goals scored | Frequency |
  |---|---|
  | 1 | 2 |
  | 2 | 5 |
  | 3 | 4 |
  | 4 | 2 |
  | 5 | 1 |
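
The tally in Example 2 can be reproduced with a short script. This is a minimal sketch, not part of the original slides; the variable names `goals` and `frequency` are illustrative choices.

```python
from collections import Counter

# Goals scored by Jack's team, copied from Example 2.
goals = [3, 1, 2, 1, 3, 2, 4, 2, 3, 2, 5, 4, 3, 2]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(goals)

# Print the frequency distribution table in increasing order of goals.
for value in sorted(frequency):
    print(f"{value} goal(s): {frequency[value]} game(s)")
```

The frequencies add up to 14, the number of games listed, which is a quick check that no game was missed.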
- 18. Graphical Representation of Frequency Distribution A. Bar Graph is a pictorial representation of statistical data in such a way that length of the rectangles in the graph represents the proportional value of the variable. Bar graphs are generally used to compare the values of several variables at a time to analyze data. The length of the bars (horizontal or vertical) represents the frequency of the variable and is applicable to discrete categories only.
- 19. B. Line graph or Line chart is a graphical display of information that changes continuously over time. Within a line graph, there are points connecting the data to show a continuous change. The lines in a line graph can descend and ascend based on the data. We can also compare different events, situations, and information.
- 20. C. Pie Chart is a type of graph that displays data in a circular graph. The pieces of the graph are proportional to the fraction of the whole in each category. Each slice of the pie is relative to the size of that category in the group as a whole. The entire “pie” represents 100 percent of a whole, while the pie “slices” represent portions of the whole.
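
To illustrate how pie slices relate to the whole, the sketch below computes the percentage share of each category, reusing the goal counts from the frequency distribution example. Pairing that data with a pie chart is an assumption made for illustration; the slides do not do so.

```python
from collections import Counter

# Goals scored per game, reused from the frequency distribution example.
goals = [3, 1, 2, 1, 3, 2, 4, 2, 3, 2, 5, 4, 3, 2]
frequency = Counter(goals)
total = sum(frequency.values())

# Each slice is the category's fraction of the whole, expressed in percent.
shares = {value: 100 * count / total for value, count in sorted(frequency.items())}

for value, percent in shares.items():
    print(f"{value} goal(s): {percent:.1f}% of games")
```

By construction, the shares sum to 100 percent, matching the statement that the entire pie represents the whole.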
- 21. 4.4 Measures of Central Tendency
- 22. A. Mean It is the most common measure of central location. It is obtained by dividing the sum of all observed values by the number of observations. To compute the mean, we use $\bar{x} = \dfrac{\sum x}{n}$, where $x$ is the value of each observation in the sample and $n$ is the total number of observations in the sample. It is worth noting that the mean has the following characteristics: 1. The mean is affected by the presence of extreme values. 2. The sum of the deviations of the observations from the mean is zero. 3. The sum of the squared deviations of the observations from the mean is minimum. 4. It is a good measure for interval and ratio types of data.
- 23. B. Median It is the middle value of a set of observations arranged in increasing or decreasing order. This measure divides the data into two groups with an equal number of observations. The median has the following characteristics: 1. It is not affected by the presence of extreme observations. 2. The sum of absolute deviations of the observations from the median is minimum. 3. It is an appropriate measure for an ordinal type of data.
- 24. C. Mode It is the most repeated value, that is, the value that occurs the greatest number of times. Note that it is possible for a data set to have two modes. In such a case, the distribution of the data set is bimodal (with two modes). When a data set has more than two modes, the distribution is called a multimodal distribution. The mode has the following characteristics: 1. The mode is determined by frequency. 2. It is an appropriate measure for nominal data.
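- 25. Example 1 (for ungrouped data) The following are the 3rd year math grades of an applied math student: 1.6, 1.2, 1.9, 1.5, 1.5, 1.5, 1.0, 1.3, 1.0.
  Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_9}{9} = \dfrac{1.6 + 1.2 + 1.9 + 1.5 + 1.5 + 1.5 + 1.0 + 1.3 + 1.0}{9} = \dfrac{12.5}{9} \approx 1.39$
  Median: arranging the grades in increasing order gives 1.0, 1.0, 1.2, 1.3, 1.5, 1.5, 1.5, 1.6, 1.9, so the median is the fifth (middle) value, 1.5.
  Mode: 1.5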
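
The three measures in Example 1 can be checked with Python's standard `statistics` module. This is a sketch for verification only; the variable names are illustrative.

```python
import statistics

# Grades from Example 1 (ungrouped data).
grades = [1.6, 1.2, 1.9, 1.5, 1.5, 1.5, 1.0, 1.3, 1.0]

mean = statistics.mean(grades)        # sum of the values divided by their count
median = statistics.median(grades)    # middle value of the sorted data
modes = statistics.multimode(grades)  # list of all most-frequent values

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"mode(s) = {modes}")
```

`multimode` returns a list, so a bimodal or multimodal data set would show all of its modes rather than only one.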
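- 26. Example 2 (for grouped data) The mean for grouped data is given by $\bar{x} = \dfrac{\sum f_i x_i}{n}$, where $f_i$ is the frequency of the $i$th class interval and $x_i$ is the class mark of the $i$th interval. Solving for the mean (the less-than cumulative frequency, $< cf$, is accumulated from the lowest class upward):

  | Class limits | $f$ | $x$ | $fx$ | $< cf$ | Class boundaries |
  |---|---|---|---|---|---|
  | 60 – 67 | 2 | 63.5 | 127 | 30 | 59.5 – 67.5 |
  | 52 – 59 | 2 | 55.5 | 111 | 28 | 51.5 – 59.5 |
  | 44 – 51 | 6 | 47.5 | 285 | 26 | 43.5 – 51.5 |
  | 36 – 43 | 10 | 39.5 | 395 | 20 | 35.5 – 43.5 |
  | 28 – 35 | 7 | 31.5 | 220.5 | 10 | 27.5 – 35.5 |
  | 20 – 27 | 3 | 23.5 | 70.5 | 3 | 19.5 – 27.5 |

  $\bar{x} = \dfrac{\sum f_i x_i}{n} = \dfrac{127 + 111 + 285 + 395 + 220.5 + 70.5}{30} = \dfrac{1209}{30} = 40.3$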
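
The grouped mean can be recomputed from the class limits and frequencies alone. The sketch below assumes each class mark is the midpoint of its class limits, which matches the $x$ values in the table; the names `classes`, `total_fx` and `grouped_mean` are illustrative.

```python
# (lower limit, upper limit) and frequency for each class in Example 2.
classes = [
    ((60, 67), 2),
    ((52, 59), 2),
    ((44, 51), 6),
    ((36, 43), 10),
    ((28, 35), 7),
    ((20, 27), 3),
]

# n is the total number of observations (sum of the frequencies).
n = sum(f for _, f in classes)

# Sum of f * x, where x is the class mark (midpoint of the class limits).
total_fx = sum(f * (lower + upper) / 2 for (lower, upper), f in classes)

grouped_mean = total_fx / n
print(f"n = {n}, sum of fx = {total_fx}, mean = {grouped_mean:.1f}")
```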
- 27. The median for grouped data is given by $Md = LCB + \left(\dfrac{\frac{n}{2} - cf_p}{f_m}\right) i$, where $LCB$ is the lower boundary of the median class, $i$ is the size of the class interval, $cf_p$ is the cumulative frequency of the interval preceding (next lower than) the median class, and $f_m$ is the frequency of the median class. Median class: the class whose cumulative frequency is equal to $\frac{n}{2}$ or the next higher value.
- 28. Solving for the median: $\dfrac{n}{2} = \dfrac{30}{2} = 15$. Lower class boundary of the median class: $LCB = 35.5$. Cumulative frequency before the median class: $cf_p = 10$. Frequency of the median class: $f_m = 10$. Class size: $i = 8$.
  $Md = LCB + \left(\dfrac{\frac{n}{2} - cf_p}{f_m}\right) i = 35.5 + \left(\dfrac{15 - 10}{10}\right)(8) = 39.5$
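
The median computation can be sketched as a small function. It assumes equal class widths and integer class limits, so that each lower class boundary is the lower limit minus 0.5, as in the example's table; the function name `grouped_median` is hypothetical.

```python
def grouped_median(classes):
    """Median for grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples in any order.
    Assumes equal class widths and integer limits (boundary = limit - 0.5).
    """
    ordered = sorted(classes)                 # lowest class first
    n = sum(f for _, _, f in ordered)
    width = ordered[1][0] - ordered[0][0]     # class size i
    cumulative = 0                            # cumulative frequency below current class
    for lower, _, f in ordered:
        if cumulative + f >= n / 2:           # first class reaching n/2 is the median class
            lcb = lower - 0.5
            return lcb + (n / 2 - cumulative) / f * width
        cumulative += f


classes = [(60, 67, 2), (52, 59, 2), (44, 51, 6),
           (36, 43, 10), (28, 35, 7), (20, 27, 3)]
median = grouped_median(classes)
print(f"median = {median}")
```

Sorting the classes from lowest to highest makes the running total a less-than cumulative frequency, which is what the formula's $cf_p$ refers to.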
- 29. The mode for grouped data is given by $Mo = LCB + \left(\dfrac{f_m - f_1}{2f_m - f_1 - f_2}\right) i$, where $LCB$ is the lower boundary of the modal class, $i$ is the size of the class interval, $f_m$ is the frequency of the modal class, $f_1$ is the frequency of the class preceding (next lower than) the modal class, and $f_2$ is the frequency of the class following (next higher than) the modal class. Modal class: the class with the highest frequency.
- 30. Solving for the mode: $Mo = LCB + \left(\dfrac{f_m - f_1}{2f_m - f_1 - f_2}\right) i = 35.5 + \left(\dfrac{10 - 7}{20 - 7 - 6}\right)(8) \approx 38.9$
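
The mode formula can be sketched in the same way. The function below assumes a single modal class, equal class widths, and integer class limits (boundary = limit minus 0.5); treating a missing neighbouring class as frequency 0 is an additional assumption not stated in the slides. The name `grouped_mode` is hypothetical.

```python
def grouped_mode(classes):
    """Mode for grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples in any order.
    """
    ordered = sorted(classes)                      # lowest class first
    freqs = [f for _, _, f in ordered]
    k = freqs.index(max(freqs))                    # position of the modal class
    f_m = freqs[k]
    f_1 = freqs[k - 1] if k > 0 else 0             # class just below the modal class
    f_2 = freqs[k + 1] if k < len(freqs) - 1 else 0  # class just above the modal class
    width = ordered[1][0] - ordered[0][0]          # class size i
    lcb = ordered[k][0] - 0.5                      # lower boundary of the modal class
    return lcb + (f_m - f_1) / (2 * f_m - f_1 - f_2) * width


classes = [(60, 67, 2), (52, 59, 2), (44, 51, 6),
           (36, 43, 10), (28, 35, 7), (20, 27, 3)]
mode = grouped_mode(classes)
print(f"mode = {mode:.1f}")
```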
- 31. 4.5 Measures of Variability
- 32. Variability for Ungrouped Data
  • Range - The range (R) is defined as the difference between the highest value (HV) and the lowest value (LV) in the data. That is, $R = HV - LV$.
  • Variance - It is defined as the average of the squared deviations from the mean. It is the measure that considers the position of each observation relative to the mean. $s^2 = \dfrac{\sum_i (x_i - \bar{x})^2}{n - 1}$ or $s^2 = \dfrac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$
- 33. • Standard Deviation (the most widely encountered) - It is the measure of the spread or dispersion of scores from the mean of the distribution. It is the square root of the variance. $s = \sqrt{\dfrac{\sum_i (x_i - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\dfrac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$
  Variability for Grouped Data
  Range: $R = \text{Highest class mark} - \text{Lowest class mark}$
  Variance: $s^2 = \dfrac{n\sum fx^2 - \left(\sum fx\right)^2}{n(n - 1)}$
  Standard Deviation: $s = \sqrt{\dfrac{n\sum fx^2 - \left(\sum fx\right)^2}{n(n - 1)}}$
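
The slides give no worked example for variability, so the sketch below reuses the grades from Example 1 (ungrouped) and the class marks and frequencies from Example 2 (grouped). That reuse is an assumption made for illustration, and all variable names are hypothetical.

```python
import math

# Ungrouped data: grades from Example 1.
grades = [1.6, 1.2, 1.9, 1.5, 1.5, 1.5, 1.0, 1.3, 1.0]
n = len(grades)
mean = sum(grades) / n

value_range = max(grades) - min(grades)  # R = HV - LV

# Definitional form: sum of squared deviations over n - 1.
var_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)
# Computational form: (n * sum(x^2) - (sum x)^2) / (n (n - 1)).
var_shortcut = (n * sum(x * x for x in grades) - sum(grades) ** 2) / (n * (n - 1))
sd = math.sqrt(var_definition)

# Grouped data: (class mark, frequency) pairs from Example 2.
classes = [(63.5, 2), (55.5, 2), (47.5, 6), (39.5, 10), (31.5, 7), (23.5, 3)]
total = sum(f for _, f in classes)
sum_fx = sum(f * x for x, f in classes)
sum_fx2 = sum(f * x * x for x, f in classes)

grouped_range = max(x for x, _ in classes) - min(x for x, _ in classes)
grouped_var = (total * sum_fx2 - sum_fx ** 2) / (total * (total - 1))
grouped_sd = math.sqrt(grouped_var)

print(f"ungrouped: R = {value_range:.1f}, s^2 = {var_definition:.4f}, s = {sd:.4f}")
print(f"grouped:   R = {grouped_range}, s^2 = {grouped_var:.4f}, s = {grouped_sd:.4f}")
```

The two ungrouped variance formulas are algebraically equivalent, so `var_definition` and `var_shortcut` agree up to floating-point rounding.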
- 34. 4.6 Testing a Statistical Hypothesis
- 35. Hypothesis testing is the most significant area of statistical inference. It is a step-by-step process for making inferences (conclusions) about a population. The truth value of a statistical hypothesis can only be assessed when we take a portion of the population of interest and use the information obtained from this portion to decide whether the statistical hypothesis is likely to be true or false. We either “reject” the statistical hypothesis when the sample is inconsistent with it, or “do not reject” it otherwise. Note that rejecting a statistical hypothesis means concluding, on the basis of the sample evidence, that it is false, but accepting it does not necessarily mean it is true. Acceptance of the stated hypothesis only implies that there is not enough evidence to reject it.
- 36. Types of Statistical Hypothesis We use the term null hypothesis for the hypothesis we want to test, that is, to either reject or accept, denoted by H0. If the null hypothesis is rejected, the alternative hypothesis, denoted by H1, will then be accepted. The null hypothesis H0 is stated such that it specifies an exact value, while the alternative hypothesis H1 is stated such that it allows for a range of possible values. For example, if the null hypothesis H0 is 𝑥 = 8, the alternative hypothesis H1 might be 𝑥 < 8, 𝑥 > 8, or 𝑥 ≠ 8.
- 37. Types of Statistical Tests If the alternative hypothesis of a statistical test is one-sided, for example, H1: 𝑥 < 8 or H1: 𝑥 > 8, it is said to be a one-tailed test. On the other hand, if the alternative hypothesis is two-sided, for example, H1: 𝑥 ≠ 8, the test is said to be two-tailed. Types of Error Deciding whether to accept or reject a statistical hypothesis about a population parameter is critical because the decision might lead to a wrong conclusion. For instance, a researcher could reject H0 when, in fact, it is true. This is called a type I error. Also, one might accept H0 even when it is false. In this case, a type II error has occurred.
- 38. Constructing the Null and Alternative Hypothesis A. Testing for Means In hypothesis testing, means, variances, or proportions may be compared so as to justify the need to reject or accept the null hypothesis. In many instances, sample means are compared using experimental and control groups.
- 39. Example 1 1. A researcher wants to know if the average test score of the students taking a particular examination is 80. H0: 𝜇 = 80 (the average test score of the students taking a particular examination is 80) H1: 𝜇 ≠ 80 (the average test score of the students taking a particular examination is not 80) 2. A small group of researchers is conducting a study to show if the average number of hours a student spends on social media sites per day is greater than 10. H0: 𝜇 = 10 (average number of hours a student spends on social media sites per day is 10) H1: 𝜇 > 10 (average number of hours a student spends on social media sites per day is greater than 10)
- 40. 3. A teacher wants to know if there is a difference in the performance of his two classes based on their average grades. H0: 𝜇1 = 𝜇2 (there is no difference in the performance of his two classes based on their average grades) H1: 𝜇1 ≠ 𝜇2 (there is a difference in the performance of his two classes based on their average grades) 4. A researcher wants to study if the customer satisfaction level of cable television company A is greater than that of cable television company B. H0: 𝜇1 = 𝜇2 (the customer satisfaction levels of the two competing cable television companies are the same) H1: 𝜇1 > 𝜇2 (the customer satisfaction level of cable television company A is greater than that of cable television company B)
- 41. 5. A clinical trial is conducted to compare three different weight loss programs based on the average weight measured among three groups at the end of the program. H0: 𝜇1 = 𝜇2 = 𝜇3 (there is no difference among the three weight loss programs) H1: at least two of the means are not equal (there is a difference among the three weight loss programs)
- 42. B. Testing for Independence The chi-square ($\chi^2$) test is used to test the independence of two variables. In other words, this test is used to determine whether the two variables are related or not, based on the sample selected for each variable. Example 2 1. A survey is conducted to test if the grades of the students are associated with the number of hours they spend on social media sites. H0: The grades of the students are not associated with the number of hours they spend on social media sites. H1: The grades of the students are associated with the number of hours they spend on social media sites. 2. A study shows that the daily consumption depends on the age level of a person. H0: The daily consumption does not depend on the age level of a person. H1: The daily consumption depends on the age level of a person.
- 43. C. Correlation To determine whether two variables (usually x and y) are linearly related, correlation is the statistical method to be used. In this method, the data collected on two numerical variables are tested to determine the strength of their relationship, estimated by the sample correlation coefficient $r$ given by $r = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$, where $-1 \le r \le 1$ and $n$ = number of data pairs.
- 44. If the value of $r$ is close to positive 1, then there is a strong positive linear relationship between the two variables. If $r$ is close to negative 1, there is a strong negative linear relationship between them. However, if the two variables have a weak or no linear relationship, $r$ is close to 0. Example 3 1. A study is conducted to show how strong the relationship is between the sleeping habits of employees and their level of performance at work. H0: The sleeping habits of employees are not related to their level of performance at work. H1: The sleeping habits of employees are related to their level of performance at work. 2. A student wants to know if his grade in Mathematics is associated with his grade in English. H0: His grade in Mathematics is not associated with his grade in English. H1: His grade in Mathematics is associated with his grade in English.
- 45. 3. A researcher wishes to see whether there is a relationship between number of hours of study and test scores on an exam. The following data were obtained.

  | Student | Hours of Study | Grade |
  |---|---|---|
  | A | 7 | 83 |
  | B | 3 | 63 |
  | C | 2 | 60 |
  | D | 6 | 88 |
  | E | 3 | 68 |
  | F | 4 | 75 |
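- 46. Solution: To solve for the correlation coefficient $r$, we must first find the values of $xy$, $x^2$, and $y^2$.

  | Student | Hours of Study ($x$) | Grade ($y$) | $xy$ | $x^2$ | $y^2$ |
  |---|---|---|---|---|---|
  | A | 7 | 83 | 581 | 49 | 6889 |
  | B | 3 | 63 | 189 | 9 | 3969 |
  | C | 2 | 60 | 120 | 4 | 3600 |
  | D | 6 | 88 | 528 | 36 | 7744 |
  | E | 3 | 68 | 204 | 9 | 4624 |
  | F | 4 | 75 | 300 | 16 | 5625 |
  | | $\sum x = 25$ | $\sum y = 437$ | $\sum xy = 1922$ | $\sum x^2 = 123$ | $\sum y^2 = 32451$ |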
- 47. Substituting the values into the formula, $r = \dfrac{(6)(1922) - (25)(437)}{\sqrt{\left[6(123) - (25)^2\right]\left[6(32451) - (437)^2\right]}} = 0.934$. Since the correlation coefficient is close to +1, it indicates a strong positive linear relationship between the number of hours of study and the students' test scores on the exam.
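
The column sums and the value of r can be verified directly from the raw data. This is a sketch using only the standard library; the variable names are illustrative.

```python
import math

# Data pairs from the example: hours of study (x) and grade (y).
hours = [7, 3, 2, 6, 3, 4]
grades = [83, 63, 60, 88, 68, 75]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)

# Sample correlation coefficient, using the computational formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x^2={sum_x2}, y^2={sum_y2}")
print(f"r = {r:.3f}")
```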
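- 48. D. Regression Computing the correlation coefficient means determining the strength of the relationship between two numerical variables. When the resulting correlation coefficient is significant, regression analysis can be done. Regression is used to understand the movement or trend of the given data so that predictions can be made. The regression equation is given by $y' = a + bx$, with
  $a = \dfrac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\sum x^2 - \left(\sum x\right)^2}$ and $b = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$,
  where $a$ is the $y'$-intercept, $b$ is the slope of the line, and $n$ is the number of data pairs.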
- 49. Example 4 Let us take the example in the correlation section, since a strong linear relationship exists between the number of hours of study and the students' test scores on the exam. Solution: Since $xy$, $x^2$, and $y^2$ are necessary to solve for $a$ and $b$, we must find them first.

  | Student | Hours of Study ($x$) | Grade ($y$) | $xy$ | $x^2$ | $y^2$ |
  |---|---|---|---|---|---|
  | A | 7 | 83 | 581 | 49 | 6889 |
  | B | 3 | 63 | 189 | 9 | 3969 |
  | C | 2 | 60 | 120 | 4 | 3600 |
  | D | 6 | 88 | 528 | 36 | 7744 |
  | E | 3 | 68 | 204 | 9 | 4624 |
  | F | 4 | 75 | 300 | 16 | 5625 |
  | | $\sum x = 25$ | $\sum y = 437$ | $\sum xy = 1922$ | $\sum x^2 = 123$ | $\sum y^2 = 32451$ |
- 50. Then we have
  $a = \dfrac{(437)(123) - (25)(1922)}{6(123) - (25)^2} = 50.451$ and $b = \dfrac{(6)(1922) - (25)(437)}{6(123) - (25)^2} = 5.372$.
  Hence, the equation of the regression line is $y' = 50.451 + 5.372x$. Suppose we want to know the grade ($y'$) of a student who studies for $x$ hours. For example, let $x = 9$. Then $y' = 50.451 + 5.372(9) = 98.80$. Let $x = 5$. Then $y' = 50.451 + 5.372(5) = 77.31$.
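
The regression coefficients and the two predictions can be reproduced from the raw data. This is a minimal sketch; the helper `predict` is a hypothetical name, and the code carries unrounded coefficients, so its results can differ from the hand computation in the last decimal place.

```python
# Data pairs from the example: hours of study (x) and grade (y).
hours = [7, 3, 2, 6, 3, 4]
grades = [83, 63, 60, 88, 68, 75]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)

# Both coefficients share the same denominator.
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # intercept
b = (n * sum_xy - sum_x * sum_y) / denominator       # slope


def predict(x):
    """Predicted grade y' for x hours of study, using y' = a + b x."""
    return a + b * x


print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted grade for 9 hours: {predict(9):.2f}")
print(f"predicted grade for 5 hours: {predict(5):.2f}")
```

Predictions are most trustworthy for hours of study within the observed range of the data (2 to 7 hours here); a value such as 9 hours extrapolates beyond it.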