Begin with Statistics
Statistics: the conceptual approach
Gudmund R. Iversen and Mary Gergen
Statistics: the conceptual approach is a book that has sat unread on my shelf for a long time. I took a beginner statistics class at school many years ago and learned how to use SPSS, but I now remember almost none of it. When we were doing data analysis for the Hekumu project and for the data team task, I noticed that statistics plays a very important part. How do we collect data, organize it, and draw conclusions in data analysis? With these questions in mind, I opened this book.
Key takeaway points
Early statistical activity focused on state governments collecting demographic data on their citizens, so statistics was originally a word for state-related indicators. Later, statistics began to refer to the collection of individual data points. As a field of study, statistics can be defined as a set of concepts, principles, and methods for collecting data, analyzing data, and drawing conclusions from data (Chapter 1.6). This book focuses on the basic concepts of statistics, data collection, data description, probability, drawing conclusions, and some data analysis methods. The content of the book is very rich, and I didn't read it word by word. Here I will share some of the parts that interested me.
Measurement of data analytics
The measure of central tendency: mode, median, or mean? These three most common averages are used to describe the central tendency of a sample. When we have a set of data, how do we know which measure to use? (See Table 1.)
We often face situations where people have to choose one answer from several, and voting is generally the way to make such a decision. When all the proposals and their votes have been counted, we check whether the mode, that is, the option with the most votes, received more than half of the total votes. If it did, the mode is the final answer. If there are two modes, the final answer can be chosen between them by drawing lots, tossing a coin, or negotiating.
The median is the value in the middle position when a set of data is sorted in numerical order. It is not affected by the largest or smallest values. When the median monthly income of people in a city is 5,000 euros, half of the people have a monthly income below this figure and the other half have a monthly income above it.
The mean is the most commonly used average. Because it is calculated from the sum of all the observations in the data set, it is sensitive to individual extreme values. To decide whether to use the mean or the median for a set of data, the authors' suggestion is to work out both: if their values are close, use the mean; if they are very different, use the median.
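The voting rule described above can be sketched in a few lines of Python. This is only an illustration: the function names `majority_winner` and `modes` and the sample ballots are made up for this post and do not come from the book.

```python
from collections import Counter

def modes(votes):
    """Return every option that shares the highest vote count, sorted."""
    counts = Counter(votes)
    highest = max(counts.values())
    return sorted(option for option, n in counts.items() if n == highest)

def majority_winner(votes):
    """Return the mode if it has more than half of all votes, else None."""
    if not votes:
        return None
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else None

# Hypothetical ballots: "A" receives 3 of the 5 votes.
ballots = ["A", "B", "A", "C", "A"]
print(majority_winner(ballots))

# Hypothetical tie: "A" and "B" are both modes, so there is no majority
# and the choice has to be settled another way (lots, coin toss, negotiation).
tied = ["A", "B", "A", "B", "C"]
print(modes(tied), majority_winner(tied))
```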
| Measure | Advantages | Disadvantages | Typical use |
|---|---|---|---|
| Mode | Easy to get from the data set; less sensitive to extreme values in the data set | Conveys very limited information from the data set | Reflects the level of central tendency. In daily life, phrases such as “the best..” and “the most voted..” are related to the mode |
| Median | Obtained by sorting; less sensitive to extreme values in the data set | Does not use the observations other than the middle one | Often used when the histogram of the data shows an asymmetric distribution |
| Mean | Uses every observation, so it carries more information | Very sensitive to extreme values in the data set | Mostly used for data without big differences between values |
Table 1. The mode, the median, and the mean (Chapter 4.1)
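To see how the three averages react to an extreme value, here is a small sketch using Python's standard `statistics` module. The income figures are invented for illustration only and the helper name `summarize` is my own.

```python
import statistics

def summarize(values):
    """Compute the three common measures of central tendency."""
    return {
        "mode": statistics.mode(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
    }

# Hypothetical monthly incomes in euros.
incomes = [2800, 3000, 3000, 3200, 3500]
# The same data with one extreme value added.
with_outlier = incomes + [50000]

print(summarize(incomes))
print(summarize(with_outlier))
```

With the extreme value added, the mode stays at 3000 and the median only moves from 3000 to 3100, while the mean jumps from 3100 to more than 10,000. This is the situation in which the suggestion to compare the mean and the median, and to prefer the median when they differ a lot, becomes useful.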
The measure of differences: range, standard deviation, and standard error. The range is very simple to get (the difference between the largest and the smallest observations), and it is a useful number. Standard deviation is the most commonly used statistic when we need to consider the dispersion of the data. Roughly speaking, it describes the typical distance of the observations from the mean: the further the values lie from the mean, the greater the standard deviation (Chapter 4.2). Standard error is the standard deviation of a sample statistic, such as the mean, across repeated samples. Simply put, when we want to know how dispersed the individual scores in a sample are, we use the standard deviation; when we want to know how much a sample statistic varies from one sample to another, we use the standard error.
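As a rough sketch of these three measures, the snippet below computes them for a small made-up sample. It assumes the usual textbook estimate of the standard error of the mean, the sample standard deviation divided by the square root of the sample size; I have not checked that this is exactly how the book's formula page writes it.

```python
import math
import statistics

# Hypothetical sample of five observations.
sample = [4.0, 6.0, 5.0, 7.0, 8.0]

# Range: largest observation minus smallest observation.
data_range = max(sample) - min(sample)

# Sample standard deviation (divides by n - 1).
sd = statistics.stdev(sample)

# Standard error of the mean, assuming the common estimate sd / sqrt(n).
se = sd / math.sqrt(len(sample))

print(f"range={data_range}, sd={sd:.3f}, se={se:.3f}")
```

The standard error is smaller than the standard deviation because a mean of several observations varies less from sample to sample than a single observation does.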
The Normal distribution
The normal distribution (see Figure 1), also known as the Gaussian distribution or the bell curve, is a very common continuous probability distribution. It is important in statistics and is often used in the natural and social sciences to represent random variables whose distributions are not known (normal distribution, Wiki). This distribution has two parameters: the mean and the standard deviation. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.
In fact, many things in our daily life are approximately normally distributed. For instance, in modern IQ tests, the raw score is transformed into a normal distribution with a mean of 100 and a standard deviation of 15 (IQ, Wiki). Besides, employee performance, exam results, product quality, and so on are often described as following a normal distribution.
Figure 1. Normal distribution (the bell curve)
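Taking the IQ example above (mean 100, standard deviation 15), the proportions under the bell curve can be worked out with `NormalDist` from Python's standard library. The cut-off scores 85, 115, and 130 are my own choice for illustration.

```python
from statistics import NormalDist

# IQ scores modelled as a normal distribution with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

# Share of scores within one standard deviation of the mean (85 to 115).
within_one_sd = iq.cdf(115) - iq.cdf(85)

# Share of scores more than two standard deviations above the mean (over 130).
above_130 = 1 - iq.cdf(130)

print(f"between 85 and 115: {within_one_sd:.1%}")
print(f"above 130: {above_130:.1%}")
```

About 68% of scores fall within one standard deviation of the mean, and only about 2% lie more than two standard deviations above it.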
Probability and Decision Analysis
The authors provide an introduction to probability and how to use it in decision analysis. In business management, many decisions determine whether the enterprise can make a profit and whether it can develop and expand. Especially for start-ups, investment in products, hardware, technology, and talent requires more scientific analysis in order to make decisions that are conducive to long-term development. It was new to me that investment risk can be analyzed using probabilistic methods.
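One simple way to picture a probabilistic look at an investment is to compare expected values. The sketch below is my own illustration, not an example from the book: the two options, their probabilities, and their payoffs are entirely hypothetical.

```python
def expected_value(outcomes):
    """Expected payoff of a list of (probability, payoff) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical option 1: launch a new product (payoffs in euros).
new_product = [(0.3, 50_000), (0.5, 10_000), (0.2, -40_000)]
# Hypothetical option 2: keep the current product, a certain smaller gain.
keep_current = [(1.0, 8_000)]

print(round(expected_value(new_product)))
print(round(expected_value(keep_current)))
```

In this made-up case the new product has the higher expected payoff, but it also carries a 20% chance of a loss, so the expected value alone does not settle the decision.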
Regression analysis describes how a change in one or more independent variables affects the dependent variable (Chapter 10). One example is the following formula:
Estimated energy = 36.1 + 15.3 * fat
If we see a candy bar in the shop that contains 3 g of fat, we can estimate its caloric value. This is called simple linear regression, as it uses a single independent variable to predict the value of the dependent variable (Chapter 10.3). When several independent variables affect one dependent variable, it is called multiple regression. Another example is that we can estimate a product's price by finding the relationship between the price and the factors that influence it and building a regression model. Regression is a useful method because it lets us estimate the impact of a change in an independent variable.
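The candy bar estimate can be worked out directly from the formula above. The second half of the sketch shows how such coefficients are obtained from data with ordinary least squares. The data points there are hypothetical and are generated from the formula itself, so the fit simply recovers 36.1 and 15.3. The function names are my own.

```python
def estimated_energy(fat_grams):
    """Apply the formula quoted above: 36.1 + 15.3 * fat."""
    return 36.1 + 15.3 * fat_grams

def fit_line(xs, ys):
    """Ordinary least squares for one independent variable.

    Returns (intercept, slope) of the best-fitting straight line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# A candy bar with 3 g of fat.
print(round(estimated_energy(3), 1))

# Hypothetical, noise-free data points lying exactly on the line above.
fat = [1, 2, 3, 4, 5]
energy = [estimated_energy(f) for f in fat]
intercept, slope = fit_line(fat, energy)
print(round(intercept, 1), round(slope, 1))
```

Real data would scatter around the line, and the fitted intercept and slope would then be estimates rather than exact values.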
A feature of this book is that it describes the basic principles and methods of statistics by solving practical problems, with plenty of pictures that make it easier for readers to understand. Each chapter begins with examples and questions and ends with summaries, additional readings, formulas, and exercises. For me, it is not so convenient that, after reading the definition and application of a concept, I have to turn to the formula page for the explanation.
In addition to the above, the authors introduce topics such as how to collect data, how to describe data, probability distributions for discrete variables, estimation and hypothesis testing, chi-square analysis for two categorical variables, ANOVA, multivariate analysis, and so on. I couldn't understand all the content just by reading it. But as an introductory book for statistics majors, I think it broadens my horizons and gives me a deeper understanding of data analysis. It will help me to conduct research, understand journal articles better, and develop analytical thinking skills, which will be a good foundation for the research and data analysis for my thesis next year.
Gudmund R. Iversen, Mary Gergen, 1997. Statistics: the conceptual approach
Normal distribution: https://en.wikipedia.org/wiki/Normal_distribution