Monday, April 27, 2015

Introduction to statistics

Statistics deals with all aspects of the collection, processing, presentation, and interpretation of measurements or observations, that is, with aspects of the handling of data. Thus, data constitutes the raw material we deal with in statistics, and its collection is of major concern in any statistical investigation.

Data is obtained by taking measurements, by counting, by asking questions, or by referring to data made available in published form. Note that we said "data constitutes" and "data is," even though data is the plural form of datum, a term that in actual practice is rarely used. We do this because the data is usually regarded as one unit of information and, hence, is used with a singular verb.

The origins of modern statistics can be traced to two areas which, on the surface, have very little in common: government (political science) and games of chance.

Governments have long used censuses to count persons and property. The ancient Romans used this technique to assist in the taxation of their subjects; indeed, the Bible tells how Mary and Joseph, subjects of Rome, went to Bethlehem to have their names listed in a census. Another famous census is reported in the Domesday Book of William of Normandy, completed in the year 1086. This census covered most of England, listing its economic resources, including property owners and the land which they owned. The U.S. census of 1790 was the first "modern" census, but government agents merely counted the population. More recently, U.S. censuses have become much wider in scope, providing a wealth of information about the population and the economy. They are conducted every ten years (in the years that end in zero, such as 1980 and 1990).

The problem of describing, summarizing, and analyzing census data led to the development of methods which, until recently, constituted almost all there was to the subject of statistics. These methods, which originally consisted mainly of presenting the most important features of data by means of tables and charts, are now referred to as descriptive statistics. To be more specific, this term applies to anything done to data that does not infer anything which goes (generalizes) beyond the data itself. Thus, when the government reported on the basis of census counts that the population of the United States was 248,718,301 people in 1990 and 281,471,906 in 2000, this belonged to the field of descriptive statistics. This would also be the case if we calculated the corresponding percentage growth which, as can easily be verified, was 13.2%, but not if we had used the 1990 data to predict, say, the population of the United States in the year 2020. Such a prediction goes beyond the available information, and we must shift our emphasis from descriptive statistics to statistical inference or inductive statistics.
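As a quick check of that 13.2% figure, here is a minimal sketch of the percentage-growth calculation in Python; the two census counts are the ones quoted above.

    # Percentage growth of the U.S. population from the 1990 census to the 2000 census
    pop_1990 = 248_718_301
    pop_2000 = 281_471_906

    growth = (pop_2000 - pop_1990) / pop_1990 * 100
    print(round(growth, 1))   # 13.2

Note that this calculation merely describes the two counts already in hand; using them to extrapolate to 2020 would be statistical inference.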

Statistics has grown from the art of constructing charts and tables to the science of basing decisions on numerical data, or even more generally the science of decision making in the face of uncertainty. It is here that we must use statistical methods which find their origin in games of chance.

Games of chance date back thousands of years, as evidenced, for example, by the use of astragali (the forerunners of dice) in Egypt about 3500 B.C., but the mathematical study of such games began in the year 1654, when Blaise Pascal (a mathematician) wrote to Pierre de Fermat (another mathematician) with regard to a gaming problem. They solved the problem independently, using different mathematical methods. It may seem surprising that it took so long, but until then chance was looked on as an expression of divine intent, and it would have been impious (showing a lack of respect for God or religion), or even sacrilegious (when the sacrilegious offence is verbal, it is called blasphemy, and when physical, it is often called desecration), to analyze the "mechanics" of the supernatural through mathematics.

Although the mathematical study of games of chance, called probability theory, dates back to the seventeenth century, it was not until the early part of the nineteenth century that the theory developed for "heads or tails," for example, or "red or black" or "even or odd," was applied also to real-life non-gambling situations where outcomes were "boy or girl," "life or death," "pass or fail," and so forth.
Thus probability theory was applied to many problems in the social as well as the physical sciences, and nowadays it provides an important tool for the analysis of any situation in business, in science, or in everyday life where there is an element of uncertainty or risk. In particular, it provides the basis for methods which we use when we generalize from observed data, namely, when we use the methods of statistical inference.

Fundamentally, there are two types of data: numerical and categorical.

Numerical data consists of numbers that can be treated by ordinary arithmetical methods. For instance, if we counted the numbers of passengers on three buses, we might get 32, 41, and 28 passengers. To obtain the total number of passengers, we simply add the values 32, 41, and 28 and obtain 101. If necessary, we can multiply, divide, subtract, raise to powers, and extract roots of these values.
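For illustration, a small sketch of ordinary arithmetic applied to these passenger counts (the numbers are the ones from the example above):

    # Ordinary arithmetic on numerical data: passenger counts for three buses
    passengers = [32, 41, 28]

    total = sum(passengers)            # 101
    average = total / len(passengers)  # about 33.7
    print(total, round(average, 1))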

Categorical data results from data being sorted into nonnumerical categories. For instance, when an interviewer determines whether a person is single (never married), married, a widow or widower, or divorced, this information has been categorized. Questions that result in a choice of answers such as yes or no; agree or disagree; true or false; or poor, acceptable, or superior tasting, also result in categorical data. For ease in manipulating categorical data, it is sometimes coded. In the marital-status illustration, the categories of single (never married), married, widow/widower, and divorced could be assigned code numbers of 1, 2, 3, and 4, respectively.
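A minimal sketch of how such coding might be done; the category labels and code numbers are the ones from the marital-status illustration, while the list of responses is made up for illustration.

    # Coding categorical data: marital status mapped to code numbers
    codes = {"single": 1, "married": 2, "widow/widower": 3, "divorced": 4}

    responses = ["married", "single", "married", "divorced"]   # hypothetical responses
    coded = [codes[r] for r in responses]
    print(coded)   # [2, 1, 2, 4]

The code numbers are labels only; averaging them, for example, would not be meaningful.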

Data may also be classified as nominal, ordinal, interval, or ratio.

Nominal data cannot be manipulated arithmetically; analysis of it instead focuses on the frequency in each category, showing clearly the number of respondents who fall into each class.

Ordinal data can be rank ordered.

If we can form differences, but not multiply or divide, we refer to the data as interval data.

For ratio data we can form quotients, that is, divide one quantity by another quantity of the same kind. Both the dividend and divisor must be expressed in the same units.
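The following sketch contrasts the four levels with small examples; the marital-status categories and passenger counts come from earlier in this post, while the taste ratings and temperature readings are hypothetical values assumed here for illustration.

    from collections import Counter

    # Nominal: labels only -- we can count frequencies but not do arithmetic
    marital_status = ["married", "single", "married", "divorced"]
    print(Counter(marital_status))        # number of respondents in each class

    # Ordinal: categories that can be rank ordered (an order, but no distances)
    taste_ratings = ["poor", "acceptable", "superior"]

    # Interval: differences are meaningful, quotients are not
    temps_f = [40.0, 80.0]                # degrees Fahrenheit
    print(temps_f[1] - temps_f[0])        # a 40-degree difference makes sense,
                                          # but 80 F is not "twice as hot" as 40 F

    # Ratio: quotients of like quantities are meaningful
    passengers = [32, 41, 28]
    print(passengers[1] / passengers[2])  # about 1.46 times as many passengers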

If a set of data consists of all conceivably possible (or hypothetically possible) observations of a certain phenomenon, we call it a "population"; if a set of data consists of only a part of these observations, we call it a "sample."

In all statistical studies that use samples, great care must be exercised to ensure that the data lends itself to valid generalizations. A key issue here is the question of bias. A sample is said to be biased if it is not representative of the population that it is supposed to represent. Every precaution must be taken to avoid inadvertent biases. It is, of course, unethical to introduce deliberate biases to prove particular points.
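One common precaution is to draw a simple random sample, so that every member of the population has the same chance of being selected. A minimal sketch, assuming the population is just a list of hypothetical numbered individuals:

    import random

    # A hypothetical population of 10,000 numbered individuals
    population = list(range(10_000))

    # A simple random sample of 100 of them -- each individual equally likely to be chosen
    sample = random.sample(population, k=100)
    print(len(sample), sample[:5])

Selecting respondents by convenience instead (for example, only those easiest to reach) would risk exactly the kind of bias described above.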
