# CIRCA:Basic Statistical Analysis

(Difference between revisions)
 Revision as of 20:41, 10 April 2013 (view source)Recharti (Talk | contribs) (→What Statistics is Statistics)← Older edit Current revision as of 22:27, 10 April 2013 (view source)Recharti (Talk | contribs) (→Rule of Large Numbers) (8 intermediate revisions not shown) Line 1: Line 1: - [This page is not finished.] = Basic Statistical Analysis = = Basic Statistical Analysis = Line 5: Line 4: == Definition == == Definition == - ==== What is Statistics ==== + ==== What are Statistics ==== Statistics is a quantitative and numerical method that is primarily involved with the collection and interpretation of statistical data. Statistics are frequently used to find relationships between data sets, extracting important properties from data sets, and visualizations. Statistics is a quantitative and numerical method that is primarily involved with the collection and interpretation of statistical data. Statistics are frequently used to find relationships between data sets, extracting important properties from data sets, and visualizations. Line 17: Line 16: ==== Random Number ==== ==== Random Number ==== - Statistics assumes that all measurments are random. The true measurement of any numerical value is not a specific value but instead follows some form of probability distribution. In the general case, the standard distribution is assumed; however, there are numerous other mathematical distributions available for comparison. + Statistics are generally collected from a representative sample of a larger group of object. Statistics always assumes that each measurement is a single randomly generated value that follows a distribution property associated with the larger group. The most commonly assumed probability distribution is the standard distribution which assumes that most measurements will be close to the classes mean value; variations from the mean are possible but get rarer as distance from the mean increases. One of the main goals of statistics is to find this probability distribution and use it to make inferences about objects outside the representative set. + + There are numerous other mathematical distribution available for study. ==== Rule of Large Numbers ==== ==== Rule of Large Numbers ==== - Statistics are generally collected for a subset of the group of objects under study. For example, as it is nearly impossible to collect data on every book ever written, a representative subset can be studied instead. Any subset of a larger dataset will not accuratly represent the true distribution of data. + The rule of large numbers is a simple relationship between the distribution of a sample set, and the distribution of the entire class of objects. In general, larger sample sizes better represent the class as a whole. - The Rule of large numbers governs the relationship between subsets and data sets. It states that the larger the subset the more accuratly the distribution of the subset will reflect the larger set. For example, rolling a six sided dice sixty times will produce a distribution closer to the actual distribution then simply rolling the same dice ten times. + For a dice has an even probability distribution. In theory each side of a dice has equal probability of coming up if rolled. If I were to roll ten dice, then the probability distribution will probably not come up even with some outcomes coming up much more frequently then others. If I were to roll the same dice sixty times, the distribution would be much more even. The rule of large numbers stats that the more I roll the dice, the closer the outcomes will be. + + [[IMage:Dice.png|Dice distribution]] ==== The P Value ==== ==== The P Value ==== - Because statistics makes use of probabilities, there is allways a chance that percived relationships can emerge from data where no relationship actually exists. Most statistical methods will produce a P value which is effectivly the probability of such a coincodence. A low P value indicates a high probability that there is real relationship in the data while a high P value indicates that any percived relationship is probobly due to random fluctuations in the data. + Because statistics makes use of probabilities, there is always a chance that perceived relationships can emerge from data where no relationships actually exists. Most statistical methods will produce a P-value which is effectively the probability of such a coincidence. A low P value indicates a high confidence that the outcome does not represent random fluctuations in the data. High p-values represent low confidence and indicate no relationship beyond simple noise. - It is important to pick a P value prior to performing any statistical analysis. In general a P value less then 5% is desired; however, for important or sensative data P values lower then 1% and lower are generally desired. + It is important to pick a P value prior to performing any statistical analysis. In general a P value less then 5% is desired; however, for important or sensitive data P values of 1% or lower are generally desired. - ==== Interpreting Corelation ==== + ==== Interpreting Correlation ==== - A strong statistical relationship between two data sets does not imply that one causes the other. It only implies that one is a good predictor of the other, or that these two values are somehow related. Statistical relationships cannot be used to prove that one object is causally linked to another object; only that these two objects have a relationship that merits further study. + Statistics never produce a causal model of anything that it analyses. A strong statistical relationship between two data sets does not imply that one causes the other. It only implies that one is a good predictor of the other, or that these two values are somehow related. Statistical relationships cannot be used to prove that one object is causally linked to another object; only that these two objects are somehow related. ==Examples == ==Examples == - ===== T - Test ===== ===== T - Test ===== - ===== Linear Regression ===== + One statistic useful in text analysis is the average length of word that is used. Similar vocabularies will generally produce similar distributions in the character length of a word. A test for authorship, for example, could assume that all works by the same author would have a similar vocabulary and therfore a similar average word length. T-Tests are a statistical method that compares the mean of two sample sets and asks if these two sets represent the same distribution. + + Using R code (using openNLP package) + + #import two books from project Gutenburg + pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n") + flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n") + + #tokenize sentances + pride.sen <- sentDetect(pride) + flat.sen <- sentDetect(flat) + + #get character count for each word + pride.nchar <- nchar(pride.sen) + flat.nchar <- nchar(flat.sen) + + #perform T-test + t.test(pride.nchar, flat.nchar) + + Which produces as output. + + data:  pride.nchar and flat.nchar + t = -8.1743, df = 1713.104, p-value = 5.726e-16 + alternative hypothesis: true difference in means is not equal to 0 + 95 percent confidence interval: + -39.28293 -24.07959 + sample estimates: + mean of x mean of y + 125.7389  157.4202 + + The P value in this case is much lower then 1% so there is a strong statistical difference between the average sentance length of 'Pride and Prejudice' and 'Flatland'. ==References and Further Readings== ==References and Further Readings== + + [http://en.wikipedia.org/wiki/Statistics Statistics Wikipedia Article] + + [http://www.r-project.org/ R Project Website] + + [http://www.vassarstats.net/ VassarStats Website] + + [http://www.gutenberg.org/ Project Gutenburg]

# Basic Statistical Analysis

## Definition

#### What are Statistics

Statistics is a quantitative and numerical method that is primarily involved with the collection and interpretation of statistical data. Statistics are frequently used to find relationships between data sets, extracting important properties from data sets, and visualizations.

As Stats are primarily a numerical method, they only produce facts about the numbers themselves and not the objects under study. A statistical relationship is only as good as the validity of the data being analyzed. A firm understanding of both the statistical method and the numerical interpretations are important in order to get the most out of any statistical analysis.

## How Statistics Work

Statistics makes a number of very broad assumptions about any data set that is being studied. The most important assumptions will be discussed below.

#### Random Number

Statistics are generally collected from a representative sample of a larger group of object. Statistics always assumes that each measurement is a single randomly generated value that follows a distribution property associated with the larger group. The most commonly assumed probability distribution is the standard distribution which assumes that most measurements will be close to the classes mean value; variations from the mean are possible but get rarer as distance from the mean increases. One of the main goals of statistics is to find this probability distribution and use it to make inferences about objects outside the representative set.

There are numerous other mathematical distribution available for study.

#### Rule of Large Numbers

The rule of large numbers is a simple relationship between the distribution of a sample set, and the distribution of the entire class of objects. In general, larger sample sizes better represent the class as a whole.

For a dice has an even probability distribution. In theory each side of a dice has equal probability of coming up if rolled. If I were to roll ten dice, then the probability distribution will probably not come up even with some outcomes coming up much more frequently then others. If I were to roll the same dice sixty times, the distribution would be much more even. The rule of large numbers stats that the more I roll the dice, the closer the outcomes will be.

#### The P Value

Because statistics makes use of probabilities, there is always a chance that perceived relationships can emerge from data where no relationships actually exists. Most statistical methods will produce a P-value which is effectively the probability of such a coincidence. A low P value indicates a high confidence that the outcome does not represent random fluctuations in the data. High p-values represent low confidence and indicate no relationship beyond simple noise.

It is important to pick a P value prior to performing any statistical analysis. In general a P value less then 5% is desired; however, for important or sensitive data P values of 1% or lower are generally desired.

#### Interpreting Correlation

Statistics never produce a causal model of anything that it analyses. A strong statistical relationship between two data sets does not imply that one causes the other. It only implies that one is a good predictor of the other, or that these two values are somehow related. Statistical relationships cannot be used to prove that one object is causally linked to another object; only that these two objects are somehow related.

## Examples

##### T - Test

One statistic useful in text analysis is the average length of word that is used. Similar vocabularies will generally produce similar distributions in the character length of a word. A test for authorship, for example, could assume that all works by the same author would have a similar vocabulary and therfore a similar average word length. T-Tests are a statistical method that compares the mean of two sample sets and asks if these two sets represent the same distribution.

Using R code (using openNLP package)

```#import two books from project Gutenburg
pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

#tokenize sentances
pride.sen <- sentDetect(pride)
flat.sen <- sentDetect(flat)

#get character count for each word
pride.nchar <- nchar(pride.sen)
flat.nchar <- nchar(flat.sen)

#perform T-test
t.test(pride.nchar, flat.nchar)
```

Which produces as output.

```data:  pride.nchar and flat.nchar
t = -8.1743, df = 1713.104, p-value = 5.726e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.28293 -24.07959
sample estimates:
mean of x mean of y
125.7389  157.4202
```

The P value in this case is much lower then 1% so there is a strong statistical difference between the average sentance length of 'Pride and Prejudice' and 'Flatland'.