# CIRCA:Basic Statistical Analysis

##### T-Test

One statistic useful in text analysis is the average length of the words used. Similar vocabularies will generally produce similar distributions of word length in characters. A test for authorship, for example, could assume that all works by the same author have a similar vocabulary and therefore a similar average word length. A t-test is a statistical method that compares the means of two samples and asks whether the samples could have been drawn from distributions with the same mean.
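As a rough sketch of what the test computes, the R snippet below works through the Welch form of the two-sample t-test by hand and compares it with R's built-in `t.test`, which uses that form by default. The vectors `x` and `y` are made-up numbers standing in for word lengths, not data from any real text.

```r
# Toy samples (made-up values, for illustration only)
x <- c(4, 5, 3, 6, 5, 4, 7, 5)
y <- c(6, 7, 8, 6, 9, 7, 8, 7)

# Variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in sample means over its standard error
t.manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
df.manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(-abs(t.manual), df.manual)

# Compare with the built-in test (Welch is the default, var.equal = FALSE)
result <- t.test(x, y)
all.equal(unname(result$statistic), t.manual)
all.equal(unname(result$parameter), df.manual)
all.equal(result$p.value, p.manual)
```

A large absolute t value relative to the degrees of freedom gives a small p-value, which is evidence against the two samples sharing a mean.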

The following R code (using the openNLP package) applies a t-test to sentence length, measured in characters per sentence:

```
# Load openNLP; sentDetect() is assumed to be available, as in older releases of the package
library(openNLP)

# Import two books from Project Gutenberg, one element per line of text
pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

# Tokenize the text into sentences
pride.sen <- sentDetect(pride)
flat.sen <- sentDetect(flat)

# Get the character count of each sentence
pride.nchar <- nchar(pride.sen)
flat.nchar <- nchar(flat.sen)

# Perform the t-test on the two sets of sentence lengths
t.test(pride.nchar, flat.nchar)
```

This produces the following output:

```
data:  pride.nchar and flat.nchar
t = -8.1743, df = 1713.104, p-value = 5.726e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.28293 -24.07959
sample estimates:
mean of x mean of y
125.7389  157.4202
```

The p-value in this case is far below 1%, so there is a statistically significant difference in average sentence length between Pride and Prejudice (about 126 characters per sentence) and Flatland (about 157 characters per sentence).
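The example above measures sentence length in characters. For the word-length statistic described in the introduction, a minimal sketch using only base R might look like the following. The helper `word.lengths` and the two toy texts are made up for illustration; in practice the `pride` and `flat` vectors read in earlier could be passed in their place.

```r
# Hypothetical helper: character length of every word in a character vector
word.lengths <- function(text) {
  # Split on any run of characters that is not a letter or an apostrophe
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  # Drop empty strings left over by leading punctuation
  words <- words[nchar(words) > 0]
  nchar(words)
}

# Toy texts (made up, not taken from either book)
text.a <- c("The cat sat on the mat.", "A dog ran to the big red bus.")
text.b <- c("Extraordinary circumstances necessitate considerable deliberation.",
            "Philosophical arguments frequently demonstrate remarkable complexity.")

# Word lengths for each text
a.nchar <- word.lengths(text.a)
b.nchar <- word.lengths(text.b)

# Compare mean word length with a t-test
word.test <- t.test(a.nchar, b.nchar)
word.test
```

Splitting on non-letter characters is a deliberately simple tokenizer; it treats hyphenated words as separate words and ignores digits, which may or may not suit a given corpus.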