CIRCA:Basic Statistical Analysis
T-Test
One statistic useful in text analysis is the average length of the words used. Similar vocabularies will generally produce similar distributions of word length in characters. A test for authorship, for example, could assume that all works by the same author share a similar vocabulary and therefore a similar average word length. A t-test is a statistical method that compares the means of two samples and asks whether the difference between them is larger than would be expected if both samples came from distributions with the same mean. The same approach applies to other length measures, such as the sentence length used in the example below.
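As a minimal sketch of the word-length comparison described above, the following uses only base R. The two passages are made up for illustration and are not taken from any corpus, and the variable names (text.a, len.a and so on) are hypothetical. Splitting on whitespace is a simplifying assumption; a real analysis would also strip punctuation and use much larger samples.

```r
# Two short illustrative passages (invented for this sketch, not real data)
text.a <- "The cat sat on the mat and then ran off to the old red barn"
text.b <- "Statistical considerations frequently complicate otherwise straightforward authorship investigations"

# Split on whitespace to get words (assumption: no punctuation to remove)
words.a <- strsplit(text.a, "\\s+")[[1]]
words.b <- strsplit(text.b, "\\s+")[[1]]

# Character count of each word
len.a <- nchar(words.a)
len.b <- nchar(words.b)

# Two-sample t-test on the word lengths (Welch's test is R's default)
result <- t.test(len.a, len.b)
print(result)
```

With samples this small the result only illustrates the mechanics; it says nothing about authorship.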
Using R (with the openNLP package):
# load openNLP (assumed: a release that still exports sentDetect())
library(openNLP)

# import two books from Project Gutenberg
pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

# tokenize sentences
pride.sen <- sentDetect(pride)
flat.sen <- sentDetect(flat)

# get character count for each sentence
pride.nchar <- nchar(pride.sen)
flat.nchar <- nchar(flat.sen)

# perform t-test
t.test(pride.nchar, flat.nchar)
This produces the following output:
data:  pride.nchar and flat.nchar
t = -8.1743, df = 1713.104, p-value = 5.726e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -39.28293 -24.07959
sample estimates:
mean of x mean of y
 125.7389  157.4202
The p-value in this case is far below 1%, so the difference between the average sentence length (in characters) of Pride and Prejudice and that of Flatland is statistically significant.
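To make the reported t, df and p-value less opaque, the sketch below recomputes them by hand and checks them against t.test(). It assumes Welch's unequal-variance test, which is what t.test() runs by default for two samples and which is consistent with the fractional df in the output above. The vectors x and y are hypothetical sentence lengths invented for illustration, not values from either book.

```r
# Hypothetical sentence lengths in characters (invented for illustration only)
x <- c(120, 95, 143, 110, 131, 88, 126)
y <- c(160, 149, 171, 138, 182, 155)

# Per-sample variance of the mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch's t statistic: difference in means over its estimated standard error
t.manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df.manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(-abs(t.manual), df.manual)

# The same quantities as reported by t.test()
res <- t.test(x, y)
print(all.equal(unname(res$statistic), t.manual))
print(all.equal(unname(res$parameter), df.manual))
print(all.equal(res$p.value, p.manual))

# Decision at the 1% significance level
print(res$p.value < 0.01)
```

Reading the p-value from the returned object (res$p.value) rather than from the printed output makes it easier to apply the same threshold across many pairs of texts.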