CIRCA:Basic Statistical Analysis

Revision as of 22:02, 10 April 2013 by Recharti
T-Test

One statistic useful in text analysis is average word length. Texts that draw on similar vocabularies will generally show similar distributions of word length in characters. A test for authorship, for example, could assume that all works by the same author share a similar vocabulary and therefore a similar average word length. The t-test is a statistical method that compares the means of two samples and asks whether the observed difference is larger than would be expected if both samples came from distributions with the same mean.
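The word-length comparison described above can be sketched in base R without any extra packages. This is only a minimal illustration: the helper name word.lengths and the regular expression used to split words are assumptions made for this sketch, and the two samples are just the opening sentences of Pride and Prejudice and Flatland, far too short to support any real conclusion.

```r
# Hypothetical helper (not part of any package): split a text into
# words and return the character length of each word.
word.lengths <- function(text) {
  # Lower-case, then split on runs of anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nchar(words) > 0]  # drop empty strings left by the split
  nchar(words)
}

# Tiny stand-in samples; real use would load the full texts instead
sample.a <- "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."
sample.b <- "I call our world Flatland, not because we call it so, but to make its nature clearer to you, my happy readers, who are privileged to live in Space."

a.len <- word.lengths(sample.a)
b.len <- word.lengths(sample.b)

# Two-sample t-test on the word lengths (Welch's test is R's default)
result <- t.test(a.len, b.len)
print(result)
```

With full-length texts, the vectors passed to t.test() would hold one entry per word, so the test compares the mean word length of the two books.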

Using R (with the openNLP package), the same approach can be applied to sentence length, measured in characters:

library(openNLP)

#import two books from Project Gutenberg
pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

#tokenize sentences
pride.sen <- sentDetect(pride)
flat.sen <- sentDetect(flat)

#get character count for each sentence
pride.nchar <- nchar(pride.sen)
flat.nchar <- nchar(flat.sen)

#perform T-test
t.test(pride.nchar, flat.nchar)

This produces the following output:

data:  pride.nchar and flat.nchar 
t = -8.1743, df = 1713.104, p-value = 5.726e-16
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
-39.28293 -24.07959 
sample estimates:
mean of x mean of y 
125.7389  157.4202 

The p-value in this case is far below 1%, so the difference in mean sentence length (in characters) between Pride and Prejudice and Flatland is statistically significant.
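To see where the t, df, and p-value figures in output like this come from, the following sketch recomputes them by hand and compares them with t.test(). It assumes Welch's unequal-variance test, which is R's default for two samples and is consistent with the non-integer degrees of freedom reported above. The data are simulated stand-ins, not values taken from either book, and the variable names are chosen for illustration only.

```r
# Simulated stand-in data (hypothetical, not taken from either book)
set.seed(42)
x <- rnorm(200, mean = 125, sd = 80)
y <- rnorm(150, mean = 155, sd = 95)

# Estimated variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means divided by its standard error
t.manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df.manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(-abs(t.manual), df = df.manual)

# Compare against R's built-in test
builtin <- t.test(x, y)
print(c(t = t.manual, df = df.manual, p = p.manual))
print(builtin)
```

The hand-computed values should agree with those reported by t.test(), which shows that the test only needs each sample's mean, variance, and size.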
