Basic Statistical Analysis

From CIRCA

T-Test

One statistic useful in text analysis is the average word length. Texts with similar vocabularies will generally produce similar distributions of word length in characters. A test for authorship, for example, could assume that all works by the same author have a similar vocabulary and therefore a similar average word length. A T-test is a statistical method that compares the means of two samples and asks whether the difference between them is larger than would be expected if both samples came from the same distribution. The same approach works for other length measures; the example below compares sentence length in characters.
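As a minimal sketch of the word-length idea, the following base R code splits two short texts into words and compares their mean word lengths. The helper name `word_lengths` and both sample strings are made up for illustration; a real comparison would use full texts and far more words.

```r
# Hypothetical helper: split text into lowercase words and return
# the number of characters in each word.
word_lengths <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nchar(words) > 0]   # drop empty strings left by the split
  nchar(words)
}

# Illustrative samples (invented, not taken from any book)
sample_a <- "The cat sat on the mat and then it ran off to the old barn."
sample_b <- "Extraordinary circumstances occasionally necessitate considerable deliberation."

len_a <- word_lengths(sample_a)
len_b <- word_lengths(sample_b)

# Compare the mean word length of the two samples
result <- t.test(len_a, len_b)
print(result)
```

With samples this small the test has little practical meaning; the point is only to show the shape of the computation.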

Using R (with the openNLP package):


# load the openNLP package, which provides sentDetect()
library(openNLP)

# import two books from Project Gutenberg
pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

# split the text into sentences
pride.sen <- sentDetect(pride)
flat.sen <- sentDetect(flat)

# get the character count of each sentence
pride.nchar <- nchar(pride.sen)
flat.nchar <- nchar(flat.sen)

# perform the T-test
t.test(pride.nchar, flat.nchar)

This produces the following output:


data:  pride.nchar and flat.nchar
t = -8.1743, df = 1713.104, p-value = 5.726e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -39.28293 -24.07959
sample estimates:
mean of x mean of y
 125.7389  157.4202

The p-value in this case is far below 1%, so the difference between the average sentence length (in characters) of Pride and Prejudice and that of Flatland is statistically significant.
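The fractional degrees of freedom in the output are consistent with Welch's unequal-variance test, which is what `t.test` runs by default (`var.equal = FALSE`). As a rough sketch of where the `t` and `df` values come from, the code below recomputes both by hand and compares them with the result of `t.test`. The vectors `x` and `y` are made-up numbers, not the book data.

```r
# Two small illustrative samples (made-up numbers, not the book data)
x <- c(112, 98, 140, 125, 131, 104, 119)
y <- c(150, 162, 139, 171, 158, 149)

# Squared standard error of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() uses var.equal = FALSE by default, i.e. Welch's test
fit <- t.test(x, y)

# Both comparisons should report TRUE
print(all.equal(unname(fit$statistic), t_manual))
print(all.equal(unname(fit$parameter), df_manual))
```

The same arithmetic applies to `pride.nchar` and `flat.nchar`; only the sample sizes differ.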
