# CIRCA:Basic Statistical Analysis

## Revision as of 22:02, 10 April 2013

##### T-Test

One statistic useful in text analysis is average word length. Similar vocabularies generally produce similar distributions of word length in characters. A test for authorship, for example, could assume that all works by the same author share a similar vocabulary and therefore a similar average word length. A t-test is a statistical method that compares the means of two samples and asks whether the two samples plausibly come from distributions with the same mean, or whether the difference is larger than chance alone would explain. The example below applies the test to sentence length, measured in characters, rather than to word length.
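The word-length comparison described above can be sketched in base R. This is only an illustration under stated assumptions: `word_lengths` is a hypothetical helper, the two strings are short opening passages used as stand-ins for whole books, and the simple split on non-letters is an assumption about what counts as a word. With so few words the result says nothing reliable about either author.

```r
# Hypothetical helper: split a string into words and return each word's length.
# Assumption: a "word" is any run of the letters a-z after lower-casing.
word_lengths <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]  # drop empty strings left by leading punctuation
  nchar(words)
}

# Short illustrative samples, not the full texts.
sample_a <- "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."
sample_b <- "I call our world Flatland, not because we call it so, but to make its nature clearer to you, my happy readers."

len_a <- word_lengths(sample_a)
len_b <- word_lengths(sample_b)

# Compare the mean word length of the two samples.
result <- t.test(len_a, len_b)
print(result)
```

For a real comparison the same helper would be applied to the full text of each book, which gives the test thousands of observations per sample instead of a few dozen.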

The following R code uses the openNLP package:

    # sentDetect() below comes from the openNLP package
    library(openNLP)

    # Import two books from Project Gutenberg, one element per line of text
    pride <- scan(file="http://www.gutenberg.org/cache/epub/1342/pg1342.txt", what='char', sep="\n")
    flat <- scan(file="http://www.gutenberg.org/cache/epub/97/pg97.txt", what='char', sep="\n")

    # Split the text into sentences
    pride.sen <- sentDetect(pride)
    flat.sen <- sentDetect(flat)

    # Get the character count of each sentence
    pride.nchar <- nchar(pride.sen)
    flat.nchar <- nchar(flat.sen)

    # Perform the t-test on the two sets of sentence lengths
    t.test(pride.nchar, flat.nchar)

This produces the following output:

    data:  pride.nchar and flat.nchar
    t = -8.1743, df = 1713.104, p-value = 5.726e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -39.28293 -24.07959
    sample estimates:
    mean of x mean of y
     125.7389  157.4202

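The p-value in this case is far below 1%, so the difference between the mean sentence length of Pride and Prejudice (about 126 characters) and Flatland (about 157 characters) is statistically significant. The fractional degrees of freedom indicate Welch's t-test, which `t.test` uses by default and which does not assume that the two samples have equal variance.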
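To make the `t`, `df` and `p-value` fields of such output less opaque, the sketch below recomputes them by hand and checks them against `t.test`. The vectors `x` and `y` are made-up numbers chosen for illustration, not data from either book, and the formulas are the standard Welch statistic with the Welch-Satterthwaite approximation for the degrees of freedom.

```r
# Made-up sentence lengths in characters; not taken from either book.
x <- c(120, 95, 143, 110, 131, 88)
y <- c(160, 151, 172, 139, 166, 158, 149)

# Squared standard error of each mean, without pooling the variances.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df_manual)

# The built-in test should agree with the manual values.
res <- t.test(x, y)
print(all.equal(unname(res$statistic), t_manual))
print(all.equal(unname(res$parameter), df_manual))
print(all.equal(res$p.value, p_manual))
```

The sign of `t` only reflects which sample is passed first: a negative value means the first sample has the smaller mean, as in the output above.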