Seren Corpus
About the Corpus
In linguistics, a corpus is a large, structured set of texts used for statistical analysis and hypothesis testing, for example checking occurrences or validating linguistic rules within a specific language territory.
The Seren Corpus is a growing collection of articles taken from the English-language pages of Wikinews, with an emphasis on the latest news items, to reflect how English is currently used online.
Other Text Analysis Software
Since 1999, in collaboration with the late Dr John Olsson, we have developed a range of tools to help analyse texts, including tools for counting word occurrences, comparing phrases in two separate texts, and calculating the percentage of words in common across texts. These are free for you to use. Please contact us to license versions with no limits on text size, or for additional bespoke tools for textual analysis. Mike Slater thetext.co.uk
Word Occurrence Script
Occurrence of words in a text based on word length
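As a rough illustration of what counting word occurrences by word length could look like, here is a minimal Python sketch. It is not the script offered on this site: the function name `word_occurrences_by_length` and the tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def word_occurrences_by_length(text):
    """Count each word, then group the counts by word length in characters."""
    # Assumed tokenisation: lowercase the text, keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    by_length = defaultdict(dict)
    for word, n in counts.items():
        by_length[len(word)][word] = n
    return dict(by_length)


if __name__ == "__main__":
    sample = "The cat saw the dog near a river"
    for length, words in sorted(word_occurrences_by_length(sample).items()):
        print(length, words)
```

A different tokenisation (for example keeping hyphenated words or digits) would change the counts, so the rule should be chosen to suit the texts being analysed.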
Comparing Phrases
Phrases common to two texts, starting at six words in length, then five, four, three and two
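The comparison described above can be sketched as finding word sequences of a given length that appear in both texts, working down from six words to two. The Python below is only an illustration under stated assumptions, not the tool itself: the names `tokenize`, `ngrams` and `shared_phrases` are hypothetical, and matching is case-insensitive and ignores punctuation.

```python
import re


def tokenize(text):
    """Assumed tokenisation: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(words, n):
    """Return the set of all n-word sequences in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(text_a, text_b, longest=6, shortest=2):
    """Map each phrase length (longest first) to the phrases found in both texts."""
    words_a, words_b = tokenize(text_a), tokenize(text_b)
    result = {}
    for n in range(longest, shortest - 1, -1):
        common = ngrams(words_a, n) & ngrams(words_b, n)
        result[n] = sorted(" ".join(phrase) for phrase in common)
    return result


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog"
    b = "A quick brown fox jumps high"
    for n, phrases in shared_phrases(a, b).items():
        print(n, phrases)
```

In this sketch a long shared phrase also shows up as its shorter sub-phrases at the lower lengths; a real tool might choose to suppress those.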
Percentage of Words in Common
Number of words in common and the number of instances of each shared word
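One possible reading of this measure is sketched below in Python. The exact formula used by the tool is not stated here, so the definition is an assumption: the percentage is taken as the number of distinct words shared by both texts divided by the number of distinct words across both texts. The function name `words_in_common` and the tokenisation rule are likewise hypothetical.

```python
import re
from collections import Counter


def words_in_common(text_a, text_b):
    """Report shared words, their counts in each text, and a percentage shared."""
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    counts_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", text_b.lower()))

    shared = sorted(set(counts_a) & set(counts_b))
    all_words = set(counts_a) | set(counts_b)

    # Assumed definition: shared distinct words over all distinct words.
    percent = 100.0 * len(shared) / len(all_words) if all_words else 0.0

    # For each shared word: (instances in text A, instances in text B).
    instances = {word: (counts_a[word], counts_b[word]) for word in shared}
    return {
        "shared_count": len(shared),
        "percent_shared": percent,
        "instances": instances,
    }


if __name__ == "__main__":
    report = words_in_common("The cat sat on the mat", "The dog sat")
    print(report["shared_count"], round(report["percent_shared"], 1))
    print(report["instances"])
```

Other definitions are equally plausible, such as dividing by the vocabulary of one text only or weighting by word frequency, and they would give different percentages for the same pair of texts.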