Text Processing

The crawling agent follows a site's hierarchical structure until it has amassed 50 megabytes of textual data, including invisible elements such as embedded markup tags or scripting elements. This limit helps manage the available storage space and makes sites of heterogeneous size comparable. The following paragraphs describe how the system analyzes more than ten gigabytes of unstructured Web data each week.

  • Pre-Processing: The first step is to remove a document's invisible elements. Splitting the mirrored Web documents into sentences then makes it possible to remove redundant segments, e.g. newspaper headlines that appear on multiple pages and would otherwise bias the results.
  • Word Frequencies: To determine the number of references to candidates or environmental topics, a pattern matching algorithm compares the words in a sentence with a list of regular expressions. Such a list has to consider all the common inflections of a term while excluding general expressions with potentially ambiguous meaning.
  • Semantic Orientation: To evaluate the attitude reflected in a sentence, the system measures the distance between the target word and 7,455 positive and negative words taken from a tagged dictionary. Starting from the categories of the General Inquirer, reverse lemmatization yielded additional terms for the analysis by adding plurals, gerund forms, past tense suffixes and other syntactical variations (e.g. manipulate → manipulates, manipulating, manipulated). For this purpose, we used an adapted version of the English lemma list provided by Y. Someya.
  • Keyword Analysis: This step identifies topics associated with the presidential candidates by comparing term frequencies in sentences that contain the name of a candidate (target corpus) with a reference distribution taken from the sample's complete set of documents (reference corpus). The keywords are then presented in order of decreasing significance, computed via a chi-square test with Yates’ correction for a 2 × 2 table.
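
The pre-processing and keyword steps above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the function names (dedupe_sentences, yates_chi_square, rank_keywords), the regex tokenizer, and the choice to score only terms that are over-represented in the target corpus are all assumptions made for illustration. The 2 × 2 table for a term holds its count and the count of all other tokens, in the target and in the reference corpus.

```python
import re
from collections import Counter


def dedupe_sentences(sentences):
    """Keep the first occurrence of each sentence and drop later repeats
    (e.g. a headline mirrored on several pages)."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())  # ignore case and spacing
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' correction for the 2 x 2 table
    [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    # Yates' correction subtracts n/2 from |ad - bc|, clamped at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * diff ** 2 / denom


def tokenize(sentence):
    """Hypothetical tokenizer: lowercase alphabetic runs only."""
    return re.findall(r"[a-z']+", sentence.lower())


def rank_keywords(target_sentences, reference_sentences):
    """Return (term, score) pairs in order of decreasing chi-square score."""
    target = Counter(w for s in target_sentences for w in tokenize(s))
    reference = Counter(w for s in reference_sentences for w in tokenize(s))
    t_total = sum(target.values())
    r_total = sum(reference.values())
    scores = {}
    for word, a in target.items():
        c = reference[word]
        # Assumption: only terms relatively more frequent in the target
        # corpus are treated as keyword candidates.
        if a * r_total <= c * t_total:
            continue
        scores[word] = yates_chi_square(a, t_total - a, c, r_total - c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A caller would pass the sentences that mention a candidate as target_sentences and the de-duplicated sentences of the full sample as reference_sentences. The resulting scores can be compared against a chi-square critical value with one degree of freedom to decide which keywords to report.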