Data Collection

To acquire content for various knowledge archives, we use Scrapy for crawling Web sites and the Extensible Web Retrieval Toolkit (eWRT) for capturing social media sources. Sites are typically crawled no more than once a week to minimize the resulting load on Web servers. Currently, the majority of data is gathered for the research projects uComp (Embedded Human Computation for Knowledge Extraction and Evaluation) and DIVINE (Dynamic Integration and Visualization of Information from Multiple Evidence Sources), as well as for the Media Watch on Climate Change, an information exploration system for analyzing news and social media coverage of climate change and related environmental issues.

The data collection process is configured to respect all settings in robots.txt, a text file that site administrators place in the top-level directory of a Web server to restrict crawler access to files and directories. If you are a site administrator and have questions regarding this data collection policy, please contact us.
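For site administrators who want to see how robots.txt rules are typically interpreted by a crawler, the following is a minimal sketch using Python's standard urllib.robotparser module. It is not the code used by our crawlers; the robots.txt content, the user agent name "example-crawler", the example.org URLs, and the helper names build_parser and is_allowed are all hypothetical and chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# A page outside the disallowed directory may be fetched.
print(is_allowed(parser, "example-crawler", "https://example.org/news/article.html"))

# A page inside /private/ is off limits to all user agents.
print(is_allowed(parser, "example-crawler", "https://example.org/private/report.html"))

# The requested delay (in seconds) between requests, if one is specified.
print(parser.crawl_delay("example-crawler"))
```

In this sketch, a Disallow rule under "User-agent: *" applies to every crawler that has no more specific entry, so adding such a rule for a directory is usually enough to keep compliant crawlers out of it.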