
Now available online at: https://dspace.library.uu.nl/handle/1874/374917
Abstract
It is estimated that the world’s data will increase to roughly 160 billion terabytes by 2025, with most of that data in unstructured form. We have already reached the point where more data is produced than can physically be stored. To ingest all this data and construct valuable knowledge from it, new computational tools and algorithms are needed, especially since manual probing of the data is slow, expensive, and subjective. For unstructured data, such as the text in documents, an active field of research is probabilistic topic modeling. Topic models are techniques that automatically uncover the hidden, or latent, topics present in a collection of documents, and they can infer the topical content of thousands or millions of documents without prior labeling or annotation. This unsupervised nature makes probabilistic topic models a useful tool for applied data scientists who need to interpret and examine large volumes of documents to extract new and valuable knowledge. This dissertation scientifically investigates how to apply and interpret topic models on large collections of documents in an optimal and efficient way. Specifically, it shows how different types of textual data, pre-processing steps, and hyper-parameter settings affect the quality of the derived latent topics. The results presented in this dissertation provide a starting point for researchers who want to apply topic models to scientific publications with scientific rigor.
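As a minimal illustration of the kind of workflow the dissertation studies, the sketch below fits a latent Dirichlet allocation topic model to a handful of toy documents using scikit-learn. The corpus, the pre-processing choices (stop-word removal, bag-of-words counts), and the hyper-parameter values (such as the number of topics) are illustrative assumptions, not the pipeline or settings used in the thesis.

```python
# A minimal sketch: applying a probabilistic topic model (LDA) to a small,
# hypothetical document collection with scikit-learn. All documents and
# parameter values below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "fisheries management and quota policies in the north sea",
    "bayesian inference for latent topic models of text",
    "social aspects of small scale fisheries communities",
    "variational methods for probabilistic topic models",
]

# Pre-processing step: bag-of-words counts with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Hyper-parameters such as the number of topics strongly influence the
# quality of the derived latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

# Inspect the top words per latent topic; no labels or annotations were used.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top_words)}")
```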
The research was funded by the project SAF21, “Social Science Aspects of Fisheries for the 21st Century” (http://saf21.eu). SAF21 is a project financed under the EU Horizon 2020 Marie Skłodowska-Curie (MSC) ITN – ETN program (project 642080).