MSc Dissertation: "A Novel Stemming Algorithm for Albanian in a Data Mining Approach for Document Classification", Jetmir Sadiku

Abstract.

This dissertation deals with the design and implementation of a stemming algorithm for the Albanian language and its subsequent use in classifying a corpus of documents. The work is based on research into stemming algorithms for other languages and on the morphology of Albanian. Text mining is a knowledge-intensive technique for interacting with a collection of documents by means of a set of analysis tools. Data mining, which includes text mining when the data are text, is becoming an increasingly useful process for extracting information from stored data. It is applied most widely in fields such as medicine, banking, finance, marketing and spam filtering.

A stemming algorithm is a procedure that removes suffixes from words, leaving the root (stem) of each word. For example, the words player and playing share the root play. Stemming is also used in search engines to merge words that share a stem, which reduces the number of index entries. As a result, the database that stores the indexed material is smaller and search time is shorter. A stemmer for Albanian is needed by institutions such as universities and government bodies, for example to filter e-mail communication written in Albanian. This dissertation presents a first set of rules for Albanian for use in a stemming algorithm and, for the first time, a list of Albanian stopwords. It also discusses related aspects of document pre-processing, drawing mostly on computational linguistics.
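
The idea of suffix removal and its effect on index size can be illustrated with a minimal sketch. It is not the algorithm developed in the dissertation: the suffix list, the stopword set, the minimum stem length and the function names stem and build_index are all hypothetical, and the rules are English ones chosen only to reproduce the player/playing example above. The Albanian rules and stopword list are given in the full thesis.

```python
from collections import defaultdict

# Hypothetical English suffixes, for illustration only.
# These are NOT the Albanian rules from the dissertation.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

# Hypothetical stopword set (assumption, not the thesis list).
STOPWORDS = {"the", "is", "a", "and"}

# Assumed lower bound so that very short words are left untouched.
MIN_STEM_LEN = 3


def stem(word):
    """Strip the longest matching suffix, if the remainder is long enough."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOPWORDS:
                continue  # stopwords are not indexed
            index[stem(token)].add(doc_id)
    return index


docs = {1: "the player is playing", 2: "a player plays"}
index = build_index(docs)
print(stem("player"), stem("playing"))  # both reduce to the same stem
print(dict(index))
```

In this toy corpus the three distinct content words player, playing and plays collapse into a single index entry, which is the reduction in index size described above. A real stemmer needs language-specific rules and safeguards against over-stemming, which this sketch does not attempt.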

For the full version of the thesis contact Jetmir Sadiku at jetmirsadiku@unyt.edu.al.