Text mining with enhanced named entity recognition, 2017

 Item — Call Number: MU Thesis Mal
Identifier: b7716610

Scope and Contents

From the Collection:

The collection consists of theses written by students enrolled in the Monmouth University graduate Computer Science program. The holdings are primarily bound print documents that were submitted in partial fulfillment of requirements for the Master of Science degree.

Dates

  • Creation: 2017

Creator

Conditions Governing Access

The collection is open for research use. Access is by appointment only.

Access to the collection is confined to the Monmouth University Library and is subject to patron policies approved by the Monmouth University Library.

Collection holdings may not be borrowed through interlibrary loan.

Research appointments are scheduled by the Monmouth University Library Archives Collections Manager (732-923-4526). A minimum of three days' advance notice is required to arrange a research appointment for access to the collection.

Patrons must complete a Researcher Registration Form and provide appropriate identification to gain access to the collection holdings. Copies of these documents will be kept on file at the Monmouth University Library.

Extent

1 Item (print book) : 58 pages ; 8.5 x 11.0 inches (28 cm).

Language of Materials

English

Abstract

The goal of this research is to evaluate the usefulness of enhanced named entity recognition for text mining. Text mining is a subfield of data mining: it applies data mining techniques to texts written in a natural language such as English. The data for this project consists of news articles and article titles extracted from the Web. Enhanced named entity recognition is used to add information to the text. Named entities in the text are recognized using the Stanford NER tagger, which identifies phrases that refer to persons, locations, or organizations and tags them with labels indicating the category to which they belong. The recognized strings are then used to fetch information from selected DBpedia archive data files. DBpedia is a semantically organized database that is automatically created from the structured content of Wikipedia. Text classification techniques are applied to news article text and article titles, both with and without the data from enhanced named entity recognition. The classification techniques used in this research are Naïve Bayes, Decision tree and Random forest. The performance of each technique is evaluated to measure the improvement in text classification obtained with enhanced named entity recognition.

Keywords: Text mining, Classification, Named entity recognition, data mining, Stanford NER tagger, DBpedia, Wikipedia, Naïve Bayes, Decision tree, Random forest, Text classification.
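
The comparison described in the abstract (classifying the same text with and without entity-derived information) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The `ENTITY_CATEGORIES` table is a hypothetical stand-in for the Stanford NER tagger and the DBpedia lookup, the titles and labels are invented toy data, and only a small hand-written Naïve Bayes classifier is shown; the Decision tree and Random forest classifiers are omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the NER + DBpedia step: maps an entity string to
# category tokens. These entries are assumptions made for illustration only.
ENTITY_CATEGORIES = {
    "Serena Williams": ["PERSON", "TennisPlayer"],
    "Naomi Osaka": ["PERSON", "TennisPlayer"],
    "Federal Reserve": ["ORGANIZATION", "CentralBank"],
    "Goldman Sachs": ["ORGANIZATION", "Bank"],
}


def enrich(text, lookup=ENTITY_CATEGORIES):
    """Append the category tokens of every known entity found in the text."""
    extra = [label for entity, labels in lookup.items()
             if entity in text for label in labels]
    return text + " " + " ".join(extra) if extra else text


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the class
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch).
titles = [
    "Serena Williams wins final",
    "Federal Reserve raises rates",
    "Goldman Sachs reports profit",
]
labels = ["sports", "business", "business"]

plain_model = NaiveBayes().fit(titles, labels)
enriched_model = NaiveBayes().fit([enrich(t) for t in titles], labels)

new_title = "Naomi Osaka reaches semifinal"
plain_prediction = plain_model.predict(new_title)
enriched_prediction = enriched_model.predict(enrich(new_title))
```

In this toy setup the new title shares no words with the training titles, so the plain model can only fall back on the class prior, while the enriched model can use the category tokens that the new title shares with a training title. A real evaluation would use held-out data and accuracy measures for each classifier.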

Partial Contents

Abstract -- Acknowledgements -- 1. Introduction -- 2. News extraction Python -- 3. Named-entity recognizer (NER) -- 4. Categorization with DBpedia -- 5. Data preprocessing -- 6. Text classification -- 7. Classifying articles and titles -- Summary -- Bibliography -- Appendix.

Source

Repository Details

Part of the Monmouth University Library Archives Repository

Contact:
Monmouth University Library
400 Cedar Avenue
West Long Branch New Jersey 07764 United States
732-923-4526