An analysis system for extracting information from textual data using an artificial neural network

Student's Name: Lynnyk Roman Oleksandrovych
Qualification Level: Master
Speciality: Data Science
Institute: Institute of Computer Science and Information Technologies
Mode of Study: full-time
Academic Year: 2022-2023
Language of Defence: Ukrainian
Abstract: Every year, the field of natural language processing gains popularity. One of the foundations of this direction, and of current scientific research, is the extraction of valuable information from text documents. The topic is highly relevant today: the amount of information keeps growing, and there is not always time to process all of it oneself, which is why this field is used in many of the world's leading IT companies. The problem with this abundance of information is that it can be extremely difficult for people to keep up with everything they need to know, especially when they must read a great deal of text just to get the gist of it. Mining information from textual data is one solution: it lets us read what is important without spending too much time on it, saving time and energy and reducing stress [1].

The purpose of the study is to model and develop an analysis system for extracting short, valuable, meaningful information from a large array of text data for a quick understanding of the context of a document. The object of the study is the process of analyzing large data sets, extracting valuable textual data from them, and formulating short theses describing their content. The subject of the research is the methods and principles of information extraction from a large set of textual data. The result of the research is a system that extracts data from a large set of textual data using a recurrent neural network and natural language processing, helping the user understand the content of a document without reading it directly.

The main task of extracting information from a large set of textual data is to help a person process large streams of data without spending much time and effort. When searching for particular data, it is often necessary to reread a document at length, which is quite time-consuming; another common situation is receiving a large file for work that must be processed quickly. This system simplifies work with large data and speeds up the work as a whole. Extracting the main content of a text is the problem of creating a short, accurate, and fluent summary of a large text document. Automatic text summarization methods are essential for dealing with the ever-increasing amount of textual data available on the Internet, both to better help find relevant information and to consume it more quickly. In general, there are two main types of obtaining the content of processed information from text data:
• extraction of the main sentences with the greatest content weight;
• creation of new sentences based on the processed information.

In this master's qualification work, the second type was considered. This technique involves creating completely new phrases that convey the meaning of the input text. The main idea is to put a strong emphasis on form: to create a grammatical summary, which requires advanced language modeling techniques [2]. To build such a neural network, the encoder-decoder (also known as sequence-to-sequence) architecture is used, first presented by researchers at Google in 2014. This model maps an input sequence to an output sequence, where the lengths of the input and the output may differ, and it consists of three main parts: an encoder, an intermediate context vector, and a decoder. The encoder is a stack of recurrent units (LSTM or GRU cells for better performance), each of which takes a single element of the input sequence, collects the information for that element, and propagates it forward. The intermediate vector obtained from the encoder aims to encapsulate the information from all the inputs and helps the decoder make accurate predictions; the decoder then computes the outputs using the hidden state at the current time step together with the corresponding weights [3].
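To make the architecture concrete, below is a minimal sketch of such an encoder-decoder model in PyTorch. It is illustrative only, not the exact model developed in this work: the vocabulary size, embedding and hidden dimensions, single-layer LSTMs, and the use of a teacher-forced summary prefix are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, emb_dim, hid_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

        def forward(self, src):
            # src: (batch, src_len) token ids of the input document
            embedded = self.embedding(src)
            _, (hidden, cell) = self.lstm(embedded)
            # The final states summarize the whole input: the "intermediate vector".
            return hidden, cell

    class Decoder(nn.Module):
        def __init__(self, vocab_size, emb_dim, hid_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, tgt, hidden, cell):
            # tgt: (batch, tgt_len) token ids of the summary generated so far
            embedded = self.embedding(tgt)
            output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
            # Project each hidden state to vocabulary scores for the next token.
            return self.out(output), hidden, cell

    # Usage (dummy data): encode the source document, then decode a summary.
    encoder = Encoder(vocab_size=10000, emb_dim=128, hid_dim=256)
    decoder = Decoder(vocab_size=10000, emb_dim=128, hid_dim=256)
    src = torch.randint(0, 10000, (1, 50))     # a 50-token input document
    hidden, cell = encoder(src)
    tgt = torch.randint(0, 10000, (1, 10))     # teacher-forced summary prefix
    logits, _, _ = decoder(tgt, hidden, cell)  # (1, 10, 10000) next-token scores

At inference time the decoder would instead be run one step at a time, feeding each predicted token back in until an end-of-summary token is produced.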
Keywords - a large set of text data, LSTM, encoder, decoder, text summarization, relevant information.
List of used literary sources:
1. Filippova K., Alfonseca E., Colmenares C.A., Kaiser L., Vinyals O. Sentence compression by deletion with LSTMs, 2015. – 42 p.
2. Rush A.M., Chopra S., Weston J. A neural attention model for abstractive sentence summarization, 2015. – 32 p.
3. Chopra S., Auli M., Rush A.M. Abstractive sentence summarization with attentive recurrent neural networks, 2016. – 53 p.