A service for detecting plagiarism in text

Student's Name: Nadakhovskyi Serhii Serhiiovych
Qualification Level: master
Speciality: Computer Systems and Networks
Institute: Institute of Computer Technologies, Automation and Metrology
Mode of Study: full-time
Academic Year: 2020-2021
Language of Defence: Ukrainian
Abstract: In this master's qualification work, a service for plagiarism checking is designed. The paper analyzes various methods of collecting data to determine the level of plagiarism in the submitted text, gives the basic characteristics and a general description of known principles of website development, describes the specific steps of product development and design, and estimates the approximate cost of the application. The goal is to develop a web application that lets users quickly and easily determine the level of plagiarism in a submitted text. The structural scheme and the block diagram of the algorithm have been developed. A prototype of the future site was also created for visualization. In the course of the master's thesis, an in-depth analysis of various approaches and algorithms was carried out in order to find the most efficient solution.
An important role in the field of information technology is played by the set of methods for processing and analyzing information known as data mining. We first focus on a specific notion of "similarity": the similarity of two sets is calculated by looking at the relative size of their intersection. This notion of similarity is called "Jaccard similarity". We consider some of the applications of finding similar sets. These include finding textually similar documents and collaborative filtering by finding similar customers and similar products. To turn the problem of textual similarity of documents into one of set intersection, we use a technique called "shingling".
Problem solving. From the algorithms considered during the analysis, the Jaccard algorithm was selected. The Jaccard similarity of sets S and T is |S ∩ T| / |S ∪ T|, i.e. the ratio of the size of the intersection of S and T to the size of their union. We denote the Jaccard similarity of S and T by SIM(S, T).
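The set similarity defined above (usually spelled "Jaccard similarity"), that is, the ratio of the size of the intersection of S and T to the size of their union, can be illustrated with a minimal Python sketch. The function name jaccard_similarity and the convention that two empty sets have similarity 1.0 are assumptions made for this illustration and are not taken from the thesis.

```python
def jaccard_similarity(s: set, t: set) -> float:
    """Return SIM(S, T) = |S ∩ T| / |S ∪ T|."""
    union = s | t
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(s & t) / len(union)


# Example: the intersection {c, d} has 2 elements, the union has 5.
s = {"a", "b", "c", "d"}
t = {"c", "d", "e"}
print(jaccard_similarity(s, t))  # prints 0.4
```

The value always lies between 0 (disjoint sets) and 1 (identical sets), so it can be reported to a user directly as a percentage.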
An important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus, such as the Web or a collection of news articles. It is important to understand that the aspect of similarity considered here is similarity at the character level, not "similar meaning", which requires examining the words in the documents and how they are used. That problem is also interesting, but it is addressed by other methods, which we only hinted at in this section. However, textual similarity also has important uses. Many of them involve finding duplicates or near duplicates. First, note that testing whether two documents are exact duplicates is easy: compare the two documents character by character, and if they ever differ, they are not the same. However, in many applications the documents are not identical, yet they share large portions of their text.
Finding plagiarized documents tests our ability to find textual similarity. The plagiarist may extract only some parts of a document for his own. He may alter a few words and may change the order in which the sentences of the original appear. Yet the resulting document may still contain 50% or more of the original. A simple character-by-character comparison of documents will not detect such plagiarism.
The most effective way to represent documents as sets, for the purpose of identifying lexically similar documents, is to construct from the document the set of short strings that appear within it. If we do so, then documents that share pieces as short as sentences or even phrases will have many common elements in their sets, even if those sentences appear in different orders in the two documents. In this section, we introduce the simplest and most common approach, shingling, as well as an interesting variation.
The object of research is a system for finding similarities in text. The subject of the research is a system for finding plagiarism in text using the Jaccard algorithm and optimizing iterations.
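The idea described above of representing each document as the set of short strings that appear in it (commonly called shingling) and then comparing the sets can be sketched as follows. This is a hedged illustration, not the implementation from the thesis: the function names, the shingle length k = 5, and the whitespace and case normalization are all assumptions.

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set of all k-character substrings (k-shingles) of text."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        # A text shorter than k contributes itself as a single shingle.
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def similarity_percent(doc_a: str, doc_b: str, k: int = 5) -> float:
    """Jaccard similarity of the two shingle sets, as a percentage."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    union = a | b
    if not union:
        # Assumed convention: two empty documents count as identical.
        return 100.0
    return 100.0 * len(a & b) / len(union)


original = "The cat sat on the mat. The dog slept by the door."
reordered = "The dog slept by the door. The cat sat on the mat."
unrelated = "Quarterly revenue grew faster than analysts expected."

print(similarity_percent(original, reordered))  # high: same sentences, new order
print(similarity_percent(original, unrelated))  # low: almost no shared shingles
```

Because shingles are short, swapping the order of sentences changes only the few shingles that span a sentence boundary, which is why the reordered text still scores far higher than the unrelated one.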
The purpose of the study: to create a web application that, using a database and an API, allows the user to determine the percentage of plagiarism in the text under study.
Research results: all types of plagiarism, the algorithms, and the types of systems for determining the level of plagiarism in text were considered and analyzed.
List of used literature sources:
1. Academic Integrity Tutorial / University of Maryland University College. 2015.
2. Bilić-Zulle L., Frković V., Turk T., Ažman J., Petrovečki M. Prevalence of Plagiarism among Medical Students // Croat. Med. J. 2005. No. 46 (1). P. 126-131.
3. Carroll J., Zetterling C.-M. Guiding Students away from Plagiarism / [Stockholm?]: KTH Vetenskap Och Konst Learning Lab, 2009. 84 p.
4. Crews K. D. Copyright Law for Librarians and Educators: Creative Strategies and Practical Solutions / Chicago: ALA, 2006. 141 p.
5. Gilmore B. Plagiarism: A How-Not-to Guide for Students / Portsmouth, NH: Heinemann, 2009. 104 p.
6. Gilmore B. Plagiarism: Why It Happens and How to Prevent It / Portsmouth, NH: Heinemann, 2008. 144 p.