Investigation of Big Data Clustering in Commercial Systems

Student Name: Lanchevych Ruslan Orestovych
Qualification Level: Master
Speciality: System Administration of Telecommunications Networks
Institute: Institute of Telecommunications, Radioelectronics and Electronic Engineering
Mode of Study: part-time
Academic Year: 2022-2023
Language of Defence: Ukrainian
Abstract: The amount of data to be processed grows continuously. Replacing outdated technologies with newer, more intelligent ones makes it possible to offer users new services. The phone, once only a means of communication, is now a full-fledged personal computer that holds a great deal of private information about its owner. Cloud technologies make it possible to store and process data on remote resources, use rented services, and share results with others. Social networks have also long been an important part of everyday life. Not only individuals but entire enterprises publish data about their activities and thereby interact with their audience. Information from different sources can be of different types. A book's rating on a website is an example of numerical data. Product feedback or an email is text data. A post on a social network may be an image that nevertheless carries information [1,2]. Ensuring the quality of user service in modern information systems is an important task, and considerable resources are spent on it. Customers of applications and web services will not wait long for their requests to be fulfilled, so slow data processing reduces interest in the product. High competition in the application market requires the most modern technologies to improve product functionality. The digitization of various spheres of life has also led to the creation of information systems that perform tasks previously performed by humans [3]. Machine learning uses data sets from which it learns how to solve particular problems. Training sequences are fed to the input of the algorithm, and the machine must use them to produce the correct result at the output. The ability of a machine learning algorithm to correct itself improves the accuracy of information processing.
Training can be supervised, when the data expected at the model output is specified in advance, or more independent (unsupervised). In the latter case, the system itself tries to determine the optimal parameters at which tasks are best performed. The selection of the most appropriate values for model training is often iterative: the data from the system output is fed back to the input to correct the results of the previous iteration. Machine learning allows information systems to process large data sets faster and more reliably, find patterns, and communicate with users. The variety of services provided by modern computing systems results in significant amounts of information that must be processed [4,5]. The difficulty of big data processing lies in the diversity and unstructured nature of the data. Even for relatively small arrays of information, it is hard to present them in a form that the end user can easily perceive. Various methods and tools of data optimization are used for this purpose: the most important data are identified, and the information is grouped for more convenient further processing. Clustering divides data into groups according to class membership, based on the features that matter for the task at hand. Arrays of information coming from different users should be divided into groups containing similar elements. For example, social media posts are a different type of data than product sales statistics. Once all types of data have been singled out, the most suitable processing method can be applied to each [6]. Timely selection of tools for a specific set of information reduces the computing resources spent. Cluster analysis improves the efficiency of processing user data and enables simple, reliable machine learning on the prepared information.
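As an illustration of dividing data into groups of similar elements, a minimal k-means sketch in Python is given here. It is not taken from the work itself: the function name `kmeans`, the use of the first k points as initial centroids, the Euclidean distance, and the sample data are all assumptions made for the example.

```python
import math


def kmeans(points, k, max_iter=50):
    """Plain Lloyd's k-means over a list of equal-length numeric tuples."""
    # Assumption: the first k points are used as initial centroids.
    # This is deterministic but not robust; real systems use better seeding.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical sample: two visually separate groups of 2-D points.
sample = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(sample, k=2)
```

Each pass feeds the previous result back into the next assignment, which is one simple instance of the iterative correction described in the abstract.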
Defining groups of data by subject improves the accuracy of further analysis and allows their specific features to be taken into account [6,7]. Dividing information into subsets reveals its diversity and makes it possible to reduce the dimensionality of the data. Thus, the clustering of big data in information systems is a highly relevant topic today. Study object: clustering of big data. Scope of research: methods and means of data clustering. Goal of research: to study methods and means of clustering large volumes of information in order to improve the efficiency of information and communication systems. Existing methods of data clustering were studied in this work. The peculiarities of operating information systems with large arrays of information were determined. The modified KNN-BLOCK DBSCAN algorithm was considered, and its effectiveness for clustering big data was demonstrated.
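For orientation, density-based clustering of the DBSCAN family can be sketched as follows. This is a baseline sketch of plain DBSCAN with a brute-force neighbor search, not the KNN-BLOCK modification considered in the work. The names `dbscan`, `eps`, `min_pts`, and the sample values are assumptions chosen for illustration.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force search: O(n) per query, so O(n^2) overall.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels


# Hypothetical sample: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
```

The quadratic cost of the neighbor search in this baseline is the kind of bottleneck that makes accelerated variants attractive for large data sets.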