Methods and Means for Big Data Processing

Major: Data Science
Code of subject: 7.124.03.E.025
Credits: 5.00
Department: Information Systems and Networks
Lecturer: Doctor of Sciences, Professor Andrey Berko
Semester: 2
Mode of study: full-time
Learning outcomes: • knowledge and understanding of the scientific principles of creating big data resources; • the ability to formulate theoretical and practical solutions for creating and populating big data resources; • the ability to apply knowledge and skills when writing scripts for processing big data resources; • practical application of knowledge in processing big data resources using classification, clustering, predictive analysis, statistical modelling, and forecasting.
Required prior and related subjects: • Technologies of Distributed Systems and Parallel Computing • Intelligent Data Analysis • Methods and Tools for Data and Knowledge Engineering • Technologies for Designing Business Logic Systems
Summary of the subject:
1. The concept of big data. Concept and definition of big data. Properties of big data. Requirements for big data. Specifics of big data. Classification of big data. Structured data. Sources of large structured data. Relational databases in big data environments. Unstructured data. Sources of unstructured data. The role of CMS in managing big data. Managing heterogeneous data. Integration of various types of data into a big data environment.
2. Evolution of big data. Evolution of data management. Stage 1: creating managed data structures. Stage 2: managing websites and content. Stage 3: managing big data. Processing large volumes of data on mainframes. Prerequisites and factors behind the emergence of big data. Formation and development of big data technologies. Subject areas of application of big data. Current state and prospects for the development of big data.
3. Methods of analyzing big data. A/B testing. Classification. Cluster analysis. Crowdsourcing (data collection). Data fusion and integration. Data mining. Determining data consistency. Genetic algorithms. Machine learning. Natural language processing. Network analysis. Optimization. Pattern recognition. Predictive modelling. Regression analysis. Signal processing. Spatial data analysis. Statistics. Simulation. Time series analysis. Analysis of associative links. Analysis of functional links. Analysis of hidden links.
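As a minimal sketch of the MapReduce model that underpins several items in the reading list (Dean and Ghemawat, 2004; Hadoop), the word-count example below shows the three phases in plain Python. The document sample and function names are illustrative assumptions, not course material; in Hadoop the map and reduce functions would run distributed across a cluster rather than in one process.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Hypothetical input documents for illustration.
documents = ["big data needs big tools", "data tools for data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["data"] == 3, counts["big"] == 2
```

The same map/shuffle/reduce structure scales to large clusters because the map calls are independent and each reduce key can be aggregated on a separate node.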
Assessment methods and criteria: • Current control (40%): written reports on laboratory work, abstract, oral questioning • Final control (60%, exam): written and oral components.
Recommended books:
1. White, T. // Hadoop: The Definitive Guide // O'Reilly Media, 2009.
2. Hadoop. Apache Software Foundation // http://hadoop.apache.org/
3. Finley, K. // Steve Ballmer on Microsoft's Big Data Future and More in This Week's Business Intelligence Roundup // ReadWriteWeb, 2011.
4. Chang, F., Dean, J., Ghemawat, S., et al. // Bigtable: A Distributed Storage System for Structured Data // Google Lab, 2006.
5. Sukhoroslov, O. // New Technologies for Distributed Storage and Processing of Large Data Sets // Institute for Systems Analysis, Russian Academy of Sciences, 2008.
6. Dean, J., Ghemawat, S. // MapReduce: Simplified Data Processing on Large Clusters // Google Inc., 2004.
7. Qiu, J. // Cloud Technologies and Their Applications // Indiana University Bloomington, 2010.
8. The Hadoop Distributed File System: Architecture and Design // http://hadoop.apache.org/common/docs/r0.17.2/hdfs_design.html
9. Sozykin, A. // Parallel Programming in Hadoop // http://www.asozykin.ru/courses/hadoop
10. Lämmel, R. // Google's MapReduce Programming Model — Revisited // Microsoft Corp.