Big Data Spark Development & Data Science Using Scala
Apache Spark, an open-source cluster computing system, is growing fast, and so is the demand for data science training built on it. Spark has a growing ecosystem of libraries and frameworks that enable advanced data analytics. Its rapid success is due to its power and ease of use: it is more productive to program and typically runs faster than MapReduce-based big data analytics. Spark provides in-memory, distributed computing, with APIs in Java, Scala, Python, and R. The Spark ecosystem is shown below.
The entire ecosystem is built on top of the core engine. The core enables in-memory computation for speed, and its API supports Java, Scala, Python, and R. Spark Streaming enables processing streams of data in real time.
People are interested in Apache Spark because it puts the power of Hadoop in the hands of developers. An Apache Spark cluster is easier to set up than a Hadoop cluster, it typically runs faster, and it is a lot easier to program. It puts the promise and power of big data and real-time analysis in the hands of the masses.
- Data Scientists
- Data Analysts
- Project Managers
Introduction to Scala
- The importance of Scala
- The concept of REPL (Read Evaluate Print Loop)
- Deep dive into Scala pattern matching, type inference, higher-order functions, currying, traits, application space, and Scala for data analysis.
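Several of the topics above can be previewed in a few lines. The following is a minimal sketch, not course material: the object name `ScalaBasicsSketch` and the helpers `applyTwice`, `add`, `addTen`, and `describe` are hypothetical names chosen for illustration.

```scala
object ScalaBasicsSketch {
  // Type inference: the compiler infers List[Int] without an annotation.
  val numbers = List(1, 2, 3, 4)

  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Curried function: parameters are supplied one list at a time.
  def add(a: Int)(b: Int): Int = a + b

  // Supplying only the first parameter list yields a function Int => Int.
  val addTen: Int => Int = add(10)

  // Pattern matching with a constant, a guard, and a wildcard.
  def describe(n: Int): String = n match {
    case 0          => "zero"
    case x if x < 0 => "negative"
    case _          => "positive"
  }
}
```

In the REPL, `ScalaBasicsSketch.numbers.map(ScalaBasicsSketch.addTen)` applies the partially applied function to every element of the list.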
Executing the Scala code
- Learning about the Scala Interpreter
- Static object timer in Scala
- Testing String equality in Scala
- Implicit classes in Scala
- The concept of currying in Scala
- Various classes in Scala.
The Classes concept in Scala
- Learning about the Classes concept
- Understanding constructor overloading
- The various abstract classes
- The hierarchy types in Scala
- The concept of object equality
- The val and var keywords in Scala.
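As a rough illustration of constructor overloading, `val` versus `var`, and object equality, here is a minimal sketch. The `Account` class and its fields are hypothetical and exist only for this example.

```scala
object ClassesSketch {
  // Primary constructor: `val` fields are read-only, `var` fields can be reassigned.
  class Account(val owner: String, var balance: Double) {

    // Auxiliary constructor (constructor overloading): delegates to the primary one.
    def this(owner: String) = this(owner, 0.0)

    // Object equality: two accounts are equal when owner and balance match.
    override def equals(other: Any): Boolean = other match {
      case that: Account => owner == that.owner && balance == that.balance
      case _             => false
    }

    // Keep hashCode consistent with equals.
    override def hashCode(): Int = (owner, balance).hashCode()
  }
}
```

Reassigning `balance` compiles because it is a `var`; attempting to reassign `owner` would be rejected by the compiler because it is a `val`.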
Case classes and pattern matching
- Understanding sealed traits and the wildcard, constructor, tuple, variable, and constant patterns.
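The pattern kinds listed above can be sketched in a single example. This is an illustrative sketch under assumed names (`Shape`, `Circle`, `Rectangle`, `area`, and `classify` are hypothetical), not the course's own code.

```scala
object PatternKindsSketch {
  // A sealed trait can only be extended in this file, so the compiler
  // can warn when a match over Shape misses a case.
  sealed trait Shape
  final case class Circle(radius: Double) extends Shape
  final case class Rectangle(width: Double, height: Double) extends Shape

  // Constructor patterns that bind the fields to variables.
  def area(shape: Shape): Double = shape match {
    case Circle(r)       => math.Pi * r * r
    case Rectangle(w, h) => w * h
  }

  def classify(value: Any): String = value match {
    case 0         => "zero"                // constant pattern
    case (a, b)    => s"pair of $a and $b"  // tuple pattern with variable patterns
    case Circle(_) => "some circle"         // constructor pattern with a wildcard inside
    case _         => "something else"      // wildcard pattern
  }
}
```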
Concepts of traits with an example
- Understanding traits in Scala
- The advantages of traits
- Linearization of traits
- The Java equivalent
- Avoiding boilerplate code.
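Linearization is easiest to see when the same traits are mixed in using two different orders. The sketch below assumes hypothetical traits `Greeter`, `Loud`, and `Polite`; it only illustrates the idea that `super` resolves to the trait mixed in immediately to the left.

```scala
object TraitLinearizationSketch {
  trait Greeter {
    def greet(name: String): String = s"Hello, $name"
  }

  // Each stackable trait wraps whatever `super` resolves to in the linearization.
  trait Loud extends Greeter {
    override def greet(name: String): String = super.greet(name).toUpperCase
  }
  trait Polite extends Greeter {
    override def greet(name: String): String = super.greet(name) + ", welcome"
  }

  // Polite is rightmost, so it runs first and calls Loud, which calls Greeter.
  object LoudThenPolite extends Greeter with Loud with Polite

  // Loud is rightmost here, so it upper-cases the output of Polite.
  object PoliteThenLoud extends Greeter with Polite with Loud
}
```

With this setup, `LoudThenPolite.greet("Ann")` yields `HELLO, ANN, welcome`, while `PoliteThenLoud.greet("Ann")` yields `HELLO, ANN, WELCOME`.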
Scala Java Interoperability
- Implementation of traits in Scala and Java
- Extending multiple traits.
- Introduction to Scala collections
- Classification of collections
- The difference between Iterator and Iterable in Scala
- Example of list sequence in Scala.
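The difference between `Iterable` and `Iterator` can be sketched as follows. The object and value names are hypothetical; the point illustrated is that an `Iterable` can be traversed repeatedly, while an `Iterator` is a one-shot cursor.

```scala
object IterationSketch {
  // A List is an Iterable: it can be traversed any number of times.
  val words: Iterable[String] = List("spark", "scala", "rdd")
  val startingWithS: Int = words.count(_.startsWith("s"))
  val total: Int = words.size

  // An Iterator is consumed as it is traversed.
  val cursor: Iterator[String] = words.iterator
  val consumed: List[String] = cursor.toList

  // After toList the iterator is exhausted, so nothing is left to read.
  val hasMore: Boolean = cursor.hasNext
}
```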
Mutable collections vs. Immutable collections
- The two types of collections in Scala
- Mutable and Immutable collections
- Understanding lists and arrays in Scala
- The list buffer and array buffer
- Queue in Scala
- Double-ended queues (Deque), stacks, sets, maps, and tuples in Scala.
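The contrast between mutable and immutable collections can be sketched with the standard library alone. The object and value names below are hypothetical and chosen only for illustration.

```scala
import scala.collection.mutable

object CollectionsSketch {
  // Immutable List: prepending returns a new list and leaves `base` untouched.
  val base: List[Int] = List(1, 2, 3)
  val extended: List[Int] = 0 :: base

  // Mutable ListBuffer and ArrayBuffer: elements are appended in place.
  val listBuffer: mutable.ListBuffer[Int] = mutable.ListBuffer(1, 2)
  listBuffer += 3
  val arrayBuffer: mutable.ArrayBuffer[String] = mutable.ArrayBuffer("a")
  arrayBuffer ++= Seq("b", "c")

  // Mutable Queue: first in, first out.
  val queue: mutable.Queue[String] = mutable.Queue("first", "second")
  queue.enqueue("third")
  val served: String = queue.dequeue()

  // Immutable Map: adding an entry returns a new map.
  val ports: Map[String, Int] = Map("http" -> 80)
  val morePorts: Map[String, Int] = ports + ("https" -> 443)

  // A tuple groups a fixed number of values of possibly different types.
  val pair: (String, Int) = ("ssh", 22)
}
```

Immutable collections are the default in Scala; the mutable variants have to be imported explicitly from `scala.collection.mutable`, as done here.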
Use case: the bobsrockets package
- Introduction to Scala packages and imports
- The selective imports
- The Scala test classes
- Introduction to JUnit test class
- JUnit interface via the JUnit 3 suite for Scala test
- The packaging of Scala applications in Directory Structure
- Example of Spark Split and Spark Scala.
Apache Spark (Programming Language on Demand)
Writing Spark Applications using Scala
Spark framework comparing Scala
- Apache Spark in detail and its various features
- Comparing with Hadoop
- The various Spark components
- Combining HDFS with Spark
RDD in Spark using Scala
- The RDD operation in Spark
- The Spark transformations
- Actions, data loading
- Comparing with MapReduce
- Key Value Pair.
Data Frames and Spark SQL using Scala
- The detailed Spark SQL
- The significance of SQL in Spark for working with structured data processing
- Spark SQL JSON support
Working with XML data and Parquet files
- Creating HiveContext
- Writing Data Frame to Hive
- Reading data from JDBC sources
- The importance of Data Frames in Spark
- Creating Data Frames
- Defining a schema manually versus inferring it
- Working with CSV files
- Reading JDBC tables
- Writing a Data Frame to a JDBC table
- The user-defined functions in Spark SQL
- Shared variables and accumulators
- How to query and transform data in Data Frames
- How Data Frame provides the benefits of both Spark RDD and Spark SQL
- Deploying Hive on Spark as the execution engine.
Machine Learning using Spark (MLlib) with Scala
- Different Algorithms
- The concept of an iterative algorithm in Spark
- Analyzing with Spark graph processing
- Introduction to K-Means and machine learning
- Shared variables in Spark, such as broadcast variables
- Learning about accumulators.
Spark Streaming using Scala
- Introduction to Spark streaming
- The architecture of Spark Streaming
- Working with the Spark streaming program
- Processing data using Spark streaming
- Requesting count and DStream
- Multi-batch and sliding window operations
- Working with advanced data sources.
Data Science Training Outcome:
The participant will be familiar with:
- Spark development using RDD
- Spark development using Data Frames
- Spark development using streaming
- Spark development using MLlib
- Spark development using Scala.