Data Wrangling and Modelling with PySpark


Start dates and locations

Utrecht
15 April 2020 to 17 April 2020
Guaranteed to start

15 April 2020, 09:00-16:00, Day 1
16 April 2020, 09:00-16:00, Day 2
17 April 2020, 09:00-16:00, Day 3

Description

Introduction

Apache Spark is an open-source distributed engine for querying and processing data. In this three-day hands-on workshop, you will learn how to leverage Spark from Python to process large amounts of data.

After a presentation of the Spark 2.0 architecture, we’ll begin manipulating Resilient Distributed Datasets (RDDs) and work our way up to Spark DataFrames. The concept of lazy execution is discussed in detail and we demonstrate various transformations and actions specific to RDDs and DataFrames. You’ll learn how DataFrames can be manipulated using SQL queries.
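
A minimal sketch of lazy execution in practice (the app name is a placeholder):

    from pyspark.sql import SparkSession

    # Start a local Spark session (the entry point since Spark 2.0).
    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    sc = spark.sparkContext

    # Transformations are lazy: these lines only build an execution plan.
    rdd = sc.parallelize(range(1, 1000000))
    evens = rdd.filter(lambda x: x % 2 == 0)   # transformation
    squares = evens.map(lambda x: x * x)       # transformation

    # An action triggers the actual computation.
    print(squares.take(5))                     # [4, 16, 36, 64, 100]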

We’ll show you how to apply supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also see unsupervised machine learning models such as PCA and K-means clustering.
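
For instance, K-means clustering follows the same fit/transform pattern as the supervised models; a sketch, assuming a DataFrame df with numeric columns x and y:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # Spark ML estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
    features = assembler.transform(df)  # df is assumed to exist

    # Fit K-means with three clusters and assign a cluster id to each row.
    model = KMeans(k=3, seed=42).fit(features)
    model.transform(features).select("x", "y", "prediction").show(5)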

By the end of this workshop, you will have a solid understanding of how to process data using PySpark and you will understand how to use Spark’s machine learning library to build and train various machine learning models.

What you’ll learn

  • Understand Apache Spark, the Spark 2.0 architecture, and its components
  • Work with RDDs and lazy evaluation
  • Build and interact with Spark DataFrames using Spark SQL
  • Use Spark SQL and DataFrames to process data using traditional SQL queries
  • Apply a spectrum of supervised and unsupervised machine learning algorithms
  • Handle issues related to feature engineering, class imbalance, bias and variance, and cross-validation to build a well-fitting model

This workshop is for you because

  • You work with data regularly and want to be able to scale up the quantity of data processed.
  • You want to understand the methods specific to Spark for wrangling data.
  • You want to learn how to apply machine learning algorithms to large amounts of data.

Schedule

Day 1:

  • Introduction to Apache Spark
    • Setting up Spark
    • Spark fundamentals
    • Spark 2.0 Architecture
  • Resilient Distributed Datasets (RDDs)
    • Getting data into Spark
    • Actions
    • Transformations
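
To illustrate the Day 1 topics above, a sketch of getting data into Spark and chaining transformations and actions (the file path is hypothetical; sc is the SparkContext from the snippet in the introduction):

    # Load a text file as an RDD of lines (placeholder path).
    lines = sc.textFile("data/logs.txt")
    errors = lines.filter(lambda l: "ERROR" in l)  # transformation: lazy
    print(errors.count())                          # action: runs the job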

Day 2:

  • DataFrames
    • Speeding up PySpark with DataFrames
    • Creating DataFrames
    • Interoperating with RDDs
    • Querying with the DataFrame API
  • Querying DataFrames with SQL
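
A flavor of the Day 2 material, reusing the spark session from the introduction (the data and column names are illustrative):

    # Create a DataFrame directly from Python objects; interoperating
    # with RDDs works similarly via spark.createDataFrame(rdd, schema).
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)],
        ["name", "age"],
    )

    # Querying with the DataFrame API ...
    df.filter(df.age > 30).select("name").show()

    # ... and the equivalent traditional SQL query.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()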

Day 3:

  • ML and MLlib packages
    • API Overview
    • Pipelines
    • Transformers
    • Estimators
  • Applying Machine Learning
    • Validation
    • Classification
    • Regression
    • Recommender system
  • Where to go from here
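
To give a flavor of the Pipeline, Transformer, and Estimator pattern covered on Day 3, a sketch assuming a DataFrame train_df with feature columns x1 and x2 and a label column:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # A transformer (VectorAssembler) followed by an estimator
    # (LogisticRegression), chained into a single pipeline.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # Fitting the pipeline fits each stage in turn; the fitted model
    # transforms data into predictions.
    model = pipeline.fit(train_df)
    model.transform(train_df).select("label", "prediction").show(5)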

Prerequisites

Participants are expected to be familiar with the following Python syntax and concepts:

  • assignment, arithmetic, boolean expression, tuple unpacking
  • bool, int, float, list, tuple, dict, str, type casting
  • in operator, indexing, slicing
  • if, elif, else, for, while
  • range(), len(), zip()
  • def, (keyword) arguments, default values
  • import, import as, from ... import
  • lambda functions, list comprehension
  • JupyterLab or Jupyter Notebook

Some experience with Pandas and SQL is useful, but not required.
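
As a quick self-check, participants should be able to read a short snippet like the following without trouble (illustrative only):

    # def with a default value, zip(), tuple unpacking, a list
    # comprehension, and a lambda, all in a few lines.
    def label_scores(names, scores, passing=5.5):
        return [(name, score >= passing) for name, score in zip(names, scores)]

    results = label_scores(["ann", "bob"], [7.2, 4.9])
    passed = list(filter(lambda pair: pair[1], results))
    print(passed)  # [('ann', True)]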

Recommended preparation

Participants are kindly requested to have the following items installed prior to the start of the workshop:

  • Docker Desktop for Windows or Mac, or Docker for Ubuntu
  • The Docker image, by running: docker pull jupyter/pyspark-notebook

More detailed installation instructions will be provided by email after signup.
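
For reference, once the image is pulled, starting the notebook environment locally typically looks like this (the emailed instructions take precedence):

    docker run -p 8888:8888 jupyter/pyspark-notebook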

Clients

We’ve previously delivered this workshop at:

  • KPN ICT Consulting
  • ProRail
  • Textkernel

Testimonials

“Our DataLab team enjoyed a three-day PySpark course from Jeroen. Jeroen’s approach is personal and professional. I recommend Data Science Workshops to anyone in the field of data science.”

–Laurens Koppenol, Lead Data Scientist, ProRail
