Synapse End to End Workshop

https://github.com/davew-msft/synapse

(This repo has branches; master is currently set to fb.)

  • Synapse Performance Notes
  • Dockerized Spark Containers

What Technologies are Covered

  • Synapse workspaces
  • Synapse Spark

Target audience

  • Data Engineers/DBAs/Data Professionals
  • Data Scientists
  • App Developers

Workshop Agenda & Objectives

This is a tentative schedule; we may have to adjust based on timing and interests.

Day 1

  • Introductions/Objectives/Level-Setting

  • Synapse Navigation

  • Overview and Basic Setup

  • Data Lake organization

    • How is your data lake structured? Connecting to it from a notebook.
  • Data Sandboxing/Data Engineering

    • Querying data with SQL and pySpark, data pipelining principles
  • Basic ETL

    • Extract, Transform, Loading. Delta-formatted tables

Questions for Sean

  1. Will they have serverless SQL pools?
  2. Can they reach Azure Open Datasets and my data lake?
  3. Should we use their data?

Day 2

# Write a DataFrame out as Parquet and CSV, then read the Parquet back.
# 'something' and 'df' are placeholders for your own lake folder and DataFrame.
whoami = 'davew'
parquet_path = something + '/something.parquet'
csv_path = something + '/something.csv'

df.write.parquet(parquet_path, mode='overwrite')
df.write.csv(csv_path, mode='overwrite', header='true')

df_parquet = spark.read.parquet(parquet_path)
df_parquet.show(10)

  • Delta Lake

    • what is it and how is it materialized in the lake?
    • what is a managed vs an unmanaged table.
  • Using requirements.txt and shared libraries

    • %run wasn't working for them (needs investigation)
    • Excel
  • Variables and notebook pipelining

  • Lab 020: Shared Metadata

  • Streaming Data

    • Real-time streaming data pipelines with Spark Structured Streaming: take a batch process and make it stream. We start with the data already in Bronze.
  • Streaming data from Kafka/Event Hubs

    • Streaming data using Event Hubs and Kafka.

Day 3

Other Possible Topics

  • Continuation of topics from Day 1 & 2
  • SHIR and Synapse pipelines to reach back on-prem
  • Orchestrating and Administering Jobs
    • Streaming and Batch orchestration with Jupyter Notebooks and ADF
    • What is the overhead of doing that (pipelining)?
  • Data Science with Spark (YES) *
  • Performance Tuning and administration (they are their own DBAs)
    • Hyperspace
    • What are the common performance issues we see and what are the patterns to fix them?
  • Business Scenarios/Problem Solving
    • Tweet Analysis?
    • Cognitive Mistakes?
    • Social Media Analytics
    • Data Quality (automation and pipelining): do we have a set of notebooks for unit testing?

ML/AI in Synapse

Monitoring

Wrap Up

You should probably delete the resource group we created today to control costs.

If you'd rather keep the resources for future reference, you can simply PAUSE the dedicated SQL pool and the charges should be minimal.

Other Notes
