Commit

Updates Spark Connect cached output and html
edgararuiz committed Jun 22, 2024
1 parent c7c4d03 commit f560774
Showing 4 changed files with 24 additions and 24 deletions.
4 changes: 2 additions & 2 deletions _freeze/deployment/spark-connect/execute-results/html.json
@@ -1,8 +1,8 @@
{
"hash": "ae13769542a54f9278cf001fa08660e4",
"hash": "366df3c9b915290e8e5a6a55ce8634d6",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Spark Connect\"\nformat:\n html:\n theme: default\n toc: true\nexecute:\n eval: true\n freeze: true\neditor: \n markdown: \n wrap: 72\n---\n\n\n\n\n*Last updated: Wed Jun 19 10:17:24 2024*\n\n## Intro\n\n[Spark\nConnect](https://spark.apache.org/docs/latest/spark-connect-overview.html)\nintroduced a decoupled client-server architecture that allows remote\nconnectivity to Spark clusters using the DataFrame API. The **separation\nbetween client and server allows Spark to be leveraged from everywhere**,\nand this would allow R users to interact with a cluster from the comfort\nof their preferred environment, laptop or otherwise.\n\n## The Solution\n\nThe API is very different than the \"legacy\" Spark and using the Spark\nshell is no longer an option. We have decided to use Python as the new\ninterface. In turn, Python uses *gRPC* to interact with Spark.\n\nWe are using `reticulate` to interact with the Python API. `sparklyr` extends \nthe functionality, and user experience, by providing the `dplyr`back-end, `DBI` \nback-end, RStudio's Connection pane integration.\n\n::: {#fig-connect}\n\n```{mermaid}\n%%| fig-width: 10\n%%| eval: true\nflowchart LR\n subgraph lp[test]\n subgraph r[R]\n sr[sparklyr]\n rt[reticulate]\n end\n subgraph ps[Python]\n dc[Databricks Connect]\n g1[gRPC]\n end\n end \n subgraph db[Databricks]\n sp[Spark] \n end\n sr <--> rt\n rt <--> dc\n g1 <-- Internet<br>Connection --> sp\n dc <--> g1\n \n style r fill:#fff,stroke:#666,color:#000\n style sr fill:#fff,stroke:#666,color:#000\n style rt fill:#fff,stroke:#666,color:#000\n style ps fill:#fff,stroke:#666,color:#000\n style lp fill:#fff,stroke:#666,color:#fff\n style db fill:#fff,stroke:#666,color:#000\n style sp fill:#fff,stroke:#666,color:#000\n style g1 fill:#fff,stroke:#666,color:#000\n style dc fill:#fff,stroke:#666,color:#000\n```\n\n\nHow `sparklyr` communicates with Databricks Connect\n:::\n\n\n## Package Installation\n\nTo access Databricks Connect, you will need the following two packages:\n\n- `sparklyr` - 1.8.4\n- `pysparklyr` - 0.1.3\n\n``` r\ninstall.packages(\"sparklyr\")\ninstall.packages(\"pysparklyr\")\n```\n\n## Initial setup\n\n`sparklyr` will need specific Python libraries in order to connect, and interact\nwith Spark Connect. We provide a convenience function that will automatically\ndo the following:\n\n- Create, or re-create, a Python environment. Based on your OS, it will choose\nto create a Virtual Environment, or Conda. \n\n- Install the needed Python libraries\n\nTo install the latest versions of all the libraries, use:\n\n```r\npysparklyr::install_pyspark()\n```\n\n`sparklyr` will query PyPi.org to get the latest version of PySpark\nand installs that version. It is recommended that the version of the\nPySpark library matches the Spark version of your cluster. \nTo do this, pass the Spark version in the `version` argument, for example:\n\n```r\npysparklyr::install_pyspark(\"3.5\")\n```\n\nWe have seen Spark sessions crash, when the version of PySpark and the version\nof Spark do not match. Specially, when using a newer version of PySpark is used\nagainst an older version of Spark. If you are having issues with your connection, \ndefinitely consider running the `install_pyspark()` to match that cluster's \nspecific Spark version.\n\n## Connecting\n\nTo start a session with a open source Spark cluster, via Spark Connect,\nyou will need to set the `master`, and `method`. The `master` will be an IP,\nand maybe a port that you will need to pass. 
The protocol to use to put\ntogether the proper connection URL is \"sc://\". For `method`, use\n\"spark_connect\". Here is an example:\n\n``` r\nlibrary(sparklyr)\n\nsc <- spark_connect(\n master = \"sc://[Host IP(:Host Port - optional)]\", \n method = \"spark_connect\"\n version = \"[Version that matches your cluster]\"\n )\n```\n\nIf `version` is not passed, then `sparklyr` will automatically choose the \ninstalled Python environment with the highest PySpark version. In a console \nmessage, `sparklyr` will let you know which environment it will use.\n\n## Run locally\n\nIt is possible to run Spark Connect in your machine We provide helper\nfunctions that let you setup, and start/stop the services in locally.\n\nIf you wish to try this out, first install Spark 3.4 or above:\n\n``` r\nspark_install(\"3.5\")\n```\n\nAfter installing, start the Spark Connect using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npysparklyr::spark_connect_service_start(\"3.5\")\n#> Starting Spark Connect locally ...\n#> org.apache.spark.sql.connect.service.SparkConnectServer running as process\n#> 12573. Stop it first.\n```\n:::\n\n\nTo connect to your local Spark Connect, use **localhost** as the address for \n`master`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(\n master = \"sc://localhost\", \n method = \"spark_connect\", \n version = \"3.5\"\n )\n#> ℹ Attempting to load 'r-sparklyr-pyspark-3.5'\n#> ✔ Python environment: 'r-sparklyr-pyspark-3.5' [315ms]\n#> \n```\n:::\n\n\nNow, you are able to interact with your local Spark session:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\ntbl_mtcars <- copy_to(sc, mtcars)\n\ntbl_mtcars %>% \n group_by(am) %>% \n summarise(mpg = mean(mpg, na.rm = TRUE))\n#> # Source: SQL [2 x 2]\n#> # Database: spark_connection\n#> am mpg\n#> <dbl> <dbl>\n#> 1 0 17.1\n#> 2 1 24.4\n```\n:::\n\n\nWhen done, you can disconnect from Spark Connect:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nspark_disconnect(sc)\n```\n:::\n\n\nThe regular version of local Spark would terminate the local cluster\nwhen the you pass `spark_disconnect()`. For Spark Connect, the local\ncluster needs to be stopped independently.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npysparklyr::spark_connect_service_stop()\n#> \n#> ── Stopping Spark Connect\n#> - Shutdown command sent\n```\n:::\n\n\n## Additional setup details\n\nIf you wish to use your own Python environment, then just make sure to\nload it before calling `spark_connect()`. If there is a Python\nenvironment already loaded when you connect to your Spark cluster, then\n`sparklyr` will use that environment instead. If you use your own Python\nenvironment you will need the following libraries installed:\n\n- `pyspark`\n- `pandas`\n- `PyArrow`\n- `grpcio`\n- `google-api-python-client`\n- `grpcio_status`\n- `torch` *(Spark 3.5+)*\n- `torcheval` *(Spark 3.5+)*\n\nML libraries (Optional):\n\n- `torch`\n- `torcheval`\n- `scikit-learn`",
"markdown": "---\ntitle: \"Spark Connect\"\nformat:\n html:\n theme: default\n toc: true\nexecute:\n eval: true\n freeze: true\neditor: \n markdown: \n wrap: 72\n---\n\n\n\n\n*Last updated: Sat Jun 22 18:53:55 2024*\n\n## Intro\n\n[Spark\nConnect](https://spark.apache.org/docs/latest/spark-connect-overview.html)\nintroduced a decoupled client-server architecture that allows remote\nconnectivity to Spark clusters using the DataFrame API. The **separation\nbetween client and server allows Spark to be leveraged from everywhere**,\nand this would allow R users to interact with a cluster from the comfort\nof their preferred environment, laptop or otherwise.\n\n## The Solution\n\nThe API is very different than \"legacy\" Spark and using the Spark\nshell is no longer an option. We have decided to use Python as the new\ninterface. In turn, Python uses *gRPC* to interact with Spark.\n\nWe are using `reticulate` to interact with the Python API. `sparklyr` extends \nthe functionality, and user experience, by providing the `dplyr`back-end, `DBI` \nback-end, RStudio's Connection pane integration.\n\n::: {#fig-connect}\n\n```{mermaid}\n%%| fig-width: 10\n%%| eval: true\nflowchart LR\n subgraph lp[test]\n subgraph r[R]\n sr[sparklyr]\n rt[reticulate]\n end\n subgraph ps[Python]\n dc[Spark Connect]\n g1[gRPC]\n end\n end \n subgraph db[Compute Cluster]\n sp[Spark] \n end\n sr <--> rt\n rt <--> dc\n g1 <-- Internet<br>Connection --> sp\n dc <--> g1\n \n style r fill:#fff,stroke:#666,color:#000\n style sr fill:#fff,stroke:#666,color:#000\n style rt fill:#fff,stroke:#666,color:#000\n style ps fill:#fff,stroke:#666,color:#000\n style lp fill:#fff,stroke:#666,color:#fff\n style db fill:#fff,stroke:#666,color:#000\n style sp fill:#fff,stroke:#666,color:#000\n style g1 fill:#fff,stroke:#666,color:#000\n style dc fill:#fff,stroke:#666,color:#000\n```\n\n\nHow `sparklyr` communicates with Spark Connect\n:::\n\n\n## Package Installation\n\nTo access Spark Connect, you will need the following two packages:\n\n- `sparklyr` - 1.8.4\n- `pysparklyr` - 0.1.3\n\n``` r\ninstall.packages(\"sparklyr\")\ninstall.packages(\"pysparklyr\")\n```\n\n## Initial setup\n\n`sparklyr` will need specific Python libraries in order to connect, and interact\nwith Spark Connect. We provide a convenience function that will automatically\ndo the following:\n\n- Create, or re-create, a Python environment. Based on your OS, it will choose\nto create a Virtual Environment, or Conda. \n\n- Install the needed Python libraries\n\nTo install the latest versions of all the libraries, use:\n\n```r\npysparklyr::install_pyspark()\n```\n\n`sparklyr` will query PyPi.org to get the latest version of PySpark\nand installs that version. It is recommended that the version of the\nPySpark library matches the Spark version of your cluster. \nTo do this, pass the Spark version in the `version` argument, for example:\n\n```r\npysparklyr::install_pyspark(\"3.5\")\n```\n\nWe have seen Spark sessions crash when the version of PySpark and the version\nof Spark do not match. Specifically when a newer version of PySpark is used\nagainst an older version of Spark. If you are having issues with your\nconnection, consider running `install_pyspark()` to match the cluster's\nspecific Spark version.\n\n## Connecting\n\nTo start a session with an open source Spark cluster, via Spark Connect, you\nwill need to set the `master` and `method` values. The `master` will be an IP\nand maybe a port that you will need to pass. 
The protocol to use to put\ntogether the proper connection URL is \"sc://\". For `method`, use\n\"spark_connect\". Here is an example:\n\n``` r\nlibrary(sparklyr)\n\nsc <- spark_connect(\n master = \"sc://[Host IP(:Host Port - optional)]\", \n method = \"spark_connect\"\n version = \"[Version that matches your cluster]\"\n )\n```\n\nIf `version` is not passed, then `sparklyr` will automatically choose the \ninstalled Python environment with the highest PySpark version. In a console \nmessage, `sparklyr` will let you know which environment it will use.\n\n## Run locally\n\nIt is possible to run Spark Connect in your machin. We provide helper\nfunctions that let you setup and start/stop the services locally.\n\nIf you wish to try this out, first install Spark 3.4 or above:\n\n``` r\nspark_install(\"3.5\")\n```\n\nAfter installing, start Spark Connect using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npysparklyr::spark_connect_service_start(\"3.5\")\n#> Starting Spark Connect locally ...\n#> starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to\n#> /Users/edgar/spark/spark-3.5.1-bin-hadoop3/logs/spark-edgar-org.apache.spark.sql.connect.service.SparkConnectServer-1-edgarruiz-WL57.out\n```\n:::\n\n\nTo connect to your local Spark cluster using SPark Connect, use **localhost**\nas the address for `master`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsc <- spark_connect(\n master = \"sc://localhost\", \n method = \"spark_connect\", \n version = \"3.5\"\n )\n#> ℹ Attempting to load 'r-sparklyr-pyspark-3.5'\n#> ✔ Python environment: 'r-sparklyr-pyspark-3.5' [393ms]\n#> \n```\n:::\n\n\nNow, you are able to interact with your local Spark session:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\ntbl_mtcars <- copy_to(sc, mtcars)\n\ntbl_mtcars %>% \n group_by(am) %>% \n summarise(mpg = mean(mpg, na.rm = TRUE))\n#> # Source: SQL [2 x 2]\n#> # Database: spark_connection\n#> am mpg\n#> <dbl> <dbl>\n#> 1 0 17.1\n#> 2 1 24.4\n```\n:::\n\n\nWhen done, you can disconnect from Spark Connect:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nspark_disconnect(sc)\n```\n:::\n\n\nThe regular version of local Spark would terminate the local cluster\nwhen the you pass `spark_disconnect()`. For Spark Connect, the local\ncluster needs to be stopped independently:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npysparklyr::spark_connect_service_stop()\n#> \n#> ── Stopping Spark Connect\n#> - Shutdown command sent\n```\n:::\n\n\n## Additional setup details\n\nIf you wish to use your own Python environment, then just make sure to\nload it before calling `spark_connect()`. If there is a Python\nenvironment already loaded when you connect to your Spark cluster, then\n`sparklyr` will use that environment instead. If you use your own Python\nenvironment you will need the following libraries installed:\n\n- `pyspark`\n- `pandas`\n- `PyArrow`\n- `grpcio`\n- `google-api-python-client`\n- `grpcio_status`\n- `torch` *(Spark 3.5+)*\n- `torcheval` *(Spark 3.5+)*\n\nML libraries (Optional):\n\n- `torch`\n- `torcheval`\n- `scikit-learn`",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
