Commit

GH-16338: Added group by clarification R example for gb.control (#16404)

* ht/base example

* ht/added prediction

* ht/code test & example headers

* ht/fixed rendering
hannah-tillman authored Oct 15, 2024
1 parent 99aafb0 commit d97dca2
Showing 1 changed file with 61 additions and 2 deletions.
63 changes: 61 additions & 2 deletions h2o-docs/src/product/data-munging/groupby.rst
@@ -66,8 +66,10 @@ In addition to the above parameters, any number of the following aggregations ca

Once the aggregation operations are complete, calling the GroupBy object with a new set of aggregations has no effect. To apply a new aggregation, you must create a new GroupBy object. In addition, certain aggregations are defined only for numerical or categorical columns; calling an aggregation on a column of the wrong data type throws an error.

Example
-------
Examples
--------

The following examples in Python and R show how to use ``group_by`` to find the months with the highest cancellation ratio.

.. tabs::
.. code-tab:: python
@@ -212,3 +214,60 @@ Example
4 ALB 3646 49 50
5 AMA 317 4 6
6 ANC 100 0 1

The following R example shows how to group by a variable and pass options, such as how missing values are handled, through ``gb.control``.

.. tabs::

.. code-tab:: r R

# Import H2O-3:
library(h2o)
h2o.init()

# Import the airlines dataset:
airlines.hex <- h2o.importFile("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")

# View quantiles and histograms:
quantile(x = airlines.hex$ArrDelay, na.rm = TRUE)
h2o.hist(airlines.hex$ArrDelay)

# Find the number of flights by airport:
originFlights <- h2o.group_by(data = airlines.hex, by = "Origin", nrow("Origin"), gb.control = list(na.methods = "rm"))
originFlights.R <- as.data.frame(originFlights)

# Find the number of flights per month:
flightsByMonth <- h2o.group_by(data = airlines.hex, by = "Month", nrow("Month"), gb.control = list(na.methods = "rm"))
flightsByMonth.R <- as.data.frame(flightsByMonth)

# Find months with the highest cancellation ratio:
which(colnames(airlines.hex)=="Cancelled")
cancellationsByMonth <- h2o.group_by(data = airlines.hex, by = "Month", sum("Cancelled"), gb.control = list(na.methods = "rm"))
cancellation_rate <- cancellationsByMonth$sum_Cancelled/flightsByMonth$nrow
rates_table <- h2o.cbind(flightsByMonth$Month, cancellation_rate)
rates_table.R <- as.data.frame(rates_table)

# Construct train and test sets with a random 85/15 split:
airlines.split <- h2o.splitFrame(data = airlines.hex, ratios = 0.85)
airlines.train <- airlines.split[[1]]
airlines.test <- airlines.split[[2]]

# Display a summary using table-like functions:
h2o.table(airlines.train$Cancelled)
h2o.table(airlines.test$Cancelled)

# Set the predictor and response variables:
Y <- "IsDepDelayed"
X <- c("Origin", "Dest", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "DepTime", "ArrTime", "Distance")

# Train a binomial GLM on the training set:
airlines.glm <- h2o.glm(training_frame = airlines.train, x = X, y = Y, family = "binomial", alpha = 0.5)

# View the model information (training statistics, performance, important variables):
summary(airlines.glm)

# Predict using the GLM model:
pred <- h2o.predict(object = airlines.glm, newdata = airlines.test)

# Look at the summary of predictions (probability of the YES class;
# H2O names probability columns after the response levels, here NO and YES):
summary(pred$YES)
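
The effect of the ``na.methods`` option can be easier to see on a tiny dataset. The following plain-Python sketch does not use H2O; it only illustrates the three NA-handling modes (``all``, ``rm``, ``ignore``) as they are commonly described for H2O group-by aggregations. The function name ``grouped_mean``, the sample rows, and the exact semantics coded here are assumptions made for illustration, not H2O's implementation.

```python
import math


def grouped_mean(rows, by, col, na="all"):
    """Mean of `col` per `by` group under an assumed NA-handling mode.

    Hypothetical helper for illustration only; missing values are None.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row[col])

    result = {}
    for key, values in groups.items():
        present = [v for v in values if v is not None]
        if na == "all":
            # Missing values take part as-is, so any NA makes the result NA.
            has_na = len(present) < len(values)
            result[key] = math.nan if has_na else sum(present) / len(values)
        elif na == "rm":
            # Rows with a missing value are dropped before aggregating.
            result[key] = sum(present) / len(present) if present else math.nan
        elif na == "ignore":
            # Missing values are skipped in the sum, but the denominator
            # is still the total number of rows in the group.
            result[key] = sum(present) / len(values)
        else:
            raise ValueError(f"unknown NA mode: {na!r}")
    return result


# Made-up sample: month 1 has one missing Cancelled value.
rows = [
    {"Month": 1, "Cancelled": 1},
    {"Month": 1, "Cancelled": None},
    {"Month": 1, "Cancelled": 0},
    {"Month": 2, "Cancelled": 0},
    {"Month": 2, "Cancelled": 1},
]

for mode in ("all", "rm", "ignore"):
    print(mode, grouped_mean(rows, "Month", "Cancelled", na=mode))
```

Under these assumptions, the modes differ only for groups that contain a missing value, which is why the R example above passes ``na.methods = "rm"`` before dividing cancellations by flight counts.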
