
Group meeting schedule, Fall 2021

Mondays 3-4pm, second floor meeting room in building 90 (SICCS).

Each week, the presenter gives a talk about one of the following:

  1. your research.
  2. somebody else’s paper.
  3. some useful software.
  • Aug 16: Toby, Talk 1: How I organize my research projects. Talk 2: Recent advances in supervised optimal changepoint detection. Abstract: Changepoint detection involves detecting abrupt changes in data sequences measured over space or time. Although there are many classical unsupervised algorithms for this problem, several of my recent research projects have involved proposing new supervised labeling methods and optimization algorithms. In this talk I will give an overview of these recent advances, with a discussion of applications in various problem domains (medicine, genomics, neuroscience).
  • Aug 23: Akhila. Slides, Blog.

Title: RcppDeepState, a simple way to fuzz test Rcpp packages.

Abstract: R packages written in the widely used Rcpp framework are typically tested using expected input/output pairs that are manually coded by package developers. These manually written tests are validated under various CRAN checks, using both static and dynamic analysis. Such manually written tests can let subtle bugs slip through, since they do not anticipate all possible inputs and miss important code paths. Fuzzers pass random, unexpected, potentially invalid inputs to a function, in order to identify bugs missed by manually written tests. This paper presents RcppDeepState, an R package that uses the DeepState framework to provide automatic fuzzing and symbolic execution for R packages written using the Rcpp framework. Using RcppDeepState, a package developer can systematically fuzz test their Rcpp functions, without having to manually write any inputs or expected outputs. Randomly generated inputs are passed to each Rcpp function, and Valgrind is used to check for various memory access violations and memory leaks. In our system, a single test harness C++ code file can be used to fuzz test an Rcpp function using different backend fuzzers including AFL, libFuzzer, and HonggFuzz. For even more flexibility, R package developers can write their own random generation functions and assertions. We implemented random generation functions for 8 of the most common Rcpp data types, then used these functions to fuzz test 1,185 Rcpp packages. Valgrind reported issues for more than 2,000 functions (in nearly 500 packages) that were not detected using standard CRAN checks on manually specified test/example inputs.
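
As a concrete illustration, here is a minimal sketch (my own example, not code from the talk or the package) of the kind of bug that hand-written tests tend to miss but random inputs plus Valgrind catch: an Rcpp function that reads out of bounds when given an empty vector.

#+begin_src R
# This Rcpp function reads past the end of its input when x is empty,
# which a hand-written test with non-empty x never exercises. Running it
# on a randomly generated empty vector under Valgrind flags the invalid read.
library(Rcpp)
cppFunction("
double first_over_last(NumericVector x) {
  // NumericVector::operator[] does no bounds checking, so when
  // x.size() == 0 both reads below access memory outside the vector.
  return x[0] / x[x.size() - 1];
}")
first_over_last(c(2, 4))   # 0.5, as a manual test would confirm
first_over_last(numeric()) # undefined behavior, caught by fuzzing + Valgrind
#+end_src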

Title: Automated mapping of Gulf Sturgeon spawning habitat in Coastal Plain Rivers: Case study in the Pearl and Pascagoula River Systems

Abstract: Gulf sturgeon (Acipenser oxyrinchus desotoi) are a threatened anadromous fish species with spawning populations found in seven Gulf of Mexico coastal plain rivers (east to west: Suwannee, Apalachicola, Choctawhatchee, Yellow, Escambia/Conecuh, Pascagoula, and Pearl/Bogue Chitto rivers). Relatively little is known about spawning habitats in the Pearl and Pascagoula river systems, as only one confirmed spawning location has been identified. Characterizing, locating, and quantifying existing suitable spawning habitat is critical to species recovery and restoration efforts. This research effort will locate and quantify suitable spawning habitat by developing tools and software for processing side scan sonar data, applying (semi-)automated procedures for benthic feature classification, assessing spawning habitat accessibility and use patterns, and synthesizing the data needed to evaluate and prioritize spawning habitat restoration projects.

  • Sept 10: Kyle

https://www.spiedigitallibrary.org/journals/journal-of-applied-remote-sensing/volume-15/issue-3/038503/Machine-learning-based-region-of-interest-detection-in-airborne-lidar/10.1117/1.JRS.15.038503.full

https://www.spie.org/news/machine-learning-speeds-up-lidar-analysis-for-fishery-surveys?SSO=1 (brief news article)

Abstract: Airborne lidar data for fishery surveys often do not contain physics-based features that can be used to identify fish; consequently, the fish must be manually identified, which is a time-consuming process. To reduce the time required to identify fish, supervised machine learning was successfully applied to lidar data from fishery surveys to automate the process of identifying regions with a high probability of containing fish. Using data from Yellowstone Lake and the Gulf of Mexico, multiple experiments were run to simulate real-world scenarios. Although the human cannot be fully removed from the loop, the amount of data that would require manual inspection was reduced by 61.14% and 26.8% in the Yellowstone Lake and Gulf of Mexico datasets, respectively.
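
To make the triage idea concrete, here is a minimal sketch (synthetic data and hypothetical feature names, not the paper's pipeline): train a classifier on labeled lidar segments, then send only the segments predicted likely to contain fish to a human reviewer.

#+begin_src R
# Synthetic stand-in features (intensity_var and depth_spread are made up).
set.seed(1)
segments <- data.frame(
  intensity_var = rnorm(500),  # variability of lidar return intensity
  depth_spread  = rnorm(500))  # vertical spread of returns
# Synthetic ground truth: fish presence driven mostly by intensity_var.
segments$has_fish <- rbinom(500, 1, plogis(2 * segments$intensity_var - 1))

fit <- glm(has_fish ~ intensity_var + depth_spread,
           data = segments, family = binomial)
prob <- predict(fit, type = "response")
# Fraction of segments a human would still need to inspect:
mean(prob > 0.5)
#+end_src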

  • Sept 17: Shadmaan

Slides, Paper

Title: Extraction of Sequence from Bangla Handwritten Numerals and Recognition Using LSTM

Abstract: For handwritten numeral recognition (HNR), fewer explorations have been done on Bangla numerals compared to other languages. Among the existing methods, several Convolutional Neural Network (CNN) based methods outperform the others, but CNNs are often confused by specific Bangla numerals with similar shapes and sizes. The main purpose of this study is to advance Bangla HNR with a novel methodology based on a Long Short-Term Memory (LSTM) network. In the proposed method, images are thinned and a sequence is extracted from each; these extracted sequences are then classified using the LSTM network. Both single-layer LSTM and Deep LSTM models are trained, and their performance is tested on a benchmark dataset.

Experimental outcomes revealed that the proposed LSTM based method outperformed CNN with lower misclassification rates for the similarly shaped numerals. Finally, the proposed method achieved a test set recognition rate of 98.03% which is better than or competitive to other prominent existing methods.
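
The key preprocessing step, turning a thinned digit image into an ordered coordinate sequence, could look like the following minimal sketch (my own illustration of the idea, not the paper's algorithm): start at the top-most stroke pixel and greedily walk to the nearest unvisited stroke pixel.

#+begin_src R
# image_to_sequence: convert a binary (0/1) thinned image matrix into an
# ordered n x 2 matrix of (row, col) stroke coordinates, suitable as the
# time-step inputs of an LSTM.
image_to_sequence <- function(img) {
  pts <- which(img == 1, arr.ind = TRUE)  # coordinates of stroke pixels
  n <- nrow(pts)
  visited <- rep(FALSE, n)
  out <- matrix(0L, n, 2)
  current <- which.min(pts[, "row"])      # start at the top-most pixel
  for (i in seq_len(n)) {
    out[i, ] <- pts[current, ]
    visited[current] <- TRUE
    if (all(visited)) break
    d <- (pts[, 1] - pts[current, 1])^2 + (pts[, 2] - pts[current, 2])^2
    d[visited] <- Inf                     # never revisit a pixel
    current <- which.min(d)
  }
  out
}
# Tiny example: a diagonal stroke.
img <- diag(1, 4, 4)
image_to_sequence(img)  # visits (1,1), (2,2), (3,3), (4,4) in order
#+end_src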

Title: Analyzing Label Data on Contigs Using data.table

Abstract: data.table is an R package (and a function of the same name) that allows the user to read a file into a data table and to create data tables in code. Its functions take various input values that give the user flexibility, and can perform many kinds of analysis and manipulation. For example, data.table can be used to count the number of contigs per chromosome, which allows for comparative analysis of the output data. My current research uses data.table to analyze labels created in PeakLearner, a machine learning system that identifies peaks in a cell sample and allows geneticists to determine where specific protein binding sites are located on DNA. We can then analyze our current algorithm to determine how many labels have been created per contig, which helps us determine the efficacy of our model.

Slides: https://docs.google.com/presentation/d/1z5eN88HSpeMgJ0g4oZF4Z91GS9F5eSav9qxbN83URGY/edit?usp=sharing
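
A minimal sketch of the per-contig label count described above (hypothetical column names; the real PeakLearner label table surely differs):

#+begin_src R
library(data.table)
# Hypothetical label table: one row per label on a contig.
labels <- data.table(
  contig = c("chr1", "chr1", "chr2", "chr2", "chr2"),
  label  = c("peak", "noPeak", "peak", "peak", "noPeak"))
# Count labels per contig with data.table's by= grouping.
labels[, .(n.labels = .N), by = contig]
#+end_src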

  • Oct 15: Toby, Introduction to NAU’s Monsoon supercomputer. Abstract: Supercomputers are useful for machine learning experiments where there are lots of different data sets, algorithms, or hyper-parameters to try. The advantage of a supercomputer is that each combination can be computed in parallel on a different CPU/GPU, and since the cluster has 100s-1000s of cores, you can get results 100s-1000s of times faster. In this tutorial I will explain how to set up, launch, and interpret the results from such experiments. Topics: R batchtools on Monsoon, X forwarding on Windows, Monsoon OnDemand for RStudio in a web browser.
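
A minimal sketch of the batchtools pattern mentioned above (my own illustration, with a made-up experiment function; on Monsoon you would also point batchtools at a SLURM cluster configuration):

#+begin_src R
library(batchtools)
reg <- makeRegistry(file.dir = "registry", seed = 1)

# Hypothetical experiment: one job per (data set, algorithm) combination.
run.one <- function(data.name, algo) {
  # really: load data.name, fit algo, compute test accuracy.
  data.frame(data.name, algo, accuracy = runif(1))
}
grid <- expand.grid(
  data.name = c("dataA", "dataB"),
  algo = c("linear", "xgboost"),
  stringsAsFactors = FALSE)

batchMap(run.one, data.name = grid$data.name, algo = grid$algo, reg = reg)
submitJobs(reg = reg)   # each combination becomes its own cluster job
waitForJobs(reg = reg)
results <- reduceResultsList(reg = reg)
#+end_src
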
  • Oct 22: Kyle - Ethical discussion of UMN getting banned from the Linux kernel

https://docs.google.com/presentation/d/1Gky0qKWYgmbzTs3CUuMAv9gI7dnMzNenfalyZj8Yejw/edit?usp=sharing

  • Oct 29: Cam, Title: Computing accurate depth estimates from side scan sonar data with machine learning: two case studies.

Abstract: Side scan sonar is a technology used to image physical conditions in aquatic systems (lakes, rivers, oceans, estuaries, etc.) which can then be georectified to spatially locate features (rock outcrops, ship wrecks, aquatic plants, etc.) in a Geographic Information System (GIS). Accurate depth estimates from the sonar instrument to the aquatic bed are needed to remove the water column from the sonar image for accurate georectification. The sonar system estimates the depth from a down-facing sonar, but these estimates are often inaccurate due to objects suspended in the water column, interference from the survey vessel, or high water turbidity. This presentation will provide a primer on side scan sonar image characteristics, factors affecting depth estimates, the importance of accurate estimates, and two approaches for automatically estimating depth using machine learning from two recent publications (1, 2 below).
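
As background for the two papers, here is a minimal sketch (my own illustration, not from the talk) of the naive bottom-tracking rule that the learned methods improve on: for each ping, take the first sample whose return intensity exceeds a threshold.

#+begin_src R
# naive.bottom.track: return the sample index of the first strong return in
# one sonar ping; the index is proportional to the depth estimate.
naive.bottom.track <- function(ping, threshold, skip = 10) {
  # Skip the earliest samples to ignore transducer ring-down noise.
  idx <- which(ping[-seq_len(skip)] > threshold)
  if (length(idx) == 0) return(NA_integer_)
  idx[1] + skip
}
ping <- c(rep(0.1, 40), 0.9, 0.8, rep(0.2, 10))  # synthetic intensity ping
naive.bottom.track(ping, threshold = 0.5)        # 41: the first strong return
#+end_src

Objects suspended in the water column produce exactly the false strong returns that break this rule, which is why the two papers below replace it with learned segmentation.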

1: Yan, J., Meng, J., & Zhao, J. (2021). Bottom detection from backscatter data of conventional side scan sonars through 1D-UNet. Remote Sensing, 13(5). https://doi.org/10.3390/rs13051024

2: Zheng, G., Zhang, H., Li, Y., & Zhao, J. (2021). A Universal Automatic Bottom Tracking Method of Side Scan Sonar Data Based on Semantic Segmentation. Remote Sensing, 13(10), 1945. https://doi.org/10.3390/rs13101945

  • Nov 5: Toby, Title: Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection. Abstract: Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines. Pre-print https://arxiv.org/abs/2107.01285, slides https://github.com/tdhock/max-generalized-auc/raw/master/HOCKING-slides.pdf
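
A minimal sketch of the AUM for binary classification, as I read the abstract (a sort, then a sum over the resulting sequence of error counts; not the paper's reference implementation):

#+begin_src R
# aum: Area Under Min(FP, FN) for real-valued scores and 0/1 labels.
aum <- function(score, label) {
  ord <- order(score)
  s <- score[ord]
  y <- label[ord]
  n <- length(s)
  # With threshold t between s[i] and s[i+1], predict positive iff score > t:
  fn <- cumsum(y == 1)[-n]                # positives scored below threshold
  fp <- sum(y == 0) - cumsum(y == 0)[-n]  # negatives scored above threshold
  sum(pmin(fp, fn) * diff(s))             # min error, weighted by score gaps
}
aum(score = c(-2, 1, -1, 3), label = c(0, 0, 1, 1))  # 2: scores are mis-ordered
#+end_src
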
  • Nov 12: Shadmaan

Slides

Title: Evaluation of Features Influencing School Growth through Predictive Modeling

Abstract: Predictive modeling is a statistical technique using machine learning and data mining that works by analyzing historical and existing data and generating a model to predict likely future outcomes. In this study, the data consist of various features of schools in different districts of the state of Missouri. The main objective is to analyze the significant features influencing school growth through the application of machine learning models. The research also proposes a comprehensive comparison of the models with baseline models.

This presentation will outline a preliminary research methodology for calculating feature importance values. This approach employs regression models, namely the Lasso regression model, Ordinal Forest, Ordinal Net, and others. The preliminary findings give considerable insight into the factors that influence school growth.
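
A minimal sketch of Lasso-based feature importance (synthetic data, not the study's; in the glmnet package, alpha = 1 selects the Lasso penalty):

#+begin_src R
library(glmnet)
set.seed(1)
# Synthetic stand-in for the school features: only 2 of 5 matter.
X <- matrix(rnorm(200 * 5), 200, 5,
            dimnames = list(NULL, paste0("feature", 1:5)))
growth <- 2 * X[, 1] - 1 * X[, 3] + rnorm(200)
fit <- cv.glmnet(X, growth, alpha = 1)  # cross-validated Lasso
coef(fit, s = "lambda.1se")  # nonzero coefficients = selected features
#+end_src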

  • Nov 19: Brooke

Title: ggplot2 and data visualization

Abstract: ggplot2 is an R package that allows the user to create a visual plot from a given data table. This can be used to create a visual representation of the data, allowing for visual analysis. For example, ggplot2 can be used to visualize errors in model output, such as false peak detections. Minimizing false peak detections improves the accuracy of the model, so this is an important area to look into. Visualizing errors creates an understanding of what is going on at particular error points and thus allows for analysis of those errors.

Link to slides: https://docs.google.com/presentation/d/1FH9njiCun3SQ5krZ4sKkXaJIAmD5ciG7d24VuQW8L2g/edit?usp=sharing
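
A minimal sketch of the kind of plot described above (made-up coverage data; the real PeakLearner plots surely differ):

#+begin_src R
library(ggplot2)
set.seed(1)
# Synthetic genomic coverage with one true peak in the middle.
df <- data.frame(
  position  = 1:100,
  coverage  = c(rpois(40, 2), rpois(20, 20), rpois(40, 2)),
  predicted = rep(c("noPeak", "peak", "noPeak"), c(40, 20, 40)))
# Color by predicted class, so false peak detections stand out visually.
ggplot(df, aes(position, coverage, color = predicted)) +
  geom_line(aes(group = 1)) +
  geom_point()
#+end_src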

  • Nov 26: NO MEETING (day after Thanksgiving)
  • Dec 3: Anirban Chetia, TBD.