Group meeting schedule, Fall 2023
Time: 2-3pm on Wednesdays in Building 90 (SICCS), second-floor meeting room (TBD)
What to present?
- your research.
- somebody else’s paper.
- some useful software.
- Aug 30: Toby.
- Title: Using the NAU Monsoon supercomputer cluster for research.
- Abstract: Monsoon is a supercomputer cluster that provides 4000 CPUs to NAU researchers (including graduate students) free of charge. In this talk I will explain how to use it for computational research, using examples from my NSF POSE project about expanding the open-source ecosystem around R data.table. First, NAU Monsoon was used to implement a daily report that summarizes reverse dependency check issues for R data.table, which simplifies release management since data.table developers no longer have to run these time-consuming checks on their personal computers (compiling nightly R-devel + checking more than 1400 packages). Second, NAU Monsoon was also used to implement asymptotic benchmarks for CSV read/write functions, which allowed identifying and fixing performance bugs in base R (quadratic time reduced to linear time).
- Cross-validation experiments on the NAU Monsoon cluster using Python and R.
- data.table revdep-checks on NAU Monsoon, wiki instructions, analyze results directory, source code.
- asymptotic timings of data.table and similar R functions, reproducible results directory, atime R package (see the sketch below), issues, discussion on the R-devel mailing list.
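The asymptotic benchmarks mentioned above were run with the atime R package. Below is a minimal sketch (not the project's actual benchmark code) of how such a timing comparison can be set up, assuming atime and data.table are installed; the data sizes, column names, and the fwrite-vs-write.csv comparison are illustrative choices only.

```r
# Minimal atime sketch: measure how CSV write time grows with the number of rows N.
library(data.table)
atime.list <- atime::atime(
  N = as.integer(10^seq(2, 6, by = 0.5)),   # row counts to benchmark
  setup = {
    # build a toy table with N rows and a temporary output file
    DT <- data.table(x = rnorm(N), y = sample(letters, N, replace = TRUE))
    csv.file <- tempfile(fileext = ".csv")
  },
  seconds.limit = 1,                          # stop growing N once an expression exceeds 1 second
  fwrite = data.table::fwrite(DT, csv.file),  # data.table's parallel CSV writer
  write.csv = utils::write.csv(DT, csv.file)  # base R CSV writer, for comparison
)
plot(atime.list)  # log-log plot of time vs N; the slope suggests the asymptotic complexity
```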
- Sep 6: Bilal, Title: Discussion about what I have done so far in my research and the way forward. I will discuss my AADT regression model and its results. I will also show results from my object-based classification model for detecting vehicles. TODO: post slides. Generalizing to new subsets blog post
- Sep 13: Doris, Title: My Experience. I will be talking about where I used to work in Ghana, what they do, and how I contributed to the industry. Slides: MySlides
- Sep 20: Richard: predicting agriculture from satellite data.
- Sep 27: Danny, I will talk mainly about new results from my research, titled “Cross-validation for Training and Testing Co-occurrence Network Inference Algorithms”. Link to slides
- Oct 4: Trevor, related papers. Slides: My Slides
- Oct 11: Toby, Google Slides, a brief tutorial about R data.table, in preparation for my tutorial at the LatinR conference on 18 Oct (see below).
https://latin-r.com/cronograma/tutoriales/#using-and-contributing-to-the-data.table-package-for-efficient-big-data-analysis
Description: data.table is one of the most efficient open-source in-memory data manipulation packages available today. First released to CRAN by Matt Dowle in 2006, it continues to grow in popularity, and now over 1400 other CRAN packages depend on data.table. This three-hour tutorial will start with reading data from CSV files, discuss basic and advanced data manipulation topics, and end with a discussion of how you can contribute to data.table. In each part of the tutorial, you will be asked to solve a few exercises to practice each new concept.
Outline:
- Efficiently reading CSV files into R: fread
- Introducing the general form of a data.table query, DT[i, j, by], or for those familiar with SQL, DT[where, select|update, group by] (see the sketch after this outline)
- Subsets and joins - exploring similarities between subsets and joins is key to understanding how data.table works.
- Fast and flexible grouped aggregations and updates
- Quick look at other new and useful features in recent releases, including the CSV file writer (fwrite) and rolling/non-equi joins.
- Using data.table in your own package that you want to put on CRAN.
- Contributing to data.table via issues, pull requests, translations, travel awards, community governance.
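To make the query form concrete, here is a minimal sketch of DT[i, j, by] and a join, using a toy sales table invented for illustration (not taken from the tutorial materials):

```r
library(data.table)
# toy data invented for this sketch
sales <- data.table(
  store = c("A", "A", "B", "B", "B"),
  item  = c("x", "y", "x", "y", "y"),
  price = c(10, 20, 11, 19, 21))
# DT[i, j, by] = DT[where, select|update, group by]:
# rows with price > 10, compute mean price, grouped by store.
sales[price > 10, .(mean.price = mean(price)), by = store]
# Joins reuse the subset syntax: X[Y, on=...] looks up rows of X that match Y.
stores <- data.table(store = c("A", "B"), city = c("Flagstaff", "Phoenix"))
sales[stores, on = "store"]  # each sales row gains the matching city column
```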
- Oct 18: Toby out of town, presenting data.table tutorial at LatinR conference.
- Oct 25: Karl, TBD.
- Nov 1: Bilal, practice exam. I will be defending my Candidacy Exam on Thursday, November 16. During the lab meeting, I will give a practice version of the talk that I intend to present on the exam day.
- Nov 8: Interactive data viz discussion.
- Bilal: prediction accuracy of traffic between regions.
- Toby: Area Under the ROC curve for change-point detection.
- Danny: Interactions between microbes.
- Nov 15: Cam, Google Slides, PING-Mapper: Reproducible Substrate Mapping with Recreation-grade Sonar Systems. Predictive understanding of the variation and distribution of substrates at large spatial extents in aquatic systems is severely lacking. This hampers efforts to numerically predict the occurrence and distribution of specific benthic habitats important to aquatic species, which must be observed in the field. Existing survey methods are limited in scale, require heavy and technically sophisticated survey equipment, or are prohibitively expensive for surveying and mapping. Recreation-grade side scan sonar (SSS) instruments, or fishfinders, have demonstrated their unparalleled value as lightweight and easy-to-deploy systems for imaging benthic habitats efficiently at the landscape level. Existing methods for generating geospatial datasets from these sonar systems require a high level of interaction from the user and are primarily closed-source, limiting opportunities for community-driven enhancements. We introduce PING-Mapper, an open-source and freely available Python-based software for generating geospatial benthic datasets from recreation-grade SSS systems. PING-Mapper is an end-to-end framework for surveying and mapping aquatic systems at large spatial extents reproducibly, with minimal intervention from the user. Version 1.0 of the software (Summer 2022) decodes sonar recordings from any existing Humminbird® side imaging system, exports plots of sonar intensities and sensor-derived bedpicks, and generates georeferenced mosaics of geometrically corrected sonar imagery. Version 2.0 of the software, to be released Summer 2023, extends PING-Mapper functionality by incorporating deep neural network models that automatically locate and mask sonar shadows, calculate independent bedpicks from both side scan channels, and classify substrates at the pixel level. The widespread availability of substrate information in aquatic systems will inform fish sampling efforts, habitat suitability models, and the planning and monitoring of habitat restoration.
- Nov 22: Danny: Using Monsoon in an ML workflow; the code used in the presentation is available in this repository.
- Nov 29: Doris: Presenting my research topic: expanding the open-source ecosystem around data.table in R. Over the next two years, I will focus on enhancing functionality, improving documentation, etc. Find the slides for this presentation here
- Dec 6: Trevor
- Date TBD: Paul, experience doing regression analysis as an analyst in higher education.
- Date TBD. ROOM CHANGE: SICCS 102. LLNL postdoc Kowshik Thopalli <[email protected]> invited talk. Title: Improving Out-of-distribution Generalization of Deep Vision Models: Test-time and Train-time Adaptation Strategies. Abstract: This talk will focus on understanding and improving the robustness of deep vision models when training and testing data are drawn from different distributions. To overcome this challenge, a common protocol has emerged, which involves adapting models at test time with limited target data. I will first discuss a novel “align then adapt” approach that effectively exploits the latent space to adapt models without requiring access to source data. This method is highly effective across multiple architectures, complex distribution shifts, and modalities. However, in cases where extremely limited target data is available, test-time approaches may fail. I will consider the extreme case where only one target reference sample is available and present our efforts in designing an effective generative augmentation protocol that involves fine-tuning a generator with single-shot data and developing a source-free adaptation protocol with the synthetic data. Finally, in certain practical scenarios test-time adaptation can be cumbersome, and we often don’t have access to any target data. To address this, I will present a train-time protocol that utilizes data from multiple source domains to build generalizable computer vision models through a novel meta-learning paradigm. This approach is theoretically motivated and achieves excellent performance on more than 6 different datasets compared to several state-of-the-art baselines. Thus, through this talk, I will present a suite of techniques to improve the out-of-distribution generalization of models for various computer vision tasks.