Learning Journey and Reflections on useR! 2022 Conference Part 3

By Jeremy Selva in useR! 2022

July 2, 2022

Introduction

In this narrative, I will continue sharing my learning journey during Day 4 of useR! 2022 Virtual Conference.

Day 4

First Keynote: Junior R-core Experiences

Day 4 of the conference started with a keynote by the R Core Team, led by Sebastian Meyer, who had recently joined the R Core team.

The first part of the presentation summarised the major changes in R from version 4.1.3 to version 4.2.0, such as code highlighting in HTML documentation and a new function, Sys.setLanguage, which changes the language of R's messages.
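As a quick illustration ( assuming R 4.2.0 or later, and that translated messages for the chosen language are installed on your system ), the new function can be tried directly in the console:

```r
# Switch the language of R's messages, warnings and errors at runtime
Sys.setLanguage("fr")
log(-1)                 # the "NaNs produced" warning now appears in French
Sys.setLanguage("en")   # switch back to English
```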

It was from this part of the keynote that I learnt that there is actually a search engine for R-related functions from CRAN packages. Do take a look at the R blog for updates on the latest developments in R.

The second part shared a few bug report stories, followed by some tips for anyone who wishes to contribute and make R better. R actually has a website, R’s Bugzilla, for reporting potential bugs, along with a simple walkthrough of how to create a decent bug report. The presenter then asked for more people to test R, especially its pre-release versions, and to flag to the R Core team old bug reports in R’s Bugzilla which might have already been resolved in a later version of R.

After the keynote session, the R Core Team then had a panel discussion.

Package Dependencies

After the panel discussion, I attended the first two talks in the Package Development session followed by the last two talks in Building the R Community 2.

The first talk on Package Development was from Zuguang Gu, developer of many packages such as ComplexHeatmap and spiralize. The presentation was about an R package called pkgndep used to calculate the dependency heaviness of an R package and display the results in a heatmap.

When developing R packages, it is advisable to keep them lightweight ( depending on a small number of external R packages ). This is mainly to allow users to install R packages easily, by reducing the chance of failure due to an unsuccessful dependency installation. Thus, pkgndep is helpful for R package developers who wish to optimise their package dependencies.
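Based on my reading of its documentation ( a sketch only; do check the package manual for the exact interface ), a minimal pkgndep workflow looks something like this:

```r
library(pkgndep)
x <- pkgndep("ggplot2")  # analyse the dependency heaviness of a package
x                        # print a summary of the analysis
plot(x)                  # display the results as a heatmap
```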

This is somewhat similar to the R package dstr. The difference is that pkgndep presents a more detailed result as a heatmap rather than a network graph. On the other hand, dstr provides clearer advice to R developers on what they can do next to reduce their dependencies. Perhaps the teams behind pkgndep and dstr can collaborate and build on each other’s strengths to further improve the usefulness of their current creations.

pre-commits

The second talk, conducted by Lorenz Walthert, was a treat for me as I was unfamiliar with git pre-commit hooks. From my understanding, pre-commit hooks are widely used by Python developers due to the availability of the Python package pre-commit to do all the heavy lifting. Motivated by how useful git pre-commit hooks are in improving code quality, Lorenz created an R package, precommit, in the hope that more R package developers would be able to adopt this workflow easily.

  • πŸ“ Slides
  • πŸ‘¨πŸ»β€πŸ« Demo

Transitioning To R

After the second talk, I jumped to the session on Building the R Community 2. Luckily, this time when I got in, the third talk of the session had just begun. It was a presentation by Kieran Martin from Roche. I had watched an R Consortium session conducted by his team before on YouTube, titled Package Management at Roche, and found it very informative and educational.

It was delightful to see him presenting again at useR! 2022. This time Kieran shared his experience of his large Product Development team’s ongoing journey towards using R as one of their core data science tools, driven by the need to report clinical trial results in the most efficient way. Best practices were shared, like using Docker to fix the operating system, renv and internal package repositories to control the R packages used, and the R Validation Hub to validate R packages. He also provided some useful advice, and mistakes to avoid, to help people grow into R.
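As a sketch of one of these practices, a minimal renv workflow for pinning the R package versions a project uses looks like this:

```r
renv::init()      # create a project-local library and lockfile
# ... install or update packages as usual ...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # on another machine, reinstall those exact versions
```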

What was very touching was the team’s desire, in conjunction with like-minded pharmaceutical companies, to create a shared set of high-quality R packages for clinical reporting related analysis and for plotting tables and graphs, via the pharmaverse and Insights Engineering. This way, we do not have to see multiple R packages solving the same problem. Definitely something great to look into.

Reflections of a Research Software Engineer

The session was followed by a presentation by Nicholas Tierney, a research software engineer at the Telethon Kids Institute. Nicholas is also the developer of the two data visualisation R packages visdat and naniar.

The talk consisted of three parts: what kind of work he does as a research software engineer, a small summary of how he tried to maintain the greta R package ( as part of his job scope ) and the future of research software engineering in Australia.

I did gain some useful knowledge in package management, such as creating better snapshot tests, handling pull requests efficiently using pr_fetch and pr_finish from the usethis R package, and using glue to craft better command line messages for users.
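As a sketch of those helpers ( the pull request number here is hypothetical ):

```r
usethis::pr_fetch(123)  # check out the branch behind pull request #123
# ... review, edit and push ...
usethis::pr_finish()    # clean up: switch back and delete the local PR branch

# glue makes command line messages easier to compose
pkg <- "naniar"
glue::glue("Thanks for installing {pkg}!")
#> Thanks for installing naniar!
```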

Despite being busy in his research software engineer role, Nicholas is also active in the rOpenSci Social Coworking and Office Hours ( Asia Pacific Edition ), helping people with R-related issues. It was actually at this online event that I first met him. He is really a nice and approachable R enthusiast.

Synthetic Data Generation

Honestly the last set of sessions was another hard choice for me as it covered topics that I was unfamiliar with. I took a leap of faith and attended the session on Synthetic Data and Text Analysis.

The first talk was a short introduction to the simPop R package by Alexander Kowarik. It is an R package used to create synthetic data. Synthetic data generation is useful when there is a need to transform sensitive but complex original data into privacy-compliant data that still retains the complex structure of the original. The transformed data can then be used for training machine learning models, software testing and simulation studies. This was actually my first time learning about synthetic data and what it is used for. I was grateful to the speaker for giving such a clear presentation that I could understand.

textrecipes To Improve Preprocessing Of Textual Data

The second talk switched to the topic of Text Analysis with the R package textrecipes. It was presented by Emil Hvitfeldt, one of the authors of the e-book Supervised Machine Learning for Text Analysis in R. While I was new to text data analysis, I was able to understand the unique properties of such datasets and why it is so hard to transform them into meaningful numbers that a machine learning model can learn from. The presentation then showed, step by step, how textrecipes makes this transformation process less painful and yet flexible to users’ different needs.
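A hedged sketch of a typical textrecipes pipeline ( the data set and column names here are invented for illustration ):

```r
library(recipes)
library(textrecipes)

rec <- recipe(label ~ text, data = my_data) |>
  step_tokenize(text) |>                       # split the text into tokens
  step_stopwords(text) |>                      # remove common stopwords
  step_tokenfilter(text, max_tokens = 100) |>  # keep only frequent tokens
  step_tf(text)                                # compute term-frequency features
```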

As someone who had attended a book club on An Introduction to Statistical Learning with Applications in R ( ISLR ) conducted by the R4DS Online Learning Community, I found Emil’s ISLR tidymodels Labs notes very helpful to our cohort’s journey of using tidymodels to better understand some of the statistical learning methods.

Bivariate Data Generation Using Scagnostics

The next presentation was led by Janith Wanniarachchi regarding his R package called scatteR. scatteR is used to generate a bivariate data set based on a given set of scagnostics measurements.

I enjoyed the flow of the presentation as it was like telling a story. The protagonist who faced a stumbling block along the way managed to overcome it after being inspired by ice-cream sprinkles. I won’t spoil the plot any further.

Janith is another first-time presenter at an international conference. Do show him your support if you liked his presentation.

Making Better Forecasting Models By Integrating Sentiment Analysis With Topic Modeling

The last talk of the session was quite hard for me to understand because, as mentioned before, I am unfamiliar with textual data analysis. Nevertheless, I will try to explain what was going on to the best of my knowledge.

Olivier Delmarcelle showed how the R package sentopics could be used to integrate sentiment analysis and topic modeling of textual data to potentially make better forecasting models. In this presentation, the press conference documents of the European Central Bank were used as an example.

If you are unfamiliar with the terms topic modeling and sentiment analysis, take a look at these two YouTube videos by Julia Silge and Data Centric Inc. to learn what they are.

The press conference documents of the European Central Bank were first grouped into different dominant topics/themes, like inflation, economic growth and so on. Separately, sentiment analysis was applied to the documents to obtain two sentiment time series: the sentiment of the Economic Condition and of Monetary Policy over time.

With that, two forecasting models were constructed to see which one better predicts the European Central Bank’s decisions on the interest rate and the monthly targets of the asset purchase program. One model used only the two sentiment predictors (Economic Condition and Monetary Policy), while the other had an additional topic-specific sentiment predictor. The results showed that the additional topic-specific sentiment predictor improved the forecasting model.
In addition, Olivier also showed how sentopics can create time series plots showing how each topic/theme contributed to the sentiment of the Economic Condition.

Second Keynote: Teaching Accessibly And Teaching Accessibility

Mine Dogucu, one of the authors of the e-book Bayes Rules! An Introduction to Applied Bayesian Modeling, delivered the last keynote of the day, titled Teaching Accessibly and Teaching Accessibility. The talk consisted of three parts: Teaching Accessibly, Teaching Accessibility and some recommendations for the community, especially those involved in pedagogy.

To make teaching resources more accessible and inclusive, Mine first suggested that such educational materials should be open access (made available on the internet), as not all students are financially fortunate enough to afford textbooks. In addition, students are usually the ones who contribute the most to improving the quality of educational materials, by highlighting mistakes and providing alternative solutions to exercises. Such opportunities might be missed if the materials remained in print only.

Even when the teaching material is available online, accessibility can be further expanded to the visually impaired with the use of alternative text and screen readers. Being a strong advocate for accessible teaching, Mine went the extra mile by opening a feature request in 2020 for R Markdown to include an option for alternative text.

As the content of the Bayesian statistics e-book requires a strong mathematics background to comprehend, Mine tried to make the lessons more inviting for readers new to Bayesian statistics by complementing mathematical concepts with storytelling, step-by-step computing instructions and the use of relevant examples (weather, Spotify data, hotel bookings) that most laypersons could relate to. The goal is to encourage learners to embrace a growth mindset, learning from mistakes and not being discouraged when things are unclear on first encounter. Mistakes can still be good if they teach you something. More ways to make teaching materials accessible can be found in the paper Framework for Accessible and Inclusive Teaching Materials for Statistics and Data Science Courses.

When she had finished writing the e-book, Mine began to realise how little she knew about accessibility and was curious to know why this was so. She decided to look back at past teaching curricula and data analysis tools to see how much focus there was on accessibility awareness and support. Unfortunately, most had little to none. Knowing what needed to be done, she sought the support of Teach Access and motivated individuals like JooYoung Seo to spread the need for accessibility skills in data science education, for aspiring and experienced data scientists alike. Mine is currently working on designing curricula that include accessibility awareness and the use of assistive technologies in data science projects.

Mine proceeded with some technical details on how to use R to make data visualisation more accessible.

For image alternative text, take a look at Amy Cesal’s post on good practice guidelines for describing plots with alternative text. One can also use the VI function from the BrailleR package to generate alternative text automatically. More resources on accessibility can be found in this presentation on accessible data science from the RStudio Global Conference 2021, by Mine’s collaborator JooYoung Seo.
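For illustration ( assuming the knitr fig.alt chunk option and a ggplot2 plot ), alternative text can be supplied or generated like this:

```r
# In an R Markdown chunk header, supply alternative text directly:
#   {r, fig.alt="Scatter plot of car weight against fuel efficiency"}

# Or let BrailleR describe a ggplot2 plot automatically:
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
BrailleR::VI(p)  # prints a textual description of the plot
```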

Mine concluded the keynote with many recommendations for the community. Here are some of the points, summarised below:

  • A picture with alternative text is worth a thousand words and should be valued more highly than a diagram without one. Thus, writing alternative text via a git pull request is a good way to contribute to open source educational material.

  • Event organisers should take a more proactive approach by creating events/meetups that are more accommodating to the needs of individuals who need additional accessibility and not wait for them to make such a request before working on it.

  • Providing accessibility related support should not be a job assigned to volunteers alone. Event organisers must invest some resources in hiring professionals to lead and guide these volunteers and ensure that these services are implemented correctly.

  • Accessibility lessons should be taught to students as well as professionals, included in all programming languages, and applicable outside of the data science classroom. Given a (school or) work project, (students and) professionals should be (assessed and) held accountable for their accessibility practices.

  • Accessibility should be seen as a gift for everyone and not for a privileged few.

  • πŸ“ Slides

Conclusion

With that, I finished my participation in this online conference. A big thank you to the organisers for their hard work in making this event possible. It was indeed very fruitful. I am looking forward to watching the recordings of the talks that I missed.