Learning Journey and Reflections on useR! 2022 Conference Part 2

By Jeremy Selva in useR! 2022

July 1, 2022

Introduction

In this narrative, I will continue sharing my learning journey, this time covering Day 3 of the useR! 2022 virtual conference.

Day 3

First Keynote: afrimapr

The first keynote was by the afrimapr project team. The project aims to use R as a building block for many things, such as better management of data related to Africa, the creation of open source analytical tools to analyse and provide insights across Africa, the provision of teaching resources and training for those interested in this novel work, and the building of a strong community with this common interest.

Anelda van der Walt shared the progress the team had made since the project was first launched in 2020, such as R packages, interactive apps, teaching materials and even publications of data in journals.

The next phase of the talk was a sharing from Clinton Nkolokosa about his experiences as a geospatial researcher working on projects such as malaria incidence, flood incidence and COVID-19 cases. Such meaningful accomplishments were hard to obtain, as he highlighted challenges such as the difficulty of getting up-to-date data (Clinton ended up having to update the data on his own) and of finding talented local people (R gurus) doing similar work for help and consultation, due to a lack of funding and opportunities. He stressed the importance of having more individuals from the African R community step out of their comfort zones and become drivers of innovative ideas and resources.

The following phase of the keynote focused mainly on effective teaching. Anne Treasure presented the usefulness of learnr for creating online tutorials to support novice learners of R. She then showed how the team went one step further by making these materials more accessible, for example by:

  • Teaching them in French, besides English.
  • Using RStudio Cloud so that more people could participate in the courses simultaneously.
  • Making materials downloadable for local use.
  • Running pre- and post-course surveys to understand attendees’ concerns and improve future teaching sessions.

Nono Gueye then showed how having teaching resources in other languages, such as French, can benefit more people in multilingual Africa, and gave some existing R resources in French, such as tidyverse.

The last phase of the keynote was about the status of R communities in Africa. Anelda showed statistics indicating that, compared to North America, the R communities in Africa were not doing as well, with a significantly lower number of active groups and organised events. This implied that the transmission of R knowledge from one individual to another in Africa was quite low.

Anelda then gave a few reasons why it is hard to build and sustain an R community. One reason was that building one requires a lot of skill and hard work, which is daunting if the organising team has just a few people. Another reason was that the people who built such communities were mainly volunteers, who might be too overburdened by work and/or family commitments to run regular events. Low participation rates in such organised events also took a mental toll on the organisers. The last reason was a lack of funding and resources.

The keynote concluded first by appealing to the African R community to support more new R users locally, to collaborate locally and internationally, and to experiment with newly established approaches to see whether they help or not. It then appealed to the global (funding) community to find ways to reward people for their time running events, to invest more in multilingual materials, and to identify and support locally led, custom-developed initiatives that work in emerging regions.

Poster Presentation

renderthis

I first entered the Dissemination of Information room because I was curious about renderthis, presented by John Paul Helveston. It was formerly called xaringanBuilder and is one of the extensions of Xaringan.

For those who are new to Xaringan, it is an alternative way to create presentation slides in HTML. A decent quick-start guide can be found in Silvia Canelón’s blog post Deploying xaringan Slides with GitHub Pages. Here is a link to the Xaringan gallery for other examples. To my knowledge, the other extension packages are xaringanthemer and xaringanExtra.

It was eye-opening to see how many formats Xaringan slides could be converted to using renderthis, from the usual HTML and PDF files to even PNG images. Currently, I follow the instructions in Garrick Aden-Buie’s blog post Printing xaringan slides with chromote to output Xaringan slides as PDF. Maybe I should try to see if I can do the same with renderthis.

With renderthis, users can also specify which Xaringan slides to convert, as in the sketch below. Future work involves being able to output Xaringan slides as a Microsoft PowerPoint handout with three slides per page.
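
As a rough sketch of what that could look like (the file name slides.Rmd is hypothetical; check the renderthis documentation for the exact arguments):

```r
library(renderthis)

# Render xaringan slides to PDF (what I currently do via chromote)
to_pdf("slides.Rmd")

# ...or to PNG images, optionally picking only specific slides
to_png("slides.Rmd", slides = c(1, 3))
```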

animate

After that, I jumped to the Data Visualisation room.

I was looking forward to the presentation on the R package animate by Jackson Kwok. You see, I had read the poster in Microsoft PowerPoint beforehand, and my jaw dropped when I saw that the animation featured in the article titled The New Science of Sentencing could be replicated using R. While reading the documentation of animate, I found out that Jackson had also made some Xaringan slides introducing the package. Do take a look if you need more information. Unfortunately, when I entered the virtual lounge, it had just ended…

Polished Annotations

Putting this aside, the next presentation was by Cara Thompson, sharing useful tips and tricks for making polished annotations in ggplot. Making annotations in ggplot had always been really challenging for me. I usually spent a lot of time tweaking fixed parameters to move the annotations from one place to another. Most of the time I just gave up and used Microsoft PowerPoint to add these extra text boxes.

Cara’s success in finding several clever strategies to make polished annotations, using ggtext, glue, CSS formatting and a filtered tibble with unique arrow curvature information, did make me smile. Her tips can also be found in this talk post. One of her blog posts, Alignment cheatsheet, may be useful for those who need a little help with the alignment parameters.
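
Cara’s exact code is in her talk materials; below is only my own minimal sketch of the general idea, with hypothetical annotation text and positions. It combines ggtext for rich-text labels with a small tibble that is filtered per unique curvature value, since curvature is a layer parameter rather than an aesthetic in geom_curve().

```r
library(ggplot2)
library(ggtext)  # rich-text (markdown/CSS) annotations
library(dplyr)

# Hypothetical annotation data: position, label text and arrow curvature
annotations <- tibble::tribble(
  ~x,  ~y, ~label,                           ~curvature,
  3.2, 30, "**Outlier** - checked manually",  0.3,
  4.5, 18, "Typical cluster",                -0.2
)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_richtext(
    data = annotations,
    aes(x = x, y = y, label = label),
    fill = NA, label.color = NA  # text only, no background box
  )

# curvature is a parameter (not an aesthetic), so add one geom_curve()
# layer per unique curvature value, filtering the annotation tibble
for (curv in unique(annotations$curvature)) {
  p <- p + geom_curve(
    data = filter(annotations, curvature == curv),
    aes(x = x, y = y - 1, xend = x + 0.3, yend = y - 3),
    curvature = curv,
    arrow = arrow(length = grid::unit(2, "mm"))
  )
}

p
```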

Two-Sample Corrgram

The last presentation in the Data Visualisation room was by Rohan Tummala. This visualisation is a unique way to present the correlation results of dichotomous data efficiently within the space of a single 2D heatmap matrix. More information can be found on the project webpage.

It definitely has a lot of potential to be improved by the R community. Here are my “5-cent” suggestions. The team may wish to use the R package correlation from the easystats team to expand on their current correlation methods of Pearson, Kendall and Spearman.
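
For example (just a sketch using a built-in dataset), the correlation package already supports methods well beyond those three:

```r
library(correlation)

# Biweight midcorrelation, one of the many methods available in {correlation}
correlation(mtcars[, c("mpg", "hp", "wt", "qsec")], method = "biweight")
```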

As for the question raised about extending the static plot to multichotomous data (or a k-sample corrgram), one idea I can think of is to adopt the scatter-plot-matrix style, where the diagonal entries are the groups and the scatter plots are replaced with two-sample corrgrams instead. However, this in turn reintroduces the same redundant space that the two-sample corrgram is supposed to avoid.

Regardless, if this visualisation continues to receive good feedback, someone will likely create an interactive two-sample corrgram and publish it as a web tool.

pacs

I next entered the R in Production room, where I was first introduced to the R package pacs by Maciej Nasinski. It contains a set of utility functions for making R package developers’ lives easier, such as automating the validation of a renv lock file, finding out which packages are not from CRAN, and so on. More information can be found in its documentation.
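
From my recollection of the documentation (please verify against the package reference, as these calls are written from memory), the usage looks roughly like this:

```r
library(pacs)

# Validate the packages recorded in an renv lock file
lock_validate("renv.lock")

# Validate the currently installed library for version inconsistencies
lib_validate()
```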

svgtools

The second sharing was from Konrad Oberwimmer, who introduced the R package svgtools, which can insert statistical results into chart templates made in the SVG file format. This is helpful when there is a need to create graphs that must follow a certain layout based on corporate requirements.

data.validator

The next showcase was about data.validator by Marcin Dubel. Honestly, this is an R package that I wish I had known about earlier in my data analysis journey. While it is great to have tools that validate data, it is even better when a validation report, highlighting the possible issues of a given dataset clearly and explicitly, can be created and distributed to others.

data.validator not only uses assertr to do the validation but is also able to create an interactive report in HTML. More details can be found in this YouTube video as well as this R-bloggers post.
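
A minimal sketch of how this might look, based on the package documentation as I understand it (the column checks and file names here are purely illustrative):

```r
library(data.validator)
library(assertr)

report <- data_validation_report()

validate(mtcars, name = "mtcars checks") |>
  validate_cols(in_set(c(0, 1)), am, vs, description = "am and vs are binary") |>
  validate_cols(not_na, mpg, description = "mpg has no missing values") |>
  validate_if(mpg > 0, description = "mpg is positive") |>
  add_results(report)

# Save an interactive HTML report that can be shared with collaborators
save_report(report, output_file = "validation_report.html")
```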

Continuous Integration In GitLab For R

Lastly, Cody Marquart presented a workflow on how to manage R packages automatically using GitLab Continuous Integration (CI). The workflow was first introduced at an April 2020 GitLab meetup. It was good to know that CI can be applied to R package management in GitLab.
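
For those curious what such a pipeline might look like, here is a minimal, hypothetical .gitlab-ci.yml sketch for running R CMD check on a package (the actual configuration used in the talk may differ):

```yaml
# Minimal illustrative GitLab CI pipeline for checking an R package
image: rocker/r-ver:4.2.1

stages:
  - check

r-cmd-check:
  stage: check
  script:
    - R -e 'install.packages(c("remotes", "rcmdcheck"))'
    - R -e 'remotes::install_deps(dependencies = TRUE)'
    - R -e 'rcmdcheck::rcmdcheck(args = "--no-manual", error_on = "error")'
```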

Usually, CI for R packages is done using GitHub Actions on GitHub because of the many useful resources available to help users create CI tasks easily. Thus, GitHub Actions has become a popular choice for R package management.

Nevertheless, as it is unwise not to have a backup plan, being aware that an alternative workflow exists is critical. Take a look at one of the GitLab CI examples used in the R package shinyLogger. Hopefully, I will get to see a similar workflow for managing R packages using Bitbucket in the future.

Clinical Data Review Reporting Tool

Day 3 for me was filled with many sessions that I was interested in, like Data Visualization, Publishing and Reproducibility, as well as Web Frameworks. Unfortunately, I could not join them all at the same time. After some thought, I decided to attend the session on Publishing and Reproducibility.

The first talk was from Laure Cougnaud from OpenAnalytics, introducing the R package clinDataReview, which is used to produce a medical report in bookdown, filled with interactive plots, to understand how patients are doing during a clinical study. An example of such a report can be found at this link. The report is created as a folder of mainly HTML (and other) files, which is then distributed to the clinicians. The key to making this possible is the use of a standard report template in R Markdown as well as a configuration file in YAML, which lets users set specific parameters to make the report customisable and flexible to their needs.

I was first introduced to R Markdown report templates when I attended the RStudio Conference 2020 in San Francisco, in a talk by Sharla Gelfand titled Don’t repeat yourself, talk to yourself! Reporting with R. I highly recommend watching it, as it is really down to earth and funny at the same time. Sharla also provided a blog post showing, step by step, how to produce an introductory R package with R Markdown report templates.
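
As a quick sketch of the idea (the package and template names here are hypothetical): a package ships a template under inst/rmarkdown/templates/, and a new pre-filled report can then be started from it with rmarkdown::draft().

```r
# Assuming a package "mypkg" ships a template in
# inst/rmarkdown/templates/monthly-report/ (template.yaml + skeleton/skeleton.Rmd),
# a pre-filled report can be created with:
rmarkdown::draft(
  "2022-07-report.Rmd",
  template = "monthly-report",
  package  = "mypkg"
)
```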

The work done by Laure Cougnaud and her team brings this workflow to the next level by extending it from R Markdown to bookdown reports. Definitely a job well done.

knitr Engines and blogdown Troubleshooting

The next talk was by Christophe Dervieux, giving an introductory tour of the different knitr engines. The available knitr engines can be listed in R using the command names(knitr::knit_engines$get()). More information on commonly used engines can be found in the R Markdown e-book.

I was then introduced to the latest knitr engine, called exec, which allows command-line programs to be executed in an R Markdown code chunk. The talk also guided users on how to create custom knitr engines.
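
Registering a custom engine boils down to adding a function to knitr::knit_engines. A toy sketch (my own example, not from the talk):

```r
# A custom "shout" engine that simply upper-cases the chunk body
knitr::knit_engines$set(shout = function(options) {
  out <- toupper(paste(options$code, collapse = "\n"))
  knitr::engine_output(options, code = options$code, out = out)
})

# A chunk declared with the shout engine will now be rendered by this function
```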

One major takeaway from this talk was that there is actually a GitHub repository showing R Markdown examples of many different knitr engines, including the latest exec engine, found here.

After enjoying the tour of knitr engines, the session proceeded with Yihui Xie sharing a helpful summary of how to create a blog using blogdown with minimal issues, using blogdown::serve_site() and blogdown::check_site(). Hopefully blogdown::check_site() can become part of the RStudio Addins for blogdown in the future.
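
For reference, both helpers are run straight from the R console while working on the site:

```r
blogdown::serve_site()  # preview the site locally with live reload
blogdown::check_site()  # run a series of checks on the config, content and setup
```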

For me, I started to learn how to create this Hugo Apéro themed blogdown site by watching a YouTube video of a lesson conducted by Alison Hill for R Ladies Tunis. However, if a two-hour lesson is too long, may I direct you to Alison Hill’s Day 09: Hugo Apéro from scratch || rmarkdown + blogdown + Netlify from the #12DaysOfDusting series instead.

Reliable Scientific Software Development

The last presentation of this session was by Meike Steinhilber, developer of the R package sprtt (Sequential Probability Ratio Tests Using The Associated t-statistic) and its Shiny counterpart sprit. The talk focused on ways to develop reliable scientific software using best practices for software development.

In summary, the best practices were having clean and refactored code, software testing, continuous integration, version control and extended documentation. The examples used were from her R package sprtt.

This is great advice for someone who has just started to accumulate many long R scripts and wishes to make them more manageable and sustainable. As this was also the speaker’s first conference presentation, do show her some love and support if you liked it.

The presentation gave me a nostalgic feeling, as I had presented something similar at the PyData Global 2021 conference, using a Python-made software, MSOrganiser, as an example. That conference was my first as well. Instead of reliability, however, my presentation focused on tips to make the software more user friendly and less intimidating for new users, as well as ways to deal with the angry ones.

Second Keynote: Applied Machine Learning With tidymodels

Day 3 of the conference ended with a keynote from Julia Silge, author of Supervised Machine Learning for Text Analysis in R. She has also posted many tidymodels-related lessons on YouTube, as well as the online tutorial Text mining with tidy data principles.

The keynote consisted of a brief introduction to machine learning, followed by an overview of the e-book Tidy Modeling with R, which she is working on. Julia then presented three things that make the job of a machine learning practitioner challenging and showed how tidymodels helps the practitioner cope with those challenges.

The first challenge for a machine learning practitioner is to decide how much data should be used as training data to train the model and how much as test data to evaluate the model. If we allocate too much to the training data, too little will be left to reliably assess how well the model generalises to new, unseen data. On the other hand, allocating too much to the test data leaves too little for training, giving an under-optimised model that makes inconsistent predictions on input data with similar properties.

The tidymodels package best suited for this is rsample, which provides not just data splitting between training and test sets, but also resampling of the training data (simulated versions of the training data) by cross-validation or bootstrapping for the tuning of the model’s more complex parameters.
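
A minimal sketch of what that looks like with rsample (using a built-in dataset purely for illustration):

```r
library(tidymodels)  # loads rsample, recipes, workflows, parsnip, ...

set.seed(123)

# Hold out 25% of the rows for testing
data_split <- initial_split(mtcars, prop = 0.75)
train_set  <- training(data_split)
test_set   <- testing(data_split)

# Resample the training data with 5-fold cross-validation for model tuning
folds <- vfold_cv(train_set, v = 5)
```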

Another tough decision a machine learning practitioner has to make is determining when the model building process starts and ends. To do this, a proper model building workflow is required. A common misconception is the belief that the model building process starts only when we begin fitting a given model to the training data. Julia warned that such a workflow is susceptible to data leakage. Instead, the model building workflow should include any preprocessing steps, feature engineering processes, the model fit itself and post-processing activities.

However, managing such a complex workflow can be hard, especially when many different kinds of models are used. As such, the recipes and workflows R packages from tidymodels were created to help machine learning practitioners better organise their machine learning projects.
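
Continuing the sketch above (with tidymodels already loaded), preprocessing and the model specification can be bundled into a single workflow object, so that preprocessing is treated as part of model building rather than as a separate step:

```r
# Preprocessing recipe: applied inside the workflow, so the same steps are
# re-estimated on each resample and nothing leaks from the test set
rec <- recipe(mpg ~ ., data = train_set) |>
  step_normalize(all_numeric_predictors())

# A simple linear regression model specification
spec <- linear_reg() |>
  set_engine("lm")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

wf_fit <- fit(wf, data = train_set)
```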

After finalising the machine learning model, the machine learning practitioner can deploy it into the external client’s system. However, the work does not end there. Over time, the deployed model may become less effective at making good predictions because the properties of the input data have changed. The machine learning practitioner must monitor the model and wisely determine when it is time to collect new data and create an updated model that performs better than the previous one. As many versions of models are created, a report may be required to give the external clients transparency on how well different models are doing compared to their predecessors. This process of model maintenance is also called Machine Learning Operations, or MLOps for short.

To facilitate a smooth MLOps process for the Python and R programming languages, the last R package that Julia introduced was vetiver. vetiver automatically generates Dockerfiles so that the trained model can be deployed easily. Moreover, vetiver is able to create APIs for the machine learning practitioner to monitor how well the trained model is performing, with the option of providing a report card to summarise how the model is doing over time. In summary, vetiver is an open source tool created to provide a decent framework to version, share, deploy and monitor a trained model.
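
A rough sketch of how this looks in R, based on the vetiver documentation as I recall it (the board, model name and port are illustrative; a real deployment would use a persistent pins board such as RStudio Connect or S3):

```r
library(vetiver)
library(pins)
library(plumber)

# Version and share the fitted workflow from the earlier sketch
v     <- vetiver_model(wf_fit, model_name = "mpg-model")
board <- board_temp()  # illustrative only; use a persistent board in practice
vetiver_pin_write(board, v)

# Generate a plumber API file and a Dockerfile for easy deployment
vetiver_write_plumber(board, "mpg-model")
vetiver_write_docker(v)

# Or serve the prediction API directly (this call blocks the R session)
pr() |>
  vetiver_api(v) |>
  pr_run(port = 8080)
```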

The keynote was indeed rich in content regarding the best practices for machine learning.

Conclusion

That is all I have learnt from Day 3. It has been a long one. Only one more day to go.