Creating Effective Prompts for R Coding

R Programming
LLMs
Tutorials
A guide to constructing high-quality prompts for leveraging Large Language Models (LLMs) in R.
Authors

Jared Klug

Shaun Porwal

Published

December 24, 2024

Fundamental Principles of Prompt Construction

Set Clear Goals

  • Specify the task and desired outcome.
  • Eliminate ambiguity in the requirements.
Bad Prompt

TASK: Text cleaning function
INPUT: character vector
REQUIREMENTS:
- Remove special characters
- Convert to lowercase
OUTPUT: cleaned vector

Good Prompt

Create an R function that cleans text data by removing special characters and converting to lowercase. The input should be a character vector.

Please provide the code and a brief explanation of your approach.

Use Positive Language

Frame requests in terms of “do this” rather than “don’t do this.” This encourages clarity and actionable guidance.


Natural Language and Rigid Formatting

Modern LLMs excel at understanding natural language. Keep prompts conversational but specific.

Provide Context and Specifications

  • Specify the desired output format (e.g., R function, script, or data frame).
  • Include relevant libraries or packages.
Good Prompt
Using the tidyverse ecosystem in R, write code to analyze my sales data with these specifications:

sales_data <- data.frame(
  date = c("2024-01-01", "2024-01-02", "2024-01-01"),
  region = c("North", "South", "North"),
  sales = c(1200, NA, 1500),
  units = c(50, 45, 60)
)

Requirements:
1. Calculate mean sales and total units by region and month.
2. Format dates appropriately.
3. Return results in a tidied data frame.
Bad Prompt
Average my data in R.

sales_data <- data.frame(
  date = c("2024-01-01", "2024-01-02", "2024-01-01"),
  region = c("North", "South", "North"),
  sales = c(1200, NA, 1500),
  units = c(50, 45, 60)
)

Encourage Reasoning and Explanation

  • Request step-by-step explanations in addition to code.
  • Ask for comments within the code to clarify its functionality.
  • Request a summary of the approach or methodology used.

Optimize for R-Specific Tasks

Data Manipulation

Specify the preferred libraries or frameworks for data handling, such as dplyr.

Visualization

Explicitly mention visualization libraries like ggplot2 or plotly.

Good Prompt
Create a ggplot2 visualization in R that shows the relationship between `mpg` and `hp` from the mtcars dataset. The plot should:
- Use a scatter plot with a smooth trend line.
- Color the points by the `cyl` variable.
- Use a minimal theme.

Improve Code Quality

  • Request performance optimization and adherence to established style guides.
  • Suggest refactoring to ensure modular, reusable code.
Good Prompt
Refactor the following R code to improve performance and readability using the tidyverse style guide and vector operations.

my_function <- function(df) {
  result <- data.frame()
  for (i in 1:nrow(df)) {
    if (!is.na(df$value[i]) & df$category[i] == "A") {
      temp <- data.frame(
        id = df$id[i],
        transformed = sqrt(df$value[i]),
        group = df$category[i]
      )
      result <- rbind(result, temp)
    }
  }
  return(result)
}

Leverage R’s Strengths

  • Vectorization: Request the use of vectorized operations where applicable.
  • Functional Programming: Encourage using apply functions or the purrr library for functional programming paradigms.
  • Data Types: Specify the use of appropriate data structures like lists or data frames to improve efficiency.

Iterative Improvement

  1. Start with a basic prompt and iterate based on feedback.
  2. Review the generated output and refine your requirements.
  3. Use specific follow-ups to clarify or improve aspects of the output.
  4. Restart with a refined prompt when the direction changes significantly.

Summary

Do:

  • Clearly specify the required packages and dependencies.
  • Include sample data or a detailed description of the expected data structure.
  • Request robust error handling and input validation.
  • Break down complex tasks into smaller, manageable prompts.
  • Ask for explanations of choices or methods used.

Don’t:

  • Provide ambiguous or incomplete requirements.
  • Overload prompts with unrelated tasks.
  • Assume implicit knowledge of specific packages without mentioning them.
  • Neglect to specify data types or structures explicitly.