2  Starting Out Tips

Data4All

Author

Ted Laderas, PhD

2.1 Rows and columns

  • Rows are for observations
  • Columns are for variables
  • Keep the data type the same in a column
    • Numbers (Integer vs Decimal)
    • Text
  • Avoid more than one table per spreadsheet
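
A small sketch of why a column should hold a single data type. The `weight` values below are invented for illustration; the point is only that one stray text entry makes R read the whole column as text.

```r
# Hypothetical data: the third weight was typed as free text.
mixed <- read.csv(
  text = "id,weight\n1,12.5\n2,13.1\n3,about 14\n",
  stringsAsFactors = FALSE
)
# The same data with a consistent numeric column.
clean <- read.csv(text = "id,weight\n1,12.5\n2,13.1\n3,14.0\n")

class(mixed$weight)  # the whole column is read as character
class(clean$weight)  # the column is read as numeric

# Arithmetic works directly on the consistently typed column.
mean_clean <- mean(clean$weight)

# Forcing the mixed column to numeric turns the text entry into NA.
coerced <- suppressWarnings(as.numeric(mixed$weight))
```

In the mixed version, the information in the third row is lost as soon as the column is converted to numbers.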

2.2 Be Consistent

  • Use consistent codes for categorical variables
  • Use a consistent fixed code for any missing values
  • Use consistent subject identifiers
  • Use consistent date formatting
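
To illustrate what inconsistent codes cost downstream, here is a minimal base R sketch that maps several spellings of the same category onto one code. The `sex_raw` values and the `lookup` table are hypothetical.

```r
# Hypothetical column where the same two categories were entered seven ways.
sex_raw <- c("M", "male", "Male", "F", "female", "FEMALE ", "m")

# Normalise case and stray whitespace first.
sex_norm <- tolower(trimws(sex_raw))

# Map every remaining variant to a single agreed code.
lookup <- c(m = "male", male = "male", f = "female", female = "female")
sex_clean <- unname(lookup[sex_norm])

# Count the categories after cleaning.
counts <- table(sex_clean)
```

If the codes are consistent from the start, none of this cleanup is needed.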

2.3 No empty cells

  • If the data is missing, explicitly encode it as missing
  • Avoid headers with more than one row
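
A short sketch of why an explicit missing code is safer than an empty cell, using base R's `read.csv()`. The subject IDs and genotypes are made up. In a text column, an empty cell is read as an empty string, while the explicit code `NA` is read as a missing value.

```r
# Hypothetical file: S02 has an empty cell, S03 is explicitly coded as NA.
csv_text <- "subject_id,genotype\nS01,WT\nS02,\nS03,NA\n"
dat <- read.csv(text = csv_text, na.strings = "NA", stringsAsFactors = FALSE)

# Only the explicitly coded cell is recognised as missing.
missing_flags <- is.na(dat$genotype)
n_missing <- sum(missing_flags)

# The empty cell survives as an empty string, which is easy to overlook.
n_empty <- sum(dat$genotype == "", na.rm = TRUE)
```

Two subjects lack a genotype here, but only one of them is counted as missing.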

2.3.1 Original

2.3.2 Better Version

2.3.3 Can we load the original?

What does it look like when we try to load?

2.3.4 Loading the Better Version

With the better version of the dataset, we can load it and group by variables.

Try changing the grouping variable from genotype to strain.
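
The grouping step might look roughly like the following in base R. The `mice` data frame and its `weight_g` column are invented for this sketch; only the `genotype` and `strain` column names come from the text above.

```r
# Hypothetical tidy dataset: one row per animal, one column per variable.
mice <- data.frame(
  strain   = c("B6", "B6", "129", "129"),
  genotype = c("WT", "KO", "WT", "KO"),
  weight_g = c(24.1, 22.3, 26.0, 23.5),
  stringsAsFactors = FALSE
)

# Mean weight per genotype.
by_genotype <- aggregate(weight_g ~ genotype, data = mice, FUN = mean)

# Switching the grouping variable only requires changing the formula.
by_strain <- aggregate(weight_g ~ strain, data = mice, FUN = mean)
```

Because each variable has its own column, changing the grouping is a one-word edit.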

2.4 Choose Good Names for Things

  • Variable names: avoid spaces, avoid starting with numbers.
  • Do: Use numbers and letters, avoid special characters & symbols
  • Do: use _ instead of spaces
  • Have an internal column name (max_temp) and a displayed name (Maximum Temp (°C))

Why is this necessary?

  • Spaces can be hard to deal with in variable names
  • Special characters may not be accepted in R/Python
# Names with spaces or symbols must be wrapped in backticks (filter() is from dplyr)
my_data |>
  filter(`Maximum Temp (°C)` > 10)
  • By default, read.csv() in R will not keep column names that begin with a number (it changes 1st_place to X1st_place)
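
The renaming can be seen directly with base R's `read.csv()`, which by default passes column names through `make.names()`. The tiny CSV below is hypothetical.

```r
# Hypothetical file with one name starting with a digit and one containing a space.
csv_text <- "1st_place,max temp\nAda,12\n"

# Default behaviour (check.names = TRUE): names are altered to be syntactically valid.
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the original names are kept,
# but they then need backticks whenever they are used in code.
raw <- read.csv(text = csv_text, check.names = FALSE)
raw_names <- names(raw)
winner <- raw$`1st_place`
```

Either way the awkward name causes extra work, which is why it is easier to choose a valid name in the spreadsheet.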
For Data Scientists: janitor::clean_names()
  • clean_names() removes spaces and special characters and converts names to snake case by default
  • For example, Maximum Temp (°C) becomes maximum_temp_c
  • Converts capital letters to lowercase
  • Converts accented characters and diacritics to plain ASCII letters
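
To make the idea concrete without installing anything, here is a rough base R approximation of this kind of name cleaning. The helper `to_snake_case()` is hypothetical and is not janitor's implementation: it does not transliterate accented letters, and prefixing names that start with a digit with `x` is a choice made for this sketch.

```r
# Hypothetical helper: a simplified stand-in for automated name cleaning.
to_snake_case <- function(x) {
  x <- tolower(x)                       # drop capitalization
  x <- gsub("[^a-z0-9]+", "_", x)       # runs of other characters become one underscore
  x <- gsub("^_+|_+$", "", x)           # trim leading and trailing underscores
  ifelse(grepl("^[0-9]", x), paste0("x", x), x)  # avoid names starting with a digit
}

# "\u00b0" is the degree sign, written as an escape to keep the code ASCII-only.
messy <- c("Maximum Temp (\u00b0C)", "1st Place", "Subject ID")
cleaned <- to_snake_case(messy)
```

In practice the packaged function handles many more edge cases, so this is only meant to show what the transformation does.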

2.5 Put just one thing in a cell

  • Avoid combining columns into a single column
  • Avoid putting multiple bits of information in a cell

Example:

Better

Data Science Tools: tidyr::separate()
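
As a sketch of what splitting a combined column involves, the base R code below separates a hypothetical `plate_well` column into `plate` and `well`. The data and column names are invented; the comment shows the roughly equivalent `separate()` call from the tidyr package.

```r
# Hypothetical column that stores two pieces of information in one cell.
samples <- data.frame(
  plate_well = c("P01-A1", "P01-B3", "P02-C7"),
  stringsAsFactors = FALSE
)

# Split each value at the hyphen and bind the pieces into a two-column matrix.
parts <- do.call(rbind, strsplit(samples$plate_well, "-", fixed = TRUE))

# Store each piece in its own column and drop the combined one.
samples$plate <- parts[, 1]
samples$well <- parts[, 2]
samples$plate_well <- NULL

# With tidyr installed, roughly the same result comes from:
# tidyr::separate(samples, plate_well, into = c("plate", "well"), sep = "-")
```

Keeping the two values in separate columns from the start avoids this step entirely.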

2.6 Write Dates as YYYY-MM-DD

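  • DD-MM-YYYY is easily confused with MM-DD-YYYY and does not sort chronologically as text
  • Read YYYY-MM-DD values in as text when you load, so they are not silently reinterpreted
  • Or store the date as an integer in YYYYMMDD form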
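
The following base R sketch illustrates both points: the same day-first string can be parsed as two different dates, while YYYY-MM-DD text sorts chronologically and converts cleanly to a date or a YYYYMMDD integer. All dates are made up.

```r
# The same string is a different date depending on the assumed order.
ambiguous <- "03-04-2023"
as_dmy <- as.Date(ambiguous, format = "%d-%m-%Y")  # read as 3 April 2023
as_mdy <- as.Date(ambiguous, format = "%m-%d-%Y")  # read as 4 March 2023

# YYYY-MM-DD stored as text: an ordinary text sort is also a chronological sort.
iso_text <- c("2023-11-02", "2023-02-15", "2022-12-31")
sorted_iso <- sort(iso_text)

# Convert the text to real dates only when needed.
iso_dates <- as.Date(iso_text)

# The integer alternative, and the conversion back to a date.
as_int <- as.integer(format(iso_dates, "%Y%m%d"))
back <- as.Date(as.character(as_int), format = "%Y%m%d")
```

The round trip through the integer form returns the original dates, so nothing is lost by storing them that way.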