2 Starting Out Tips

Data4All

Author

Ted Laderas, PhD

2.1 Rows and columns

Rows are for observations
Columns are for variables
Keep the data type the same in a column
- Numbers (Integer vs Decimal)
- Text
Avoid more than one table per spreadsheet
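
The "same data type in a column" rule can be checked before any analysis starts. The following is a minimal sketch using only the Python standard library; the table, the column names (`mouse_id`, `weight_g`) and the helper functions are made up for illustration.

```python
import csv
import io

# Hypothetical data: the weight_g column mixes numbers with free text.
raw = """mouse_id,weight_g
m01,21.5
m02,about 20
m03,19
"""


def is_number(value):
    """Return True if the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def non_numeric_values(rows, column):
    """Collect the values in `column` that cannot be read as numbers."""
    return [row[column] for row in rows if not is_number(row[column])]


rows = list(csv.DictReader(io.StringIO(raw)))
bad = non_numeric_values(rows, "weight_g")
print(bad)  # any entry listed here breaks the "one type per column" rule
```

A single text entry such as "about 20" is enough to make most tools treat the whole column as text.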

2.2 Be Consistent

Use consistent codes for categorical variables
Use a consistent fixed code for any missing values
Use consistent subject identifiers
Use consistent date formatting
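
One way to spot inconsistent codes is to count the distinct values in a categorical column and compare them with the codes that were agreed on. This is only a sketch: the values, the allowed codes and the missing-value code `NA` are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical values for one categorical column; the codes are made up.
sex = ["M", "F", "male", "F", "Male", "NA", "M", ""]

ALLOWED = {"M", "F"}  # the agreed category codes (assumed)
MISSING = "NA"        # the single agreed missing-value code (assumed)

counts = Counter(sex)

# Anything that is neither an agreed code nor the missing code needs fixing.
unexpected = sorted(v for v in counts if v not in ALLOWED and v != MISSING)
print(unexpected)
```

Here "male", "Male" and the empty cell would all be flagged, because none of them matches the agreed codes.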

2.3 No empty cells

If the data is missing, explicitly encode it as missing
Avoid headers with more than one row

2.3.1 Original

2.3.2 Better Version

2.3.3 Can we load the original?

What does it look like when we try to load?
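
As an illustration of the kind of problem to expect, here is a sketch that loads a small made-up file with a two-row header and an empty cell. It is not the dataset from the slides; the layout and values are assumptions, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical file: the header is spread over two rows and one cell is empty.
raw = """,weight,weight
genotype,day1,day2
wt,20.1,21.0
ko,18.4,
"""

# DictReader treats only the FIRST row as the header.
records = list(csv.DictReader(io.StringIO(raw)))

# The second header row is read as if it were data, and the two columns
# that share the name "weight" collapse into one key (the last one wins).
print(records[0])
print(len(records))
```

The first "record" is really the second header row, and one of the two weight columns is silently lost.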


2.3.4 Loading the Better Version

With the better version of the dataset, we can load it and group by variables.

Try changing the grouping variable from genotype to strain.
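
A minimal sketch of this kind of grouping, using only the Python standard library. The column names `genotype` and `strain` come from the text above; the rows, the `mouse_id` and `weight_g` columns, and the `mean_by` helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical tidy data: one header row, one observation per row.
raw = """mouse_id,genotype,strain,weight_g
m01,wt,B6,20.0
m02,wt,129,22.0
m03,ko,B6,18.0
m04,ko,129,19.0
"""


def mean_by(rows, group_col, value_col):
    """Average `value_col` within each level of `group_col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(float(row[value_col]))
    return {key: mean(values) for key, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(raw)))
by_genotype = mean_by(rows, "genotype", "weight_g")
by_strain = mean_by(rows, "strain", "weight_g")  # only the column name changes
print(by_genotype)
print(by_strain)
```

Because every variable has its own column, switching the grouping variable is a one-word change.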


2.4 Choose Good Names for Things

Variable names: avoid spaces, avoid starting with numbers.
Do: Use numbers and letters, avoid special characters & symbols
Do: use _ instead of spaces
Have an internal column name (max_temp) and a displayed name (Maximum Temp (°C))

Why is this necessary?

Spaces can be hard to deal with in variable names
Special characters may not be accepted in R/Python

library(dplyr)

# Backticks are required because the column name contains spaces and symbols
my_data |>
   filter(`Maximum Temp (°C)` > 10)

R doesn’t allow column names to begin with numbers (by default, read.csv() changes 1st_place to X1st_place)

For Data Scientists: janitor::clean_names()

clean_names() removes spaces and special characters, and uses snake case by default
Transforms Maximum Temp (°C) to maximum_temp_c
Removes capitalization
Removes accents and diacriticals
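
For Python users, the same idea can be approximated in a few lines. This is a rough sketch of the cleaning steps listed above, not janitor's actual implementation; the function name `clean_name` and the choice to prefix `x` to names that start with a digit are assumptions made for the example.

```python
import re
import unicodedata


def clean_name(name):
    """Turn a display name into a lowercase snake_case column name."""
    # Split accented letters into base letter + accent, then drop
    # everything that is not plain ASCII (accents, the degree sign, ...).
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of spaces or special characters with one underscore.
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name).strip("_").lower()
    # Avoid names that start with a number.
    if snake and snake[0].isdigit():
        snake = "x" + snake
    return snake


for original in ["Maximum Temp (°C)", "1st place", "Café Visits"]:
    print(original, "->", clean_name(original))
```

Keeping the original display names in a separate lookup makes it easy to restore them for tables and plots.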

2.5 Put just one thing in a cell

Avoid combining columns into a single column
Avoid putting multiple bits of information in a cell

Example:

Better

Data Science Tools: tidyr::separate()
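
A rough Python counterpart, shown as a sketch only: it splits a combined cell into two columns. The combined values, the underscore separator and the `separate` helper are made up for illustration.

```python
# Hypothetical combined values: strain and genotype joined by an underscore.
combined = ["B6_wt", "B6_ko", "129_wt"]


def separate(value, sep="_", maxsplit=1):
    """Split one cell into its parts at the first separator."""
    return tuple(value.split(sep, maxsplit))


# Build one record per value, with one piece of information per column.
rows = [dict(zip(("strain", "genotype"), separate(v))) for v in combined]
for row in rows:
    print(row)
```

Splitting only works cleanly when the separator is used consistently, which is one more reason to keep one thing per cell from the start.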

2.6 Write Dates as YYYY-MM-DD

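DD-MM-YYYY is ambiguous (it is easily confused with MM-DD-YYYY) and does not sort chronologically as text
Read YYYY-MM-DD dates in as text when you load the data, so they are not silently reformatted
Or store the date as an integer in YYYYMMDD form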
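
The sketch below illustrates why the format matters, using only the Python standard library. The date strings are made up for the example.

```python
from datetime import date, datetime

# The same text gives two different dates depending on the assumed order.
ambiguous = "03-04-2021"
as_day_first = datetime.strptime(ambiguous, "%d-%m-%Y").date()
as_month_first = datetime.strptime(ambiguous, "%m-%d-%Y").date()
print(as_day_first, as_month_first)

# YYYY-MM-DD text sorts in chronological order without any parsing.
iso_dates = ["2021-11-02", "2020-12-31", "2021-02-15"]
parsed = [date.fromisoformat(d) for d in iso_dates]
same_order = sorted(iso_dates) == [d.isoformat() for d in sorted(parsed)]
print(same_order)

# The integer form YYYYMMDD keeps the same ordering.
as_integers = [int(d.strftime("%Y%m%d")) for d in parsed]
print(as_integers)
```

Both the text form and the integer form sort correctly, so either can serve as a safe storage format.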