2 Starting Out Tips
Data4All
2.1 Rows and columns
- Rows are for observations
- Columns are for variables
- Keep the data type the same in a column
  - Numbers (integer vs. decimal)
  - Text
- Avoid more than one table per spreadsheet
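To illustrate why a column should hold a single data type, here is a minimal base R sketch. The column names (`id`, `weight`) and the values are made up for illustration, and it assumes R 4.0 or later, where text columns are read as character rather than factor.

```r
# Hypothetical data: one text entry ("unknown") in an otherwise numeric column.
csv_text <- "id,weight
1,12.5
2,unknown
3,14"

mixed <- read.csv(text = csv_text)

# A single text entry makes R read the whole column as character,
# so arithmetic such as mean(mixed$weight) no longer works directly.
class(mixed$weight)

# Find the rows that cannot be interpreted as numbers.
bad <- is.na(suppressWarnings(as.numeric(mixed$weight)))
mixed[bad, ]
```

Running a check like this right after loading shows which cells to fix in the spreadsheet itself.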
2.2 Be Consistent
- Use consistent codes for categorical variables
- Use a consistent fixed code for any missing values
- Use consistent subject identifiers
- Use consistent date formatting
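A quick way to spot inconsistent codes in a categorical variable is to tabulate it. The sketch below uses a made-up `sex` vector, and the cleanup rule (trim, lower-case, keep the first letter) is just one possible choice, not a general recipe.

```r
# Hypothetical column in which two categories are coded six different ways.
sex <- c("M", "male", "Male", "F", "female", "F ", "M")

# Tabulating the raw values exposes every distinct spelling.
table(sex)

# One possible cleanup: trim whitespace, lower-case, keep the first letter.
sex_clean <- substr(tolower(trimws(sex)), 1, 1)
table(sex_clean)
```

It is better still to settle on the codes before data entry, so that no cleanup is needed.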
2.3 No empty cells
- If the data is missing, explicitly encode it as missing
- Avoid headers with more than one row
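The sketch below shows one reason empty cells are ambiguous when the file is loaded. The data and column names are hypothetical; `read.csv()` and its `na.strings` argument are base R.

```r
# Hypothetical data: an empty genotype cell and an explicit NA weight.
csv_text <- "subject,genotype,weight
s01,wt,21.4
s02,,19.8
s03,mut,NA"

# By default only the string "NA" is treated as missing,
# so the empty text cell is read as "" rather than NA.
raw <- read.csv(text = csv_text)
raw$genotype

# Declaring both codes as missing makes the gaps explicit.
fixed <- read.csv(text = csv_text, na.strings = c("", "NA"))
colSums(is.na(fixed))
```

Using one fixed missing-value code in the spreadsheet avoids having to guess which cells were left blank on purpose.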
2.3.1 Original
2.3.2 Better Version
2.3.3 Can we load the original?
What does it look like when we try to load?
2.3.4 Loading the Better Version
With the better version of the dataset, we can load it and group by variables.
Try changing the grouping variable from `genotype` to `strain`.
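As a rough illustration of grouping by a variable, here is a base R sketch on a small made-up data frame. Only the column names `genotype` and `strain` come from the text above; the `weight` column and all values are hypothetical stand-ins for the real dataset.

```r
# Hypothetical stand-in for the tidy version of the dataset.
mice <- data.frame(
  strain   = c("B6", "B6", "BTBR", "BTBR", "B6", "BTBR"),
  genotype = c("wt", "mut", "wt", "mut", "wt", "mut"),
  weight   = c(20, 21, 30, 28, 22, 29)
)

# Mean weight per genotype.
by_genotype <- aggregate(weight ~ genotype, data = mice, FUN = mean)
by_genotype

# Switching the grouping variable only means changing the formula.
by_strain <- aggregate(weight ~ strain, data = mice, FUN = mean)
by_strain
```

Because each variable sits in its own column, changing the grouping is a one-word edit.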
2.4 Choose Good Names for Things
- Variable names: avoid spaces, avoid starting with numbers.
- Do: Use numbers and letters, avoid special characters & symbols
- Do: use `_` instead of spaces
- Have an internal column name (`max_temp`) and a displayed name (`Maximum Temp (°C)`)
Why is this necessary?
- Spaces can be hard to deal with in variable names
- Special characters may not be accepted in R/Python
    my_data |> filter(`Maximum Temp (°C)` > 10)
- R doesn’t like column names to begin with numbers (it changes `1st_place` to `X1st_place`)
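The renaming described above can be reproduced with base R's `make.names()`, the function R uses when it repairs column names on import. The sketch below uses base R subsetting in place of `filter()`, and the example data frame is made up.

```r
# make.names() shows how R rewrites names that are not syntactically valid.
raw_names <- c("max_temp", "1st_place", "Maximum Temp")
repaired <- make.names(raw_names)
repaired
# "max_temp" "X1st_place" "Maximum.Temp"

# Keeping an awkward name (check.names = FALSE) forces backticks everywhere.
temps <- data.frame(`Maximum Temp (°C)` = c(8, 12, 15), check.names = FALSE)
warm <- temps[temps$`Maximum Temp (°C)` > 10, , drop = FALSE]
nrow(warm)
# 2
```

A plain name such as `max_temp` passes through unchanged and never needs quoting.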
For Data Scientists: `janitor::clean_names()`
`clean_names()` will remove spaces and special characters, and use snake case by default
- `Maximum Temp (°C)` will be transformed to `maximum_temp_c`
- Removes capitalization
- Removes accents and diacriticals
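A short sketch of `clean_names()` in use, assuming the janitor package is installed. The data frame and its column names are hypothetical.

```r
library(janitor)  # assumes the janitor package is installed

# Hypothetical data frame with spreadsheet-style column names.
survey <- data.frame(
  `Subject ID` = c("s01", "s02"),
  `Max Temp`   = c(12.1, 9.8),
  `1st Place`  = c(TRUE, FALSE),
  check.names  = FALSE
)

cleaned <- clean_names(survey)
names(cleaned)
# "subject_id" "max_temp" "x1st_place"
```

The cleaned names are lower-case and use underscores, and a name that began with a digit gets an `x` prefix.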
2.5 Put just one thing in a cell
- Avoid combining columns into a single column
- Avoid putting multiple bits of information in a cell
Example:
Better
Data Science Tools: `tidyr::separate()`
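As a sketch of splitting a combined cell into separate columns, the example below applies `separate()` to a made-up blood pressure column. It assumes the tidyr package is installed, and the column names are hypothetical.

```r
library(tidyr)  # assumes the tidyr package is installed

# Hypothetical data: two measurements packed into one cell.
readings <- data.frame(
  subject        = c("s01", "s02", "s03"),
  blood_pressure = c("120/80", "135/85", "110/70")
)

# Split on "/" into two columns; convert = TRUE turns the pieces into numbers.
tidy_readings <- separate(
  readings,
  blood_pressure,
  into = c("systolic", "diastolic"),
  sep = "/",
  convert = TRUE
)
tidy_readings
```

Entering the two values in separate columns from the start makes this step unnecessary.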
2.6 Write Dates as YYYY-MM-DD
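- DD-MM-YYYY is easily confused with MM-DD-YYYY (is 03-04-2021 the 3rd of April or the 4th of March?) and does not sort chronologically as text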
- Convert YYYY-MM-DD to text when you load
- Or store the date as an integer in YYYYMMDD form
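The base R sketch below illustrates these points with made-up dates: YYYY-MM-DD parses without a format string, converts cleanly to a YYYYMMDD integer, and sorts correctly as text, whereas a day-first or month-first string depends on the assumed format.

```r
# YYYY-MM-DD is the default format as.Date() expects.
d <- as.Date("2021-04-03")

# The same date as a YYYYMMDD integer.
d_int <- as.integer(format(d, "%Y%m%d"))
d_int
# 20210403

# The same text gives two different dates depending on the assumed order.
day_first   <- as.Date("03-04-2021", format = "%d-%m-%Y")  # 3 April 2021
month_first <- as.Date("03-04-2021", format = "%m-%d-%Y")  # 4 March 2021

# YYYY-MM-DD strings sort chronologically even when stored as text.
sorted <- sort(c("2021-12-01", "2021-02-15", "2020-06-30"))
sorted
```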