• Aiden Johnson

The Data Science Method (DSM): Data Collection, Organization, and Definitions

This is the second article in a series about how to take your data science projects to the next level by using a methodological approach similar to the scientific method, coined the Data Science Method. This article focuses on step two: Data Collection, Organization, and Definitions. If you missed the previous article in this series, on Problem Identification, go back and read it here.

The Data Science Method

  1. Problem Identification

  2. Data Collection, Organization, and Definitions

  3. Exploratory Data Analysis

  4. Pre-processing and Training Data Development

  5. Fit Models with Training Data Set

  6. Review Model Outcomes — Iterate over additional models as needed.

  7. Identify the Final Model

  8. Apply the Model to the Complete Data Set

  9. Review the Results — Share your findings

  10. Finalize Code and Documentation


Data collection can vary depending on the scope of the project and the data available. First, outline the data problem clearly (Step 1 in the DSM), then ask yourself whether the data you have can answer the question of interest. In some cases, writing website scrapers or scouring census and other data websites for available data can be a time-consuming but necessary task. Occasionally you may have some data provided by your client and may want to find additional data to augment the analysis. Determine the data sources required, and consider using proxy datasets if you can't find data specific to what you're looking for.

Once you have procured the datasets, collate all data sources into a single data frame for analysis. This is called building a model development dataset. It requires unifying the formatting across sources and filling gaps with NA or other thoughtful fillers. Creating intermediate variables is common practice in this step as well. The development dataset will contain all the raw features you would like to use in your exploratory data analysis and further modeling work. For example, if you have a date-time column in your data but are specifically interested in the year, you would extract the year as a new column in your dataset; similarly, you might convert your data to a continuous date-time representation by creating an epoch date-time column. Collating your data sources at this early point in the workflow allows for clean data processing and easy adjustments to methods at later stages of the work.
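As a minimal sketch of that idea in Python (the records, field names, and helper function here are hypothetical, for illustration only): two collated source records are given a uniform shape, gaps are filled with None (the NA equivalent), and intermediate variables for the year and an epoch timestamp are derived from the raw date-time column.

```python
from datetime import datetime, timezone

# Hypothetical raw records from two collated sources; missing fields
# are filled with None so every row has the same shape.
raw = [
    {"id": 1, "timestamp": "2019-03-21 14:30:00", "sales": 120.0},
    {"id": 2, "timestamp": "2020-07-04 09:15:00", "sales": None},  # gap filled with None
]

def build_development_row(record):
    """Add intermediate variables: an extracted year and an epoch timestamp."""
    dt = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    row = dict(record)
    row["year"] = dt.year  # extracted year column
    # continuous date-time as seconds since the epoch (treating times as UTC)
    row["epoch"] = dt.replace(tzinfo=timezone.utc).timestamp()
    return row

dev_dataset = [build_development_row(r) for r in raw]
print(dev_dataset[0]["year"])  # 2019
```

In practice you would do the same with a pandas DataFrame (or an R data frame, as in this series), but the principle is identical: derive the raw features you want before any exploratory analysis begins.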


Directory organization is one of those things you may not think about much when you're just starting out in data science, but as you get into multiple iterations of the same model, having a well-structured environment for your model outputs, intermediate steps, and data visualizations is paramount. Containerized approaches may reduce the need for this, and highly structured work environments may not require an additional organizational directory. The key is to keep things organized, clean, and dated and/or versioned.

Here is an example of a simple modeling directory.

The top-level folder names clearly lack creativity, but what the structure lacks in creativity it gains in simplicity. The date and time stamps provide a simple way to identify previous model iterations. Adding a more descriptive name to each modeling folder is also helpful, such as RandomForest_500 or RCN_3layers.

Here is the R function for the above file directory to get you started:

#modelpath: create a time-stamped model directory with subfolders
modelpath <- function(home, subDir) {
  #time stamp for this model run, e.g. 21_Mar_2019_14.30
  st <- format(Sys.time(), "%d_%b_%Y_%H.%M")
  setwd(home) #home is your working dir
  run_dir <- file.path(home, paste(subDir, st, sep = "_"))
  #check if directory exists, if not create it
  if (!file.exists(run_dir)) {
    dir.create(run_dir)
    dir.create(file.path(run_dir, "output"))    #create output folder
    dir.create(file.path(run_dir, "figures"))   #create figures folder
    dir.create(file.path(run_dir, "reporting")) #create reporting folder
  }
  return(run_dir)
}

As you start to develop more advanced methods you may need a more complex structure.


Data definitions are an often-neglected piece of a data science project. Sometimes they are considered a model documentation component; however, they are both a documentation and a development piece. Developing data definitions prior to model development gives the data science practitioner an at-a-glance understanding of their development dataset. The data scientist can quickly review the data dictionary as needed during the modeling process to refresh their understanding of specific components. The other benefit is in communication with the client during reviews of intermediate steps or models. When the data definitions are clear and in writing, everyone in the exchange is on the same page about what the data features represent.

The data definition should contain the following items by column:

  1. Column Name

  2. Range of Values or Codes

  3. Data Type (numeric, categorical, timestamp, etc.)

  4. Description of Column

  5. Count or percent per unique value or code

  6. Number of NA or missing values

One example of data definitions is what Kaggle provides on its dataset pages for each downloadable dataset. Below is the header of the table on students' exam performance with some important data descriptors identified.

Example of data definition — from kaggle.com

Here is another example, where a variable called ‘NETW30’ is described in detail in the sentence above the table. The table defines the column by the unique codes found in it, the description each code represents, and the count and percent of each unique code in the column. Based on the description, we can infer that this is a categorical data column representing the results of a prior statistical model, which may have multiplicative error considerations that are good to be aware of.

An example data definition table for one column called NETW30.
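A definition entry like the one above can also be generated programmatically from the raw column values. Here is a minimal Python sketch (the helper function, field names, and example codes are hypothetical) that collects the items listed earlier: column name, type, description, the codes present, count and percent per code, and the number of missing values.

```python
from collections import Counter

def data_definition(column_name, values, description, dtype):
    """Build a data-definition entry for one column: name, type,
    description, value counts/percents, and number of missing values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n = len(values)
    return {
        "column_name": column_name,
        "data_type": dtype,
        "description": description,
        "range_or_codes": sorted(counts),
        # per unique code: (count, percent of all rows)
        "count_percent": {v: (c, round(100 * c / n, 1)) for v, c in counts.items()},
        "n_missing": n - len(present),
    }

entry = data_definition(
    "NETW30",
    [1, 1, 2, None, 2, 2],  # hypothetical codes for illustration
    description="Result code from a prior statistical model",
    dtype="categorical",
)
print(entry["n_missing"])         # 1
print(entry["count_percent"][2])  # (3, 50.0)
```

Running a helper like this over every column of the development dataset produces a draft data dictionary you can then annotate with descriptions for the client.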

As you create a model development dataset, a directory structure, and data definitions, your data science project gains a professional, standardized foundation. To receive updates about the Data Science Method or data science professional development, sign up here.

#DataScience #MachineLearning #python #R #Data
