Using ML to Identify Factors and Potential Biases That Influence Career Development Course Fees

Use Cases & Projects, Dataiku Product Hui Xiang Chua

MySkillsFuture is a one-stop portal that enables Singaporeans of all ages to make informed learning and career choices so that they can develop their skills and advance their careers throughout their lives. On the portal, people can sign up for career and skills development courses offered by different training providers across a wide range of training areas (as shown in Figures 4 and 5).

We use the SSG-WSG APIs to get course information such as area of training, course title, content, objectives, training hours, fees, training provider, and feedback ratings given by participants. Beyond understanding the diversity of courses listed, we want to find out which factors influence course fees and whether any common keywords in the course content help predict them. In our analysis, we hope to answer the following questions:

  • What are the top areas of training? Who are the top training providers?
  • Are certain areas of training more costly than others?
  • Are there certain buzzwords used in the course content description?
  • Do positive feedback ratings given by participants make the course pricing more competitive?
  • Are there certain schemes that the courses are tied to that affect the training fees?

We will make use of the Natural Language Processing (NLP) capabilities within Dataiku to help us perform the machine learning (ML) modeling and analysis easily. If you’re a data scientist working with text data, this blog post might be of interest to you.

The project has the following high-level steps:

  1. Create a loop that calls the SSG-WSG API to return the course details for over 26,000 courses (each call returns at most 100 courses) and append all the data together.
  2. Simplify text fields. 
  3. Summarize course content description (to be used as a feature in modeling).
  4. Predict sentiments of the course objectives (to be used as a feature in modeling).
  5. Build ML models to identify factors that influence the training fees. 
  6. Visualize and interpret our analyses.

Figure 1 below shows the different Dataiku features that were used in this project.

Figure 1: Overview of project flow

The following code snippet shows how the data is collected using the API and saved as a Dataiku dataset.

Figure 2: Code snippet used in data collection phase
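
Since the snippet above is only available as an image, here is a minimal sketch of what such a paging loop could look like. It is not the author's code: `fetch_page` is a hypothetical stand-in for the real SSG-WSG API call, the `courseId` field is invented for the demo, and an in-memory fake replaces the network so the loop can be run as-is. Only the 100-courses-per-call limit comes from the post.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 100  # the post states each call returns at most 100 courses


def collect_all_courses(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = PAGE_SIZE,
) -> List[Dict]:
    """Call fetch_page(page, page_size) repeatedly and append the results.

    fetch_page is a hypothetical wrapper around the real API call. It is
    assumed to return a list of course records, and a short or empty list
    once the last page has been reached.
    """
    courses: List[Dict] = []
    page = 0
    while True:
        batch = fetch_page(page, page_size)
        courses.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < page_size:
            break
        page += 1
    return courses


def make_fake_api(total: int) -> Callable[[int, int], List[Dict]]:
    """Build an in-memory fake so the loop can be exercised offline."""
    records = [{"courseId": i} for i in range(total)]

    def fetch_page(page: int, page_size: int) -> List[Dict]:
        start = page * page_size
        return records[start:start + page_size]

    return fetch_page


all_courses = collect_all_courses(make_fake_api(250))
print(len(all_courses))
```

In a Dataiku Python recipe, the combined list would then typically be converted to a dataframe and written to an output dataset; that step is left out here because it depends on the project setup.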

To control for training duration, the trainingCostperHour for each course is computed as totalCostOfTrainingPerTrainee / totalTrainingDurationHour. Because this variable is skewed, log(trainingCostperHour) is used as the target variable in our model.
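
The target computation can be sketched in a few lines of Python. This is an illustration under assumptions rather than the project's actual code: the post does not say which logarithm base was used or how courses with zero cost or zero duration were handled, so the natural log and the choice to return `None` for those cases are both assumptions.

```python
import math
from typing import Optional


def log_cost_per_hour(total_cost: float, total_hours: float) -> Optional[float]:
    """Return log(total_cost / total_hours), or None when it is undefined.

    Assumption: courses with a non-positive cost or duration are skipped,
    because the log of zero (or a division by zero) is not defined.
    """
    if total_cost <= 0 or total_hours <= 0:
        return None
    return math.log(total_cost / total_hours)


# Illustrative values only: a course costing 800 over 16 training hours.
target = log_cost_per_hour(800.0, 16.0)
print(target)
```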

Figure 3: Scatterplot of total training duration against total cost of training 

The top three training areas, based on the number of courses available on the portal, are “Information and Communications,” “Business Management,” and “Engineering.”

Figure 4: Top 25 areas of training

There are a variety of training providers, such as educational institutions, government organizations, and commercial entities.

Figure 5: Top 20 training providers

After training several ML models, we found that the LightGBM model gave the best R² score among the algorithms tried.
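
For readers unfamiliar with the metric, the R² score (coefficient of determination) compares a model's squared errors with those of a baseline that always predicts the mean. The sketch below shows the standard formula in plain Python; the function name `r_squared` and the sample numbers are illustrative and are not taken from the project.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return 1 - SS_res / SS_tot for paired actual and predicted values.

    Raises ZeroDivisionError if all actual values are identical, since the
    score is undefined in that case.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]
perfect = r_squared(actual, [1.0, 2.0, 3.0])    # predictions match exactly
baseline = r_squared(actual, [2.0, 2.0, 2.0])   # always predicting the mean
print(perfect, baseline)
```

A score of 1 means the predictions match the actual values exactly, while a score of 0 means the model does no better than predicting the mean.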

Figure 6: Results of the LightGBM model returned from AutoML

Figure 7: List of top important predictor variables based on the LightGBM model. The entire list of variables influencing the outcome was exported for further analysis.

Relationship Between Training Providers and Course Fees

The most important variable influencing the training cost was the Singapore Standard Industrial Classification (SSIC) of the training provider, specifically whether the provider's SSIC code is 85302, which indicates universities. The SSIC is the national standard for classifying economic activities undertaken by economic units and is used for censuses of population, business surveys, and administrative databases. If the course is offered by a training provider that belongs to the universities classification (namely National University of Singapore, Singapore Management University, Nanyang Technological University, Singapore University of Social Sciences, Singapore University of Technology and Design), the course fees per hour will likely be higher.

On the other hand, if a course is offered by a training provider that belongs to the polytechnics classification, the course fees per hour will likely be lower than average. Specifically, if offered by the Institute of Technical Education, the course fees per hour will tend to be on the lower end. So, while the training area or course content could be similar (for example, cybersecurity), the course fees per hour will generally be higher for a course offered by the National University of Singapore than for one offered by the Institute of Technical Education.

Relationship Between Training Area/Course Content Description and Course Fees

We noticed that certain words used in the course description have a relationship with the training cost per hour. The word cloud below represents the importance of each word to the outcome (the larger the word, the more important it is) and the average training cost per hour of courses containing that word in the course description (the redder the word, the higher the average). This also reflects the jobs that are in higher demand in the market. For example, courses that cover “banking,” “finance,” “healthcare,” “technology,” “data,” “digital,” and “agile” see higher training costs per hour compared to other areas.

Figure 8: Word cloud showing the importance of words in influencing training cost per hour (blue indicates low training cost per hour while red indicates high training cost per hour)
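
The color dimension of such a word cloud, the average training cost per hour of courses whose description contains a given word, can be sketched as follows. This is a simplified illustration, not the project's pipeline: the tokenization rule (lowercase alphabetic runs), the function name, and the three sample courses are all made up for the example. The word importance that drives the word size comes from the model and is not reproduced here.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_cost_by_word(courses: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Map each word to the mean cost per hour of courses containing it.

    Each item in courses is a (description, cost_per_hour) pair.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for description, cost_per_hour in courses:
        # Use a set so a word repeated in one description counts once.
        words = set(re.findall(r"[a-z]+", description.lower()))
        for word in words:
            totals[word] += cost_per_hour
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Made-up sample data for illustration only.
sample_courses = [
    ("Agile data analytics for banking", 60.0),
    ("Data entry basics", 20.0),
    ("Basics of baking", 10.0),
]
averages = average_cost_by_word(sample_courses)
print(averages["data"], averages["banking"], averages["basics"])
```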

Relationship Between Previous Participants' Ratings and Course Fees

In addition, positive ratings from past participants seem to be associated with lower training costs per hour. One possible explanation is that participants tend to rate a lower-fee course more positively than a higher-fee one, as they are happy when they feel they benefited a lot without having to pay as much.

Figure 9: (left) Boxplot of training cost per hour of courses based on average ratings given by past participants pertaining to questions on quality of course; (right) Boxplot of training cost per hour of courses based on average ratings given by past participants pertaining to questions on training outcomes  
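
The comparison behind boxplots like these can be approximated numerically by grouping courses by rating and looking at the median cost per hour in each group. The sketch below assumes ratings have already been rounded into integer bands, which the post does not specify, and uses made-up numbers purely to show the mechanics.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_cost_by_rating(courses: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Return the median cost per hour for each rating band.

    Each item in courses is a (rating_band, cost_per_hour) pair.
    """
    groups: Dict[int, List[float]] = defaultdict(list)
    for rating, cost_per_hour in courses:
        groups[rating].append(cost_per_hour)
    return {rating: median(costs) for rating, costs in sorted(groups.items())}


# Made-up sample data for illustration only.
sample_ratings = [(3, 80.0), (3, 60.0), (4, 50.0), (4, 40.0), (4, 30.0), (5, 20.0)]
medians = median_cost_by_rating(sample_ratings)
print(medians)
```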

Relationship Between Government Schemes and Course Fees

There is no significant difference in training fees per hour between courses that are tagged under Singapore-specific government schemes, such as Pre-Employment Training (PET), Mid-Career Enhanced Subsidy (MCES), Workfare Skills Support (WSS), Enhanced Training Support for SMEs (ETSS), or Course Fee Grant, and courses that are not.

Hence, from the analysis, we are able to check if the predictor variables identified are of any concern and address them if necessary. For example, we can examine if the training fees for different areas of training can be streamlined to ensure equity, or whether we can normalize fees based on feedback ratings. 

To learn more about the various NLP capabilities within Dataiku, check out this blog post. You can also learn how to perform text cleaning easily in Dataiku here.
