The Art of Running Half-Marathons and Controlling My Cholesterol Levels (PUBLIC TABLEAU, CAUSALITY ANALYSIS)

  • a. Detailed race records, including my “official” pace per mile and total run time, for the various races organized by NYRR and NYCRUNS.
  • b. One performance measure I focus on is my pace per mile (the total time from crossing the start line to crossing the finish line, divided by the race distance). This is because the races I run vary from 5 km (approximately 3.1 miles) to the half-marathon (approximately 13.1 miles); the total running times also differ, which makes comparing total running times across races difficult.
  • c. My quarterly medical blood tests, administered by the Mount Sinai Medical Group, which provide the data on my “bad” blood cholesterol levels (LDL, in mg/dL).
  • d. My weekly weight measurements in pounds, recorded on a Tanita scale, which give me a time series of my weight.
  • 1970s: as a high school student participating in after-school track and field and cross-country, I regularly achieved a pace of 6 minutes 30 seconds per mile in multi-mile races.
  • 1980s-2000s: my “prime” in my 20s, 30s, and 40s. Unfortunately, I turned into a couch potato. I honestly do not remember undertaking any running activity lasting more than a few minutes.
  • 2010-2013: literally following in the footsteps of my wife, who was and is a running enthusiast, I started to participate in NYRR and NYCRUNS races, though not very actively. My best pace was 8 minutes 30 seconds per mile, recorded in 2013 when I ran a 4-mile NYRR race in Central Park.
  • 2013-2017: I still ran 4-mile races, but perhaps once or twice a year. During the annual physical checkups, I received yellow flags for high blood sugar and high cholesterol. I also noticed that my running performance had fallen dramatically, to a pace of only 11 to 12 minutes per mile.
  • 2018-2023: I still ran, but only in 5-km (3.1-mile) races, once or twice a year. During the annual physical checkups, I received red flags for being significantly overweight and for high blood sugar and high cholesterol levels.

During the early summer of 2024, at the suggestion of my doctors and my wife, I started to run regularly by joining a running club and by participating in the official NYRR and NYCRUNS races on a regular basis. I sought to run in almost all the official NYCRUNS races in 2024, and I have been running most of the NYRR and NYCRUNS races since January 2025. Each separate block in the chart below shows a race, and the number in the block indicates the distance of the race.

Yes, I am running longer distances than I used to.

In NYC, the shorter races in particular, those of 5 km (3.1 miles), 4 miles, and 10 km (6.2 miles), are held at one of three places:

  • Central Park (I have run 10 races there, 2018-2025)
  • Governors Island (8)
  • Prospect Park (4)

The graph breaks down the locations of the 24 races I have run over the past several years, the vast majority of them during the past year. Some of the longer races, however, cover wider areas (a borough, or several boroughs). Still, some of the toughest half-marathons I have run are held within Central Park, where the runners have to climb the break-neck Harlem Hill (350+ feet elevation) two or even three times, depending on the course setting!

During the second half of 2024, I started to notice a remarkable reduction in my weight, as well as noticeable improvements in my cholesterol levels. I thought there had to be a linkage between my improving medical trends and my renewed interest in running. At the very least, it created a solid incentive for me to continue!

While the doctor congratulated me on my renewed interest in running and my seemingly improved cholesterol and weight levels, he was somewhat more cautious about the exact causality among them. He pointed out that:

  1. my “data” suffers from a limited number of data points, and
  2. the data points come from a very short observation period.

Nevertheless, the doctor said that measuring statistical causality among these factors is simply not the relevant question: he thought I was enjoying the running, had found incentives to continue and to improve my race activities, and should focus on the total benefits.

Looking to the next few years, I want to continue to participate in as many races as possible, including half-marathons. I am even hoping that, if I run enough races this year, perhaps I could qualify for the 2026 New York City Marathon. For now, I am very happy just thinking about my chance of running the full marathon, possibly next year, because this person is so very different from who I was a year ago, somebody with a number of serious health issues.

My last chart shows two graphs in one, with separate y-axes. What I wanted to show, however, is that my best running performances come from the shorter distances of 3 to 4 miles (paler circles) rather than the half-marathons (dark navy blue circles). If I am to run a full marathon, I will definitely have to improve my performance in the longer-distance races.

Applying data equality-justice considerations to a consumer default prediction model CAN enhance overall profitability

(Random Forest Analytics Using Python Code) When developing a default prediction model for consumer loans, data scientists are required, first and foremost, to build a model that accurately predicts future defaults. The “fairness” of the model may become subordinated to its accuracy. But it does not have to be. As part of my class project for DATA 70500 at the City University of New York Graduate Center, I extended a highly accurate random forest model illustrated in an online article to create a “balanced” model. What follows is a framework in which a “reduced form” model can still make default predictions that are as good, or almost as good, as the original model, while improving the perceived fairness of the model. The result is still a highly accurate model, with improved possibilities for developing new consumer credit business.

THE ORIGINAL MODEL A well-written and easy-to-read introductory article in the online magazine Medium, “Predicting Loan Default Risk: A Hands-On Guide with Python,” illustrated the solid performance of a random forest model in predicting default from borrower profile data. Some elements of potentially discriminatory profile data, however, such as occupation, marital status, and place of residence, were included as independent variables.

Predicting Loan Default Risk: A Hands-On Guide with Python | by Eulene | Medium

“REDUCED FORM” MODEL INCORPORATING LESS “DISCRIMINATORY DATA” AND ADDITIONAL PERFORMANCE EVALUATION MEASURES The article provides step-by-step Python code to develop a random forest model that predicts the likelihood of default among a given set of borrowers in India. The author states that the primary purpose of the model is to “help the lenders make informed decisions and minimize potential losses” by “analyzing various factors like income, credit history, and economic trends.” The data is obtained from the Kaggle dataset titled “Understanding Applicant Details for Loan Approval in India,” which is described as a “rich collection of variables capturing various aspects of loan applicants, including personal, financial, and demographic information. It includes features such as age, gender, marital status, employment details, income, loan amount, loan term, credit history, and more. Additionally, it provides the target variable indicating whether the loan was approved or not, making it suitable for classification tasks.”

Applicant Details For Loan Approve
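
For readers who want a sense of the general shape of such a pipeline, here is a minimal sketch of a random forest default-prediction workflow in Python. It is not the article's exact code: the file name applicant_details.csv, the target column Loan_Default, and the split and forest settings are illustrative assumptions that would need to be matched to the actual Kaggle file.

```python
# Minimal sketch of a random forest default-prediction pipeline,
# broadly in the spirit of the Medium article (not its exact code).
# The file name, target column, and settings below are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the applicant-level Kaggle data (placeholder file name)
df = pd.read_csv("applicant_details.csv")

# Separate the target (assumed: 1 = default, 0 = performing)
y = df["Loan_Default"]                      # placeholder target column
X = df.drop(columns=["Loan_Default"])

# One-hot encode categorical profile fields (gender, marital status,
# occupation, residence, etc.) so the random forest can use them
X = pd.get_dummies(X, drop_first=True)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the random forest and score it with the area under the ROC curve
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {auc:.4f}")
```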

My question at this stage is: how important are the “discriminatory” factors in predicting the likelihood of default? I tried this exercise by removing Column 8 (Occupation) and Column 5 (Marital Status) from the original dataset and running the same model, with the set of independent variables reduced by two. The rationale for eliminating marital status and occupation stems from my time as a young bank trainee about 45 years ago, when my instructors held biases against mortgage borrowers based on their marital status and the industries they worked in. I believe such obvious biases are no longer accepted, but the Kaggle data on consumer loans suggests that they have not completely disappeared.
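
The sketch below outlines this reduced-form experiment under the same assumptions as the baseline sketch above: the column labels Occupation and Marital_Status are stand-ins for Columns 8 and 5 of the Kaggle file, and the confusion-matrix cells are expressed as shares of the test set, which is how the comparison table further down reports the results.

```python
# Sketch of the "reduced form" experiment: drop the two potentially
# discriminatory columns and retrain the same random forest.
# Column labels are assumed stand-ins for Columns 8 (Occupation)
# and 5 (Marital Status) of the original dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

df = pd.read_csv("applicant_details.csv")      # placeholder file name
y = df["Loan_Default"]                         # placeholder target column

# Remove the target plus the two "discriminatory" variables before encoding
X_reduced = df.drop(columns=["Loan_Default", "Occupation", "Marital_Status"])
X_reduced = pd.get_dummies(X_reduced, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Area under the ROC curve for the reduced-form model
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Reduced-form AUC: {auc:.4f}")

# Each confusion-matrix cell as a share of all test cases,
# matching how the comparison table reports the results
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
total = tn + fp + fn + tp
print(f"Correctly predicting loan default:            {tp / total:.2%}")
print(f"Correctly predicting loan performing:         {tn / total:.2%}")
print(f"Predicting default but actually performing:   {fp / total:.2%}")
print(f"Predicting performing but actually defaulted: {fn / total:.2%}")
```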

The revised model outcomes, when compared with the original model outputs, showed that:

  1. The models had essentially the same levels of accuracy in predicting defaults.
  2. The new model had a marginally higher rate of labeling customers as “performing” who actually defaulted.
  3. The new model was marginally better at extending loans to somewhat higher-risk customers who nonetheless “performed.” (Please refer to the table below.)
| Model | Area Under the Curve (AUC) | Correctly Predicting Loan Default | Correctly Predicting Loan Performing | Predicting Default but Actually Performing | Predicting Performing but Actually Defaulted |
| --- | --- | --- | --- | --- | --- |
| Original Model | 97.66% | 9.09% | 84.66% | 2.55% | 3.69% |
| Model with Marital Status and Occupation removed | 97.66% | 8.76% | 85.06% | 2.18% | 4.01% |

The code for this updated model can be found at the link below.

https://colab.research.google.com/drive/1WVgu7F4U6HW23hNCCz6vq3vE0JO-VOuA?usp=sharing

So, in conclusion, we could strive to reduce discrimination while still achieving statistical excellence and strong business returns.

This experiment has demonstrated a clear potential for reducing the use of discriminatory data in default prediction models without significantly affecting their precision and reliability. It illustrates that banks and other lending organizations can review their consumer loan default prediction models in order to act more socially responsibly. Based on the example presented above, the negative impact of removing certain independent variables from these models may prove to be quite limited.