The Art of Running Half-Marathons and Controlling My Cholesterol Levels (TABLEAU PUBLIC, CAUSALITY ANALYSIS)

  • a. Detailed race records, including my “official” pace times per mile and total run times, for the various races organized by NYRR and NYCRUNS.
  • b. One performance metric I focus on is how fast I run a mile: the total time between crossing the start line and the finish line, divided by the race distance. This is because the races I run vary from 5 km (approximately 3.1 miles) to the half-marathon (approximately 13.1 miles); total running times differ accordingly, making comparisons of total times amongst the races difficult.
  • c. My quarterly medical blood tests, administered by Mount Sinai Medical Group, which provide the data on my “bad” cholesterol levels (LDL, in mg/dl).
  • d. My weekly weight measurements in pounds, recorded with a Tanita scale, which give me a time series of my weight levels.
  • 1970s: as a high school student participating in after-school track and field and cross-country, I regularly achieved a pace of 6 minutes and 30 seconds per mile in multi-mile races.
  • 1980s-2000s: my “prime,” in my 20s, 30s, and 40s. Unfortunately, I turned into a couch potato. I honestly do not remember undertaking any running activity lasting more than a few minutes.
  • 2010-2013: literally following in the footsteps of my wife, who was and is a running enthusiast, I started to participate in NYRR and NYCRUNS races, but not that actively. My best pace per mile was 8 minutes 30 seconds, recorded in 2013 when I ran a 4-mile NYRR race in Central Park.
  • 2013-2017: I still ran 4-mile races, but perhaps once or twice a year. During annual physical checkups, I received yellow flags for high blood sugar and high cholesterol. I also noticed that my running performance had fallen dramatically, to a pace of only 11 to 12 minutes a mile.
  • 2018-2023: I still ran, but only in 5-km (3.1-mile) races, once or twice a year. During annual physical checkups, I received red flags for significant overweight, high blood sugar, and high cholesterol levels.
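The pace metric described in (b) is simple arithmetic; here is a minimal sketch in Python (the function name and example numbers are my own illustrations, not actual race records):

```python
def pace_per_mile(total_minutes: float, distance_miles: float) -> str:
    """Average pace: total race time divided by race distance, as 'M:SS per mile'."""
    total_seconds = round(total_minutes * 60 / distance_miles)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d} per mile"

# e.g. a half-marathon (13.1 miles) finished in 131 minutes
print(pace_per_mile(131.0, 13.1))  # 10:00 per mile
```

Because pace normalizes by distance, a 5-km result and a half-marathon result become directly comparable.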

During the early summer of 2024, at the suggestion of my doctors and my wife, I started to run regularly by joining a running club and by participating in official NYRR and NYCRUNS races on a regular basis. I sought to run in almost all the official NYCRUNS races in 2024, and I have been running most of the NYRR and NYCRUNS races since January 2025. Each separate block in the chart below shows a race, and the number in the block indicates the distance of the race.

Yes, I am running longer distances than I used to.

In NYC, the shorter races in particular, those of 5 km (3.1 miles), 4 miles, and 10 km (6.2 miles), are held at one of three places:

  • Central Park (I ran 10 races there, 2018-2025)
  • Governors Island (8)
  • Prospect Park (4)

The graph breaks down the locations of the 24 races I have run over the past several years, the vast majority during the past year. Some of the longer races, however, cover wider regions (a borough, or several boroughs). Some of the toughest half-marathons I have run are held within Central Park, where runners have to climb the break-neck Harlem Hill (350+ feet elevation) twice, or even three times, depending on the course setting!

During the second half of 2024, I started to notice a remarkable reduction in my weight, as well as noticeable improvements in my cholesterol levels. I thought there had to be a linkage between these improving medical trends and my renewed interest in running. At the very least, it created a solid incentive for me to continue!

While the doctor congratulated me on my renewed interest in running and on the seemingly improved cholesterol and weight levels, he was somewhat more cautious about the exact causality amongst them. He pointed out that:

  1. my “data” consists of a limited number of data points, and
  2. those data points come from a very short observation period.

Nevertheless, the doctor said that measuring statistical causality amongst these factors is simply not that relevant: he thought I was enjoying the running and had found incentives to continue and to improve my race activities, and that I should focus on the total benefits.

Looking to the next few years, I want to continue to participate in as many races as possible, including half-marathons. I am even hoping that, if I run enough races this year, I could perhaps qualify for the 2026 New York City Marathon. For now, I am very happy just thinking about my chance of running the full marathon, possibly next year, because this person is so very different from who I was a year ago: somebody with a number of serious health issues.

My last chart shows two graphs in one, with separate y-axes. What I wanted to show, however, is that my best running performances come from the shorter distances (paler circles) of 3 to 4 miles, versus the half-marathons (dark navy blue circles). If I were to run a full marathon, I would definitely have to improve my performance in the longer-distance races.

Applying data equality-justice considerations to a consumer default prediction model CAN enhance overall profitability

(Random Forest Analytics Using Python) When developing a default prediction model for consumer loans, a data scientist is required, first and foremost, to develop a model that accurately predicts future defaults. “Fairness” of the model may become subordinated to accuracy. But it does not have to be. As part of my class project for DATA 70500 at the City University of New York Graduate Center, I extended a highly accurate random forest model illustrated by an online article to create a “balanced” model. What follows is a framework in which a “reduced form” model can still make default predictions that are as good, or almost as good, as the original model, while successfully improving the perceived fairness of the model. The result is still a high-accuracy model, with improved possibilities for developing new consumer credit businesses.

THE ORIGINAL MODEL A well-written and easy-to-read introductory article in the online magazine Medium, “Predicting Loan Default Risk: A Hands-On Guide with Python,” illustrated the solid performance of a random forest model in predicting default from borrower profile data. Some elements of “potentially discriminatory” profile data, however, such as job description, marital status, and location of residence, were included as independent variables.

Predicting Loan Default Risk: A Hands-On Guide with Python | by Eulene | Medium

“REDUCED FORM” MODEL INCORPORATING LESS “DISCRIMINATORY” DATA AND ADDITIONAL PERFORMANCE EVALUATION MEASURES The article provides step-by-step Python code to develop a random forest model that predicts the likelihood of default amongst a given set of borrowers in India. The author states that the primary purpose of the model is to “help the lenders make informed decisions and minimize potential losses” by “analyzing various factors like income, credit history, and economic trends.” The data is obtained from Kaggle Datasets, titled “Understanding Applicant Details for Loan Approval in India,” and is described as a “rich collection of variables capturing various aspects of loan applicants, including personal, financial, and demographic information. It includes features such as age, gender, marital status, employment details, income, loan amount, loan term, credit history, and more. Additionally, it provides the target variable indicating whether the loan was approved or not, making it suitable for classification tasks.”

Applicant Details For Loan Approve

My question at this stage is: how important are the “discriminatory” factors in predicting the likelihood of defaults? I tried this exercise by removing Columns 8 (Occupation) and 5 (Marital Status) from the original dataset and running the same model, with the independent variables reduced by two. The rationale for eliminating marital status and occupation stems from when I was a young bank trainee about 45 years ago; my instructors had biases against mortgage borrowers based on their marital status and the industries they worked in. I believe these obvious biases are no longer accepted, but the Kaggle data on consumer loans suggests that they have not completely disappeared.
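The column-removal experiment can be sketched as follows. This is not the article's actual code or data: the DataFrame below is a synthetic stand-in for the Kaggle dataset, and the column names (`marital_status`, `occupation`, and so on) are illustrative assumptions. The point is the mechanics: train the same random forest twice, once with and once without the two sensitive columns, and compare the AUC and the confusion matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle loan data (column names are assumptions)
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(21, 65, n),
    "income": rng.normal(50_000, 15_000, n),
    "credit_score": rng.normal(650, 80, n),
    "marital_status": rng.integers(0, 2, n),  # sensitive column to drop
    "occupation": rng.integers(0, 5, n),      # sensitive column to drop
})
# Synthetic default flag driven mainly by credit score and income (logistic link)
p = 1 / (1 + np.exp((df["credit_score"] - 600) / 40
                    + (df["income"] - 50_000) / 30_000))
df["default"] = rng.random(n) < p

def evaluate(features):
    """Fit a random forest on the given features; return the AUC and the
    confusion matrix expressed as shares of all test observations."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df["default"], test_size=0.3, random_state=42)
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    cm = confusion_matrix(y_te, model.predict(X_te), normalize="all")
    return auc, cm

full_auc, full_cm = evaluate(
    ["age", "income", "credit_score", "marital_status", "occupation"])
reduced_auc, reduced_cm = evaluate(["age", "income", "credit_score"])
print(f"Full model AUC:    {full_auc:.4f}")
print(f"Reduced model AUC: {reduced_auc:.4f}")
```

With the real data, the only change would be reading the Kaggle CSV instead of generating rows; the evaluation logic stays the same.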

The revised model outcomes, when compared with the original model outputs, showed that:

  1. The two models had essentially the same levels of accuracy in predicting defaults.
  2. The new model was marginally more likely to predict customers as “performing” who actually defaulted.
  3. The new model was marginally better at extending loans to somewhat higher-risk customers who nonetheless “performed.” (Please refer to the table below.)
| Model | AUC (Area Under the Curve) | Correctly Predicting Loan Default | Correctly Predicting Loan Performing | Predicting Default but Actually Performing | Predicting Performing but Actually Defaulted |
| --- | --- | --- | --- | --- | --- |
| Original Model | 97.66% | 9.09% | 84.66% | 2.55% | 3.69% |
| Model with Marital Status and Job Description removed | 97.66% | 8.76% | 85.06% | 2.18% | 4.01% |

(The four confusion-matrix columns are percentages of all test observations.)

The code for this updated model can be found at the link below.

https://colab.research.google.com/drive/1WVgu7F4U6HW23hNCCz6vq3vE0JO-VOuA?usp=sharing

So, in conclusion, we could strive to reduce discrimination while still achieving statistical excellence and strong business returns.

This experiment has demonstrated a clear potential for removing discriminatory data from default prediction models without significantly affecting their precision and reliability. It illustrates that banks and other lending organizations can review their consumer loan default prediction models to act more socially responsibly. The negative impact of removing some independent variables may prove to be quite limited, as the example above shows.

(Lack of) Heat Complaints in NYC: Borough Focused Analysis from Oct ’24 to Feb ’25: PART 1

NYC OpenData makes public data generated by various New York City agencies and other City organizations available for public use. NYC 311 Data is a subset of NYC OpenData that collects all 311 Service Requests from 2010 to the present. It is a huge data set: in 2022, for example, the number of 311 customer interactions is said to have exceeded 35 million!

Here is the first Tableau Public chart, which shows the number of heat (or lack thereof) complaints to 311 by borough during the months of October 2024 through February 2025. The no-heat complaints during these five months exceeded 250,000.
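For readers who want to reproduce the tally behind the chart, here is a toy sketch of the borough-and-month aggregation. The column names follow the NYC 311 Open Data schema (`created_date`, `complaint_type`, `borough`), but the five rows are fabricated for illustration; the real extract would come from NYC OpenData.

```python
import pandas as pd

# Fabricated miniature of a 311 extract (column names per the 311 schema)
data = pd.DataFrame({
    "created_date": pd.to_datetime(
        ["2024-10-05", "2024-12-14", "2025-01-20", "2025-01-21", "2025-02-02"]),
    "complaint_type": ["HEAT/HOT WATER"] * 5,
    "borough": ["BRONX", "BRONX", "BRONX", "BROOKLYN", "QUEENS"],
})

# Restrict to the Oct 2024 - Feb 2025 window, then count per borough and month
window = data[(data["created_date"] >= "2024-10-01") &
              (data["created_date"] < "2025-03-01")]
counts = (window
          .assign(month=window["created_date"].dt.to_period("M"))
          .groupby(["borough", "month"])
          .size()
          .rename("complaints"))
print(counts)
```

The same groupby, run against the full 311 extract, produces the borough-by-month series plotted in the chart.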

To illustrate the impact of ambient temperatures on the complaints, the average day and night temperatures for NYC during these five months are also shown.

            <script type='text/javascript'>                    var divElement = document.getElementById('viz1742249718023');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

It is no surprise that the complaints rise as the average temperatures fall. What is less intuitive, however, is that the number of complaints appears to rise most, and most noticeably, in the Bronx. The trends become clearer when we look at the zip-code-specific no-heat complaints for the months of October 2024 and January 2025.

What is remarkable between the two charts below (October 2024 complaints and January 2025 complaints, both using the same color scale) is that certain zip codes show a dramatic rise in no-heat complaints during the “peak/coldest” month of January.

The seriousness of the problem in several zip code areas in the Bronx, and in one or two zip code areas in Brooklyn, is notable.

The chart below compares the per-household incidence of no-heat complaints for October 2024 and January 2025 for each borough. Here again, the seriousness of the no-heat challenge in the Bronx is easy to see, especially during the peak winter month of January, when one in 20 households in the Bronx appears to call 311 about the lack of heat, while the number is one in 50 or fewer for the other boroughs.
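The “one in 20 versus one in 50” comparison is simply a ratio of complaints to households. Here is a sketch with placeholder round numbers, chosen only to reproduce those ratios; they are not actual Census or 311 figures.

```python
# Placeholder counts (NOT real data) illustrating the per-household ratio
complaints = {"Bronx": 25_000, "Brooklyn": 20_000}
households = {"Bronx": 500_000, "Brooklyn": 1_000_000}

for boro in complaints:
    rate = complaints[boro] / households[boro]
    print(f"{boro}: 1 complaint per {round(1 / rate)} households")
# Bronx: 1 complaint per 20 households
# Brooklyn: 1 complaint per 50 households
```

Normalizing by household count is what makes the Bronx stand out: its raw complaint total is only modestly higher, but its household base is much smaller.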

Why is the Bronx showing much more serious no-heat problems for its residents? I plan to address this question in the near future.

            <script type='text/javascript'>                    var divElement = document.getElementById('viz1742317678071');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Hello world! …and How Do You Do, Everybody! こんにちは、はじめまして、みんな!

Since getting my MBA way back, I have been working for financial services companies, primarily as a risk analyst of sorts. After 45+ years of ups and downs, and of experiencing increasingly complex financial market risk stresses, I woke up one day and suddenly realized I still did not know what I wanted to do when I grew up.

The fact that I have lasted so many years in this intensely competitive environment means I have done OK throughout these years, but I have also recognized that some elements of my work are much more fun, and thus more enjoyable, than others.

Looking forward, I would like to focus more on the elements of risk and financial analysis that I think would be “fun and hopefully life-enhancing,” as opposed to “accurate, short-term optimizing and high-performance.” And here, I would like to take advantage of the new toys (statistical and data visualization tools and methodologies) that I am learning as part of earning an MS degree in Data Visualization and Analysis at the CUNY Graduate Center.

I would like to explore what I mean by fun and enriching risk and economic assessments throughout this blog!