Reading Notes 2022 May - Jun


This is a summary of the great Medium posts Elise and I read in May and June. Please enjoy :)

Causal Inference

A Survey of Causal Inference Applications at Netflix: Introduces several real-world causal inference examples at Netflix
psmpy: Propensity Score Matching in Python — And Why It’s Needed: What is propensity score matching and how to achieve it with psmpy
Causal Inference with Linear Regression: Endogeneity: Sources of endogeneity and how to handle the problem
Using Back-Door Adjustment Causal Analysis to Measure Pre-Post Effects: How to use back-door adjustment to measure pre-post effects
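
To make the back-door idea from the last post concrete, here is a minimal sketch of the standard back-door adjustment formula, E[Y | do(X=x)] = sum over z of P(Z=z) * E[Y | X=x, Z=z], for a single binary confounder. The numbers in `cells` are made up purely for illustration, and the function names are my own. This is not the method or data from the linked post.

```python
# Hypothetical aggregated data, invented for illustration only:
# (confounder z, treatment x) -> (number of rows, mean outcome)
cells = {
    (0, 0): (80, 10.0),
    (0, 1): (20, 12.0),
    (1, 0): (20, 20.0),
    (1, 1): (80, 22.0),
}


def naive_mean(x):
    """Mean outcome for treatment arm x, ignoring the confounder."""
    n = sum(cnt for (_, xx), (cnt, _m) in cells.items() if xx == x)
    total = sum(cnt * m for (_, xx), (cnt, m) in cells.items() if xx == x)
    return total / n


def adjusted_mean(x):
    """Back-door adjusted mean: weight each stratum's mean by P(Z=z)."""
    n_all = sum(cnt for cnt, _m in cells.values())
    result = 0.0
    for z in sorted({z for z, _ in cells}):
        n_z = sum(cnt for (zz, _), (cnt, _m) in cells.items() if zz == z)
        result += (n_z / n_all) * cells[(z, x)][1]
    return result


naive_effect = naive_mean(1) - naive_mean(0)
adjusted_effect = adjusted_mean(1) - adjusted_mean(0)
print("naive effect:   ", naive_effect)
print("adjusted effect:", adjusted_effect)
```

In this toy setup the treated group is concentrated in the stratum with higher outcomes, so the naive comparison overstates the effect, while the adjusted estimate recovers the within-stratum difference of 2. The sketch assumes Z is the only confounder and that every (z, x) cell has data.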

Machine Learning

What Data Scientists Keep Missing About Imbalanced Datasets: Explains why imbalanced datasets exist, why they are a problem, and how to handle them
SHAP Plots: The Crystal Ball for UI Test Ideas: How to interpret SHAP plots and how the Indeed team gets ideas from them
XGBoost for Time Series Extrapolation: You’re Gonna Need a Bigger Boat: XGBoost cannot extrapolate, so it does not work well for time series extrapolation
XGBoost for Time Series: lightGBM is a Bigger Boat!: A follow-up to the previous post that explains why LightGBM could help in this case
The Problem with Gradient Boosting (Gradient Boosted Gremlins): Why gradient-boosted trees can predict values outside the target range, and how to fix it
Visually understand XGBoost, LightGBM and CatBoost Regularization Parameters: This post lists important hyperparameters used in the tree-based models
The Wrong and Right Way to Approximate Area Under Precision-Recall Curve (AUPRC): Different ways to calculate AUPRC and which is better
Precision-Recall Curve is More Informative than ROC in Imbalanced Data: Napkin Math & More: A good walk-through of why the PR curve is more informative than the ROC curve when the data is imbalanced
Making Sense of Bias and Variance!: Very easy-to-understand article explaining bias and variance
CatBoost vs. LightGBM vs. XGBoost: A clear comparison of the three most well-known gradient boosting tree models
Improve Your Regression Model Using 5 Tips That No One Talks About: Five tips to diagnose and improve the regression model
Graph Machine Learning at Airbnb: Real examples of how Airbnb uses graph machine learning techniques
Building Credit Rating Systems With Scarce Data: This article introduces how we handle the scarce data issue for credit risk evaluation at Brex
How We Built a Probability of Default Model without Default Labels: How we stabilize credit limits at Brex using balance forecasting
Which Models Require Normalized Data?: Why normalization matters and which models need it
Four Mistakes in Clustering You Should Avoid: Common mistakes when building clustering models and how to avoid them
Best Practices for Visualizing Your Cluster Results: Great techniques to visualize cluster results with code examples
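
As a quick illustration of the precision-recall vs. ROC point from the list above, here is a bit of napkin math in Python. It is my own sketch with made-up class counts and rates, not code from the linked post: the same classifier operating point (same TPR and FPR, so the same spot on the ROC curve) gives very different precision once negatives heavily outnumber positives.

```python
def operating_point(n_pos, n_neg, tpr, fpr):
    """Turn class counts and (TPR, FPR) into the metrics each curve plots."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    precision = tp / (tp + fp)
    return {"tpr": tpr, "fpr": fpr, "recall": tpr, "precision": precision}


# Hypothetical class counts and rates, chosen only for illustration.
balanced = operating_point(n_pos=1000, n_neg=1000, tpr=0.8, fpr=0.05)
imbalanced = operating_point(n_pos=1000, n_neg=99000, tpr=0.8, fpr=0.05)

for name, point in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(
        f"{name:>10}: ROC point = ({point['fpr']}, {point['tpr']}), "
        f"precision = {point['precision']:.3f}"
    )
```

The ROC point is identical in both cases, but precision drops from about 0.941 to about 0.139 in the imbalanced case, which is the kind of gap the PR curve shows and the ROC curve hides.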

Analytics

How to Measure the ROI of Your Data Team?: Various perspectives and metrics to measure a data team’s ROI
The Downside of Data-Driven Decision Making: Why data-driven decision making can sometimes be a problem, with a practical business example
Marketing strategy — How to Go Beyond Propensity Models: The problem with using propensity models to predict churn/adoption and how we should rephrase the business questions and update the data science solution


Yu Dong
