Reading Notes 2022 May - Jun
This is a summary of the great Medium posts Elise and I read in May and June. Please enjoy :)
Causal Inference
- A Survey of Causal Inference Applications at Netflix: Introduces several real-world applications of causal inference at Netflix
- psmpy: Propensity Score Matching in Python — And Why It’s Needed: What propensity score matching is and how to do it with psmpy
- Causal Inference with Linear Regression: Endogeneity: Sources of endogeneity and how to handle the problem
- Using Back-Door Adjustment Causal Analysis to Measure Pre-Post Effects: How to use back-door adjustment to measure pre-post effects
Machine Learning
- What Data Scientists Keep Missing About Imbalanced Datasets: Explains why imbalanced datasets exist, why they are a problem, and how to handle them
- SHAP Plots: The Crystal Ball for UI Test Ideas: How to interpret SHAP plots and how the Indeed team gets test ideas from them
- XGBoost for Time Series Extrapolation: You’re Gonna Need a Bigger Boat: XGBoost cannot extrapolate, so it does not work well for time series extrapolation
- XGBoost for Time Series: lightGBM is a Bigger Boat!: A follow-up to the previous post that explains why LightGBM could help in this case
- The Problem with Gradient Boosting (Gradient Boosted Gremlins): Why gradient-boosted trees can predict values outside the target range, and how to fix it
- Visually understand XGBoost, LightGBM and CatBoost Regularization Parameters: Lists the important hyperparameters of these tree-based models
- The Wrong and Right Way to Approximate Area Under Precision-Recall Curve (AUPRC): Different ways to calculate AUPRC and which is better
- Precision-Recall Curve is More Informative than ROC in Imbalanced Data: Napkin Math & More: A good walk-through of why the PR curve is more informative than the ROC curve when the data is imbalanced
- Making Sense of Bias and Variance!: A very easy-to-understand article explaining bias and variance
- CatBoost vs. LightGBM vs. XGBoost: A clear comparison of the three most well-known gradient boosting tree models
- Improve Your Regression Model Using 5 Tips That No One Talks About: Five tips to diagnose and improve a regression model
- Graph Machine Learning at Airbnb: Real examples of how Airbnb uses graph machine learning techniques
- Building Credit Rating Systems With Scarce Data: How we handle scarce data in credit risk evaluation at Brex
- How We Built a Probability of Default Model without Default Labels: How we stabilize credit limits at Brex using balance forecasting
- Which Models Require Normalized Data?: Why normalization matters and which models need it
- Four Mistakes in Clustering You Should Avoid: Common mistakes when building clustering models and how to avoid them
- Best Practices for Visualizing Your Cluster Results: Great techniques to visualize cluster results with code examples
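
To make the extrapolation point from the two XGBoost time series posts concrete, here is a minimal sketch. It is not XGBoost itself: fit_tree and predict are hypothetical helpers written for this note, and the training data is made up. It only shows that a regression tree predicts a constant value per leaf, so anything beyond the training range of the feature gets the same prediction as the last leaf. A boosted ensemble sums the leaf values of many such trees, so its output is also flat beyond the training range.

```python
def fit_tree(points, depth):
    """Fit a tiny 1-D regression tree on (x, y) pairs by minimizing squared error."""
    points = sorted(points)
    ys = [y for _, y in points]
    if depth == 0 or len(points) < 2:
        return sum(ys) / len(ys)  # leaf: predict the mean target

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    # Try every split position and keep the one with the lowest total error.
    split = min(range(1, len(points)), key=lambda i: sse(ys[:i]) + sse(ys[i:]))
    threshold = (points[split - 1][0] + points[split][0]) / 2
    return (
        threshold,
        fit_tree(points[:split], depth - 1),
        fit_tree(points[split:], depth - 1),
    )


def predict(node, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x < threshold else right
    return node


# Hypothetical training data: a clean upward trend, y = 2x for x = 0..19.
train = [(x, 2.0 * x) for x in range(20)]
tree = fit_tree(train, depth=3)

# x = 19 is the last training point; 50 and 1000 lie far beyond it.
for x in (19, 50, 1000):
    print(x, predict(tree, x))
```

All three printed predictions are identical and never exceed the largest target seen in training, even though the true trend keeps rising.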
Analytics
- How to Measure the ROI of Your Data Team?: Various perspectives and metrics for measuring a data team’s ROI
- The Downside of Data-Driven Decision Making: Why data-driven decision making can sometimes be a problem, illustrated with a practical business example
- Marketing strategy — How to Go Beyond Propensity Models: The problem with using propensity models to predict churn or adoption, and how to reframe the business question and update the data science solution