This is a summary of the great Medium posts Elise and I read in May and June. Please enjoy :)

Causal Inference

  1. A Survey of Causal Inference Applications at Netflix: Introduces several real-world examples of causal inference at Netflix
  2. psmpy: Propensity Score Matching in Python — And Why It’s Needed: What is propensity score matching and how to achieve it with psmpy
  3. Causal Inference with Linear Regression: Endogeneity: Sources of endogeneity and how to handle the problem
  4. Using Back-Door Adjustment Causal Analysis to Measure Pre-Post Effects: Back-Door Adjustment techniques to measure pre-post effects
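
As a rough illustration of the back-door adjustment idea from item 4 (this is not code from the linked post), here is a minimal sketch. It assumes a single binary confounder `z`, a binary treatment `t`, and made-up counts. The adjusted rate averages the per-stratum outcome rates weighted by how common each stratum is, instead of pooling everything together.

```python
# Hypothetical counts, invented for illustration only:
# (confounder z, treatment t) -> (number of units, number with outcome = 1)
counts = {
    (0, 1): (100, 20),
    (0, 0): (400, 60),
    (1, 1): (400, 280),
    (1, 0): (100, 65),
}


def naive_rate(t):
    """Outcome rate for treatment t, pooling all strata of z together."""
    n = sum(c[0] for (_, tt), c in counts.items() if tt == t)
    y = sum(c[1] for (_, tt), c in counts.items() if tt == t)
    return y / n


def adjusted_rate(t):
    """Back-door adjustment: sum over z of P(Y=1 | T=t, Z=z) * P(Z=z)."""
    total = sum(n for n, _ in counts.values())
    rate = 0.0
    for z in sorted({z for z, _ in counts}):
        # Share of all units that fall in stratum z
        p_z = sum(n for (zz, _), (n, _) in counts.items() if zz == z) / total
        n_zt, y_zt = counts[(z, t)]
        rate += p_z * (y_zt / n_zt)
    return rate


naive_effect = naive_rate(1) - naive_rate(0)
adjusted_effect = adjusted_rate(1) - adjusted_rate(0)
print(f"naive effect:    {naive_effect:.2f}")
print(f"adjusted effect: {adjusted_effect:.2f}")
```

With these invented numbers the naive comparison overstates the effect, because treated units are concentrated in the stratum that has a higher outcome rate regardless of treatment.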

Machine Learning

  1. What Data Scientists Keep Missing About Imbalanced Datasets: Explains why imbalanced datasets exist, why they are a problem, and how to handle them
  2. SHAP Plots: The Crystal Ball for UI Test Ideas: How to interpret SHAP plots and how the Indeed team gets ideas from them
  3. XGBoost for Time Series Extrapolation: You’re Gonna Need a Bigger Boat: XGBoost cannot extrapolate, so it does not work well for time series extrapolation
  4. XGBoost for Time Series: lightGBM is a Bigger Boat!: A follow-up to the previous post that explains why LightGBM could help in this case
  5. The Problem with Gradient Boosting (Gradient Boosted Gremlins): Why gradient-boosted trees can predict values outside the target range, and what to do about it
  6. Visually understand XGBoost, LightGBM and CatBoost Regularization Parameters: Lists important hyperparameters used in these tree-based models
  7. The Wrong and Right Way to Approximate Area Under Precision-Recall Curve (AUPRC): Different ways to calculate AUPRC and which is better
  8. Precision-Recall Curve is More Informative than ROC in Imbalanced Data: Napkin Math & More: A good walk-through of why the PR curve is more informative than the ROC curve when the data is imbalanced
  9. Making Sense of Bias and Variance!: Very easy-to-understand article explaining bias and variance
  10. CatBoost vs. LightGBM vs. XGBoost: A clear comparison of the three most well-known gradient boosting tree models
  11. Improve Your Regression Model Using 5 Tips That No One Talks About: Five tips to diagnose and improve a regression model
  12. Graph Machine Learning at Airbnb: Real examples of how Airbnb uses graph machine learning techniques
  13. Building Credit Rating Systems With Scarce Data: How we handle scarce data for credit risk evaluation at Brex
  14. How We Built a Probability of Default Model without Default Labels: How we stabilize credit limits at Brex using balance forecasting
  15. Which Models Require Normalized Data?: Why normalization matters and which models need it
  16. Four Mistakes in Clustering You Should Avoid: Common mistakes when building clustering models and how to avoid them
  17. Best Practices for Visualizing Your Cluster Results: Great techniques to visualize cluster results with code examples
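
To make the precision-recall versus ROC point from item 8 concrete, here is a small napkin-math sketch of my own (not taken from the linked post). The class sizes and the classifier's operating point are hypothetical. The same true positive rate and false positive rate, which is a single point on the ROC curve, give very different precision once negatives heavily outnumber positives.

```python
def operating_point(n_pos, n_neg, tpr, fpr):
    """Turn an ROC operating point into precision for given class sizes."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return {"tpr": tpr, "fpr": fpr, "precision": tp / (tp + fp)}


# Hypothetical classifier: catches 80% of positives, flags 5% of negatives
balanced = operating_point(n_pos=1_000, n_neg=1_000, tpr=0.8, fpr=0.05)
imbalanced = operating_point(n_pos=1_000, n_neg=100_000, tpr=0.8, fpr=0.05)

print(f"balanced precision:   {balanced['precision']:.3f}")
print(f"imbalanced precision: {imbalanced['precision']:.3f}")
```

The ROC point is identical in both cases, but in the imbalanced case most flagged examples are false positives, which only the precision number reveals.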

Analytics

  1. How to Measure the ROI of Your Data Team?: Various perspectives and metrics to measure a data team’s ROI
  2. The Downside of Data-Driven Decision Making: Why data-driven decision making can sometimes be a problem, with a practical business example
  3. Marketing strategy — How to Go Beyond Propensity Models: The problem with using propensity models to predict churn/adoption and how we should rephrase the business questions and update the data science solution