Reading Notes 2022 Jul - Aug


This is a summary of the great Medium posts Elise and I read in July and August. I have been working on a customer segmentation project at work, so you will find a lot of clustering-related posts this time. Please enjoy :)

Machine Learning

When Clustering Doesn’t Make Sense: Five simple guidelines to see if clustering is really an appropriate solution for your data
DBSCAN Python Example: The Optimal Value For Epsilon (EPS): An approach to finding the optimal epsilon for the DBSCAN algorithm
Using Non-negative Matrix Factorization to Classify Companies: A case study on using NMF for clustering
Handling Outliers in Clusters using Silhouette Analysis: How the silhouette coefficient is calculated, and how it can help determine the optimal number of clusters and identify outliers
Silhouette Method — Better than Elbow Method to find Optimal Clusters: Another post on how to use the silhouette method to find the optimal number of clusters
Cluster Analysis: Create, Visualize and Interpret Customer Segments: Common cluster algorithms (K-means and DBSCAN), and how to evaluate and visualize them
Clustering Evaluation Strategies: Important factors and criteria for evaluating clustering results
Three Performance Evaluation Metrics of Clustering When Ground Truth Labels Are Not Available: Silhouette Coefficient, and more evaluation metrics
Clustering Technique for Categorical Data in python: How to cluster categorical data with k-modes and k-prototypes
Top Explainable AI (XAI) Python Frameworks in 2022: A summary of Explainable AI frameworks, including SHAP, LIME, Shapash, etc.
Thresholding Outlier Detection Scores with PyThresh: A new package to automatically determine thresholds for outlier detection scores
Finding the Best Classification Threshold for Imbalanced Classifications with the Interactive Confusion Matrix and Line Charts: A great new package that helps visualize the confusion matrix and determine the best classification threshold
Feature Investigation: Automatically Detect Drift in your Data Over Time: A framework designed to automatically detect feature drift, especially for fraud ML models
Implementation and Limitations of Imputation Methods: Different imputation methods and their use cases, pros, and cons
Social Network Analysis and Spectral Clustering in Graphs and Networks: How to measure centrality in a social network graph and how to cluster its nodes
Machine Learning Models and Their Assumptions: A good summary of the assumptions behind 8 popular machine learning models
Synthetic Data in E-commerce: How synthetic data can be used in the e-commerce industry
Are You Interpreting Your Logistic Regression Correctly?: How to correctly interpret coefficients of logistic regression
A 6-Minute Introduction to Causal AI: What Causal AI is and why it is important
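
Since several of the posts above rely on the silhouette coefficient, here is a minimal sketch of how it is calculated for each point: s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. This is my own illustration, not code from any of the linked posts. The function name silhouette_scores, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (hypothetical helper)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy data (made up): two tight pairs plus one point halfway between them.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 0.5)]
labels = [0, 0, 1, 1, 0]
scores = silhouette_scores(points, labels)
for point, label, score in zip(points, labels, scores):
    print(point, label, round(score, 3))
```

Scores close to 1 mean a point sits well inside its cluster, scores near 0 mean it lies between clusters, and negative scores suggest it may belong to a different cluster. The halfway point in the toy data is the kind of borderline case or potential outlier the posts above discuss. Averaging the scores over all points gives a single number for comparing different cluster counts.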

Causal Inference

Reporting the Impact of Your A/B Tests Correctly: How to report the actual impact of A/B tests on the topline metric, rather than on a small population
Assessing the Influence of Outliers in A/B Experiments: Quantile Functions and Sensitivity Analysis: How to assess the influence of outliers and how to handle them
4 Python Packages to Learn Causal Analysis: Four packages that could be used for causal inference in Python
Synthetic Control Method: Synthetic control method and an example of how it works
Causal Inference with Linear Regression: Omitted variables and Irrelevant variables: What omitted variables and irrelevant variables are when doing causal inference with linear regression
Causal Inference with Linear Regression: Endogeneity: What endogeneity is in linear regression and how to handle it

Statistics

Parametric vs. Non-parametric Tests, and When to Use Them: A summary of common parametric and non-parametric tests and their use cases
P-value vs. T-value: what’s the difference?: A good recap of what the t-value and the p-value are
The Statistical Magic behind the Bootstrap: Why bootstrap works
Chi-Square Distribution Simply Explained: A great walk-through of what the Chi-Square distribution is and how to derive its PDF
Gamma Distribution Simply Explained: Another great walk-through on the Gamma Distribution, from the same author as the above article

Analytics

6 Hierarchical Data Visualizations: Six ways to visualize data in hierarchical structures
Five Advanced Data Visualizations All Data Scientists Should Know: Five lesser-known but useful data visualizations
Data Analyst’s Guide in Handling Flooding Data Ad-hoc Requests: Many data analysts face a flood of ad-hoc requests every day, and this article provides some great advice on how to handle them


Yu Dong
