This is a summary of the great Medium posts Elise and I read in July and August. I have been working on a customer segmentation project at work, so you will find a lot of clustering-related posts this time. Please enjoy :)

Machine Learning

  1. When Clustering Doesn’t Make Sense: Five simple guidelines to see if clustering is really an appropriate solution for your data
  2. DBSCAN Python Example: The Optimal Value For Epsilon (EPS): An algorithm for finding the optimal epsilon for DBSCAN
  3. Using Non-negative Matrix Factorization to Classify Companies: A case study on using NMF for clustering
  4. Handling Outliers in Clusters using Silhouette Analysis: How the silhouette coefficient is calculated, and how it can help determine the optimal number of clusters and identify outliers
  5. Silhouette Method — Better than Elbow Method to find Optimal Clusters: Another post on how to use the silhouette method to find the optimal number of clusters
  6. Cluster Analysis: Create, Visualize and Interpret Customer Segments: Common clustering algorithms (K-means and DBSCAN), and how to evaluate and visualize them
  7. Clustering Evaluation Strategies: Important factors and criteria for evaluating clustering results
  8. Three Performance Evaluation Metrics of Clustering When Ground Truth Labels Are Not Available: The silhouette coefficient and other evaluation metrics
  9. Clustering Technique for Categorical Data in python: How to cluster categorical data with k-modes and k-prototypes
  10. Top Explainable AI (XAI) Python Frameworks in 2022: A summary of Explainable AI frameworks, including SHAP, LIME, Shapash, etc.
  11. Thresholding Outlier Detection Scores with PyThresh: A new package that automatically determines thresholds for outlier detection scores
  12. Finding the Best Classification Threshold for Imbalanced Classifications with the Interactive Confusion Matrix and Line Charts: A great new package that helps visualize the confusion matrix and determine the best classification threshold
  13. Feature Investigation: Automatically Detect Drift in your Data Over Time: A framework designed to automatically detect feature drift, especially for fraud ML models
  14. Implementation and Limitations of Imputation Methods: Different imputation methods and their use cases, pros, and cons
  15. Social Network Analysis and Spectral Clustering in Graphs and Networks: How to measure centrality in a social network graph and how to cluster its nodes
  16. Machine Learning Models and Their Assumptions: A good summary of the assumptions behind 8 popular machine learning models
  17. Synthetic Data in E-commerce: How synthetic data can be used in the e-commerce industry
  18. Are You Interpreting Your Logistic Regression Correctly?: How to correctly interpret coefficients of logistic regression
  19. A 6-Minute Introduction to Causal AI: What Causal AI is and why it is important

Causal Inference

  1. Reporting the Impact of Your A/B Tests Correctly: How to report the actual impact of A/B tests on the topline metric, rather than on a small population
  2. Assessing the Influence of Outliers in A/B Experiments: Quantile Functions and Sensitivity Analysis: How to assess the influence of outliers and how to handle them
  3. 4 Python Packages to Learn Causal Analysis: Four packages that could be used for causal inference in Python
  4. Synthetic Control Method: Synthetic control method and an example of how it works
  5. Causal Inference with Linear Regression: Omitted variables and Irrelevant variables: What omitted variables and irrelevant variables are when doing causal inference with linear regression
  6. Causal Inference with Linear Regression: Endogeneity: What endogeneity is in linear regression and how to handle it

Statistics

  1. Parametric vs. Non-parametric Tests, and When to Use Them: A summary of common parametric and non-parametric tests and their use cases
  2. P-value vs. T-value: what’s the difference?: A good recap of what the t-value and the p-value are
  3. The Statistical Magic behind the Bootstrap: Why the bootstrap works
  4. Chi-Square Distribution Simply Explained: A great walk-through of what the Chi-Square distribution is and how to derive its PDF
  5. Gamma Distribution Simply Explained: Another great walk-through, this time of the Gamma distribution, by the same author as the article above

Analytics

  1. 6 Hierarchical Data Visualizations: Six ways to visualize data in hierarchical structures
  2. Five Advanced Data Visualizations All Data Scientists Should Know: Five lesser-known but useful data visualizations
  3. Data Analyst’s Guide in Handling Flooding Data Ad-hoc Requests: Many data analysts face a flood of ad-hoc requests every day, and this article offers some great advice on how to handle them