Reading Notes 2022 Jul - Aug
This is a summary of the great Medium posts Elise and I read in July and August. I have been working on a customer segmentation project at work, so you will find a lot of clustering-related posts this time. Please enjoy :)
Machine Learning
- When Clustering Doesn’t Make Sense: Five simple guidelines to see if clustering is really an appropriate solution for your data
- DBSCAN Python Example: The Optimal Value For Epsilon (EPS): An algorithm for finding the optimal epsilon for DBSCAN
- Using Non-negative Matrix Factorization to Classify Companies: A case study on using NMF for clustering
- Handling Outliers in Clusters using Silhouette Analysis: How the silhouette coefficient is calculated, and how it can help determine the optimal number of clusters and identify outliers
- Silhouette Method — Better than Elbow Method to find Optimal Clusters: Another post on how to use the silhouette method to find the optimal number of clusters
- Cluster Analysis: Create, Visualize and Interpret Customer Segments: Common clustering algorithms (K-means and DBSCAN), and how to evaluate and visualize them
- Clustering Evaluation Strategies: Important factors and criteria for evaluating clustering results
- Three Performance Evaluation Metrics of Clustering When Ground Truth Labels Are Not Available: The silhouette coefficient and other evaluation metrics
- Clustering Technique for Categorical Data in python: How to cluster categorical data with k-modes and k-prototypes
- Top Explainable AI (XAI) Python Frameworks in 2022: A summary of Explainable AI frameworks, including SHAP, LIME, Shapash, etc.
- Thresholding Outlier Detection Scores with PyThresh: A new package that automatically determines thresholds for outlier detection scores
- Finding the Best Classification Threshold for Imbalanced Classifications with the Interactive Confusion Matrix and Line Charts: A great new package that helps visualize the confusion matrix and determine the best classification threshold
- Feature Investigation: Automatically Detect Drift in your Data Over Time: A framework designed to automatically detect feature drift, especially for fraud ML models
- Implementation and Limitations of Imputation Methods: Different imputation methods and their use cases, pros, and cons
- Social Network Analysis and Spectral Clustering in Graphs and Networks: How to measure centrality in a social network graph and how to cluster its nodes
- Machine Learning Models and Their Assumptions: A good summary of the assumptions behind 8 popular machine learning models
- Synthetic Data in E-commerce: How synthetic data can be used in the e-commerce industry
- Are You Interpreting Your Logistic Regression Correctly?: How to correctly interpret the coefficients of a logistic regression
- A 6-Minute Introduction to Causal AI: What Causal AI is and why it is important
Causal Inference
- Reporting the Impact of Your A/B Tests Correctly: How to report the actual impact of A/B tests on the topline metric, instead of on a small population
- Assessing the Influence of Outliers in A/B Experiments: Quantile Functions and Sensitivity Analysis: How to assess the influence of outliers and how to handle them
- 4 Python Packages to Learn Causal Analysis: Four packages that can be used for causal inference in Python
- Synthetic Control Method: The synthetic control method and an example of how it works
- Causal Inference with Linear Regression: Omitted variables and Irrelevant variables: What omitted variables and irrelevant variables are when doing causal inference with linear regression
- Causal Inference with Linear Regression: Endogeneity: What endogeneity is in linear regression and how to handle it
Statistics
- Parametric vs. Non-parametric Tests, and When to Use Them: A summary of common parametric and non-parametric tests and their use cases
- P-value vs. T-value: what’s the difference?: A good recap of what the t-value and the p-value are
- The Statistical Magic behind the Bootstrap: Why the bootstrap works
- Chi-Square Distribution Simply Explained: A great walk-through of what the chi-square distribution is and how to derive its PDF
- Gamma Distribution Simply Explained: Another great walk-through, this time of the gamma distribution, by the same author as the article above
Analytics
- 6 Hierarchical Data Visualizations: Six ways to visualize data in hierarchical structures
- Five Advanced Data Visualizations All Data Scientists Should Know: Five lesser-known but useful data visualizations
- Data Analyst’s Guide in Handling Flooding Data Ad-hoc Requests: Many data analysts face a flood of ad-hoc requests every day, and this article offers some great advice on how to handle them