In this project, I will use unsupervised learning techniques to identify key customer segments for a German mail-order sales company. The goal is to target marketing campaigns more effectively by focusing on the most promising segments. The data is provided by Bertelsmann Arvato Analytics, representing a real-world data science challenge.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import operator
from sklearn.cluster import KMeans
# magic word for producing visualizations in notebook
%matplotlib inline
There are four files associated with this project (not including this one):

- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but each row also includes information beyond the individual, such as their household, building, and neighborhood. This information is used to cluster the general population into groups with similar demographic properties, and then to see how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data compared to the general population; those over-represented clusters can be assumed to form part of the core userbase. This information can then be used for further applications, such as targeting a marketing campaign.
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
# Check the structure of the data after it's loaded
azdias.shape
azdias.describe()
azdias.head(5)
feat_info.shape
feat_info.describe()
feat_info.head(40)
I started by evaluating the demographics data for missing values, using the feature summary file to analyze each column and guide my cleaning decisions. Observations and decisions are documented in the discussion cells throughout.
Next, I converted any 'missing' or 'unknown' value codes from the data dictionary into NaN values for consistency. This required parsing the data since missing codes were read as strings. I also noted how much data was flagged as missing or unknown compared to naturally missing data.
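For example, here is a minimal sketch of how a single missing_or_unknown entry is parsed; the entry '[-1,X]' is an assumed example of the bracketed string format in the feature summary:
# parse one missing_or_unknown entry into a list of codes (non-numeric codes stay as strings)
sample = '[-1,X]'
parsed = [int(v) if v not in ('X', 'XX', '') else v for v in sample.strip('[]').split(',')]
parsed  # -> [-1, 'X']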
#values missing before conversion
azdias.isna().sum().sum()
# Identify missing or unknown data values and convert them to NaNs.
for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
missing_values = miss.strip('[]').split(',')
miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
if missing_values != ['']:
azdias[attribute] = azdias[attribute].replace(missing_values, np.nan)
#values missing after conversion
azdias.isna().sum().sum()
I used Matplotlib's hist() function to visualize the distribution of missing values across columns, which helped me identify outlier columns with a high proportion of missing data. I decided to remove these columns from the dataframe, documenting any notable findings about them in the discussion section. Additionally, I examined the remaining features for patterns in missing data, such as groups of columns that share similar missing-value counts.
#Perform an assessment of how much missing data there is in each column of the dataset.
#columns with null values
col_null = [col for col in azdias.columns if azdias[col].isna().any()]
col_null
#getting the number of missing null values for those columns
col_null_counts = {col: azdias[col].isna().sum() for col in col_null}
col_null_counts
# Investigate patterns in the amount of missing data in each column.
#% of missing data for each column
p_col_null = {col: (azdias[col].isna().mean() * 100) for col in col_null}
p_col_null
#converting to df to make it easier to merge and plot
df1 = pd.DataFrame(list(col_null_counts.items()), columns=['Attribute', 'Count'])
df2 = pd.DataFrame(list(p_col_null.items()), columns=['Attribute', 'Percentage'])
df2.shape
#merging into one null df
null_df = pd.merge(df1, df2, on='Attribute')
null_df.head()
#sorting the null values by count
null_df = null_df.sort_values(by='Count', ascending=False)
null_df
#sorting null values by percentage
null_df.sort_values(by='Percentage', ascending=False)
#histogram of count column
null_df['Count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Count');
#histogram of percentage column
null_df['Percentage'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Percentage');
# Investigate patterns in the amount of missing data in each column.
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');
# Removed the outlier columns from the dataset.
outliers = null_df.loc[null_df['Percentage'] > 20, 'Attribute'].tolist()
outliers
#dropping the columns
azdias = azdias.drop(columns=outliers)
azdias.shape #shows that 6 columns were dropped
#dropping outlier rows from null df
null_df = null_df[~null_df['Attribute'].isin(outliers)]
#looking at the plot again of the remaining columns with nulls
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');
Among the 85 columns in this dataset, 6 columns exhibited null values exceeding 20%. The majority of columns with missing values reported nulls in 8% to 18% of the fields. Additionally, several columns shared identical percentages of null values with each other.
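To back up the observation about shared percentages, a quick optional check (a sketch using the null_df built above, not part of the original cleaning pipeline) groups columns by their exact null percentage:
#optional check: columns that share the exact same percentage of nulls
shared_nulls = null_df.groupby('Percentage')['Attribute'].apply(list)
shared_nulls[shared_nulls.apply(len) > 1]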
Next, I assessed missing data at the row level. I divided the dataset into two subsets: one with rows that exceed a threshold of missing values and another with rows below that threshold.
To determine how to handle outlier rows, I compared the distributions of non-missing data across both subsets. I selected at least five columns with little to no missing data and used Seaborn's countplot() and Matplotlib's subplot() to create side-by-side comparisons.
Based on the comparison results, I decided whether dropping the rows with many missing values would affect the analysis. Regardless, I continued the analysis using the subset with fewer missing values.
# How much data is missing in each row of the dataset?
azdias['row_null_count'] = azdias.isnull().sum(axis=1)
azdias['row_null_count'].describe()
azdias['row_null_count'].value_counts()
azdias['row_null_count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Null Counts Per Row');
#divide the data into two subsets based on the number of missing values in each row.
data_low = azdias.query('row_null_count <= 25')
data_high = azdias.query('row_null_count > 25')
#number of rows in the low subset
data_low.shape
#number of rows in the high subset
data_high.shape
#percentage of rows in the high subset
data_high.shape[0] / azdias.shape[0] * 100
#compare the distribution of values for columns where there are no or few missing values, between the two subsets.
#finding which columns have no null values
col_not_null = [col for col in azdias.columns if azdias[col].notnull().all()]
col_not_null
#number of columns with no null
len(col_not_null)
def compare_column_distribution(data1, data2, column):
'''
This function creates the same graph for two data sets side by side.
data1 - first data set
data2 - second data set
column - name of the column
'''
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(x=column, data=data1)
plt.title(f'Distribution in {column} data1')
plt.subplot(1, 2, 2)
sns.countplot(x=column, data=data2)
plt.title(f'Distribution in {column} data2')
plt.show();
#running a loop to input all of the columns through the function
for col in col_not_null:
compare_column_distribution(data_low, data_high, col)
The subset with few or no missing values per row (25 or fewer) differs starkly from the subset with many missing values: for several of the columns compared above, the two subsets show noticeably different value distributions.
It's essential to prepare the dataset for analysis by ensuring all features are encoded numerically. Although most values are numbers, not all of them represent numeric data, so I assessed the categorical and mixed-type features and decided which to keep as-is, which to re-encode, and which to drop.
After making these decisions, I created a new dataframe with the selected and engineered columns, ready for machine learning.
#creating copy of feat_info df
copy_feat = feat_info.copy()
copy_feat.shape
#dropping rows in new df to match the previous dropped columns
copy_feat = copy_feat[~copy_feat['attribute'].isin(outliers)]
copy_feat.shape
#how many features are there of each data type?
copy_feat.type.value_counts()
#creating the list
cat_to_check = []
mix_to_check = []
#appending to the list all the columns that will need to be checked
for index, row in copy_feat.iterrows():
if row['type'] == 'categorical':
cat_to_check.append(row['attribute'])
elif row['type'] == 'mixed':
mix_to_check.append(row['attribute'])
#verifying that the correct total has been added to the list
(len(cat_to_check), len(mix_to_check))
#assessing the categorical variables: which are binary, which are multi-level, and which one needs to be re-encoded?
#create empty lists
binary_feat = []
multi_feat = []
#separating binary and multi-level categorical variables
for i in cat_to_check:
if (data_low[i].nunique() > 2):
multi_feat.append(i)
else:
binary_feat.append(i)
#checking the binary to find the one with non-numeric values
data_low[binary_feat].head()
#checking the values
data_low.OST_WEST_KZ.unique()
#re-encode the values as numbers
data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
#checking that it worked
data_low.OST_WEST_KZ.unique()
data_low[multi_feat].head()
# Re-encode categorical variable(s) to be kept in the analysis.
data_low= data_low.drop(multi_feat, axis = 1)
data_low.shape
To streamline the procedure, all multi-level categorical variables were dropped. Additionally, the binary variable OST_WEST_KZ was re-encoded with numeric values.
#checking which columns are mixed variables
mix_to_check
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
data_low.PRAEGENDE_JUGENDJAHRE.value_counts()
'''
From the data dictionary -
- 1: 40s - war years (Mainstream, E+W)
- 2: 40s - reconstruction years (Avantgarde, E+W)
- 3: 50s - economic miracle (Mainstream, E+W)
- 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
- 5: 60s - economic miracle (Mainstream, E+W)
- 6: 60s - generation 68 / student protestors (Avantgarde, W)
- 7: 60s - opponents to the building of the Wall (Avantgarde, E)
- 8: 70s - family orientation (Mainstream, E+W)
- 9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)
'''
# will code Avantgarde = 0 and Mainstream = 1
# 40s = 0, 50s = 1, 60s = 2, 70s = 3, 80s = 4, 90s = 5
decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
#creating new columns
data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
#checking that the columns populated correctly
data_low.head()
#creating the maps for the lists in the dict
decade_map = {value: key for key, values in decade.items() for value in values}
movement_map= {value: key for key, values in movement.items() for value in values}
#replacing and checking decade column
data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
data_low['DECADE'].head()
#drop old column
data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
#replacing and checking decade column
data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
data_low['MOVEMENT'].head()
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
data_low['CAMEO_INTL_2015'].head()
'''
From the data dictionary
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families & Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown
'''
#creating new columns
data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
data_low.head()
#slicing the strings: tens digit = wealth, ones digit = life stage
data_low['WEALTH'] = data_low['WEALTH'].astype(str).str[0]
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
data_low['LIFE_STAGE'].head()
#checking the values
data_low['LIFE_STAGE'].unique()
#creating masks to fix the null values
m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
#fixing the null values
data_low.loc[m, 'LIFE_STAGE'] = np.nan
data_low.loc[ma, 'WEALTH'] = np.nan
#double checking
data_low['LIFE_STAGE'].unique()
data_low['WEALTH'].unique()
#changing the type
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
data_low['WEALTH']= data_low['WEALTH'].astype(float)
#dropping the original column
data_low = data_low.drop(columns = 'CAMEO_INTL_2015')
#dropping mixed columns that have duplicate info
data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB'])
#checking the data type of this column
data_low['WOHNLAGE'].describe()
data_low['PLZ8_BAUMAX'].describe()
I generated two new columns, 'DECADE' and 'MOVEMENT', from the original mixed feature "PRAEGENDE_JUGENDJAHRE", using the value descriptions in the data dictionary. Two dictionaries specify the new values: 'MOVEMENT' is a binary variable (1 for Mainstream, 0 for Avantgarde), while 'DECADE' is an interval-type variable (0 for the 40s through 5 for the 90s).
After verifying that the data had been transferred correctly, I removed the original "PRAEGENDE_JUGENDJAHRE" column.
Similarly, the "CAMEO_INTL_2015" column contains two-digit codes in which the tens digit represents wealth and the ones digit represents life stage, so I split it into 'WEALTH' and 'LIFE_STAGE'. Once the new columns were verified, I converted the leftover 'nan' strings back to proper null values and dropped the original "CAMEO_INTL_2015" column.
The columns 'WOHNLAGE' and 'PLZ8_BAUMAX' already use ordered numeric codes, so they were left unchanged. The columns 'LP_LEBENSPHASE_FEIN' and 'LP_LEBENSPHASE_GROB' contain life stage information that is already captured elsewhere, so they were dropped.
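As a quick sanity check of the split (a minimal sketch using a code taken from the data dictionary excerpt above):
#sanity check: '51' should map to WEALTH 5 (Poorer Households) and LIFE_STAGE 1 (Pre-Family Couples & Singles)
sample_code = '51'
(int(sample_code[0]), int(sample_code[-1]))  # -> (5, 1)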
data_low.info()
#dropping row_null_count since it is no longer needed
data_low = data_low.drop(columns = 'row_null_count')
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
# convert missing value codes into NaNs, ...
for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
missing_values = miss.strip('[]').split(',')
miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
if missing_values != ['']:
df[attribute] = df[attribute].replace(missing_values, np.nan)
# remove selected columns and rows, ...
outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
df = df.drop(columns=outliers)
df['row_null_count'] = df.isnull().sum(axis=1)
data_low = df.query('row_null_count <= 25')
# select, re-encode, and engineer column values.
#### correcting column 'OST_WEST_KZ'
data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
#### dropping multi-feature columns
multi_feat = ['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN',
'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015',
'CAMEO_DEU_2015']
data_low= data_low.drop(multi_feat, axis = 1)
###correcting "PRAEGENDE_JUGENDJAHRE"
decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
decade_map = {value: key for key, values in decade.items() for value in values}
movement_map= {value: key for key, values in movement.items() for value in values}
data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
####correcting 'CAMEO_INTL_2015'
data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
data_low['WEALTH'] = data_low['WEALTH'].astype(str).str[0]
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
data_low.loc[m, 'LIFE_STAGE'] = np.nan
data_low.loc[ma, 'WEALTH'] = np.nan
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
data_low['WEALTH']= data_low['WEALTH'].astype(float)
data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'CAMEO_INTL_2015', 'row_null_count'])
####fill nulls
fill_null = SimpleImputer(strategy='median')
new_df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
# Return the cleaned dataframe.
return new_df
Before applying dimensionality reduction, I performed feature scaling to ensure that the principal component vectors are not influenced by differences in feature scales. Using the `StandardScaler` from `sklearn`, I scaled each feature to have a mean of 0 and a standard deviation of 1.
Before scaling, I made sure the data was free of missing values. I considered different approaches, such as removing data points with missing values or applying imputation, and selected the method that best suited the amount of missing data in the dataset.
Finally, I used the `.fit_transform()` method to both fit the scaler and transform the data, ensuring that the scaling parameters are saved for later use on customer demographics data.
(data_low.shape[0], data_low.isna().sum().sum())
#will fill in the null values with the median value for the column
fill_null = SimpleImputer(strategy='median')
#creating a new df with the null values filled in
#used https://stackoverflow.com/questions/33660836/impute-entire-dataframe-all-columns-using-scikit-learn-sklearn-without-itera
# for syntax
df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
##checking to make sure the column names transferred over as expected
df.head()
#checking to make sure there are no null values
df.isna().sum().sum()
# Apply feature scaling to the general population demographics data.
s = StandardScaler()
s.fit(df)
s_df = s.transform(df)
s_df = pd.DataFrame(s_df, columns=df.columns)
Several fields within the dataset contained null values, and dropping every affected row was impractical. Instead, an imputer was used to replace each null with the median of its column, an approach chosen to minimize the impact on the subsequent analysis.
Now that the data is scaled, I applied Principal Component Analysis (PCA) using `sklearn`'s PCA class. Initially, I computed all of the components in order to capture the general trend in variability.
I examined the explained variance ratio for each principal component and plotted the cumulative variance using Matplotlib's `plot()` function. Based on the results, I selected the optimal number of components to retain for clustering.
Finally, I re-fitted the PCA instance to transform the data according to the chosen number of components. This reduced dataset will be used for the next steps in the project.
# Apply PCA to the data.
pca = PCA()
fit_pca = pca.fit(s_df)
# Investigate the variance accounted for by each principal component.
fit_pca.explained_variance_ratio_
#scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(fit_pca.explained_variance_ratio_.shape[0]), np.cumsum(fit_pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.ylabel("Variance Explained")
plt.show();
# Re-apply PCA to the data while selecting for number of components to retain.
pca = PCA(n_components = 30)
data_pca = pca.fit_transform(s_df)
data_pca
pca.explained_variance_ratio_.sum()
Following the initial fit and the resulting scree plot, I chose to retain 30 principal components, since components beyond that point each explain only a small additional share of the variance.
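As a quick way to double-check that choice (a sketch, not part of the original selection; the 1% cutoff is an assumed rule of thumb for "no longer significant"):
#cumulative variance captured by the first 30 components
np.cumsum(fit_pca.explained_variance_ratio_)[29]
#index of the first component whose marginal explained variance falls below 1%
np.argmax(fit_pca.explained_variance_ratio_ < 0.01)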
With our transformed principal components, it’s helpful to examine the influence of each original feature on the first few components. Each principal component is a unit vector pointing in the direction of highest variance. Features with weights far from zero have the most influence on that component.
To analyze this, I mapped each weight to its corresponding feature, sorted the features by their weights, and focused on the most prominent features at both ends of the list. Features with large weights of the same sign suggest positive correlations, while opposite signs indicate negative correlations.
I explored the feature associations for the first three principal components by writing a function to print the sorted list of feature weights for any component. This function will be useful later when interpreting cluster tendencies. The data dictionary helped me understand the relationships between these prominent features and what their positive or negative values might signify.
#map weights for the first principal component to corresponding feature names and then print the linked values, sorted by weight.
def weights(pca, i):
'''
i = index of the desired principal component (0-based)
'''
weight_map = {}
for j, feature in enumerate(s_df.columns):
weight_map[feature] = pca.components_[i][j]
sorted_weight = sorted(weight_map.items(), key=operator.itemgetter(1), reverse=True)
return sorted_weight
#using function to map first principal component
first = weights(pca, 0)
first
#map weights for the second principal component to corresponding feature names and then print the linked values, sorted by weight.
second = weights(pca, 1)
second
#map weights for the third principal component to corresponding feature names
third = weights(pca, 2)
third
In the first principal component, the feature with the largest positive weight is PLZ8_ANTG3, the number of 6-10 family (large family) houses in the region. The feature with the most negative weight is MOBI_REGIO, which characterizes movement patterns.
In the second principal component, the most positively weighted feature is ALTERSKATEGORIE_GROB, an estimated age based on analysis of the person's given name. The feature with the most negative weight is SEMIO_REL, the religious personality typology.
The third principal component is dominated on the positive side by SEMIO_VERT, the dreamful personality typology, and on the negative side by ANREDE_KZ, which corresponds to gender.
Now that the data is scaled and transformed, it's time to apply k-means clustering to see how the data groups in the principal-component space. My approach: run k-means over a range of cluster counts, compute the within-cluster sum of squared distances for each count, plot those scores to look for an elbow, and then re-fit the model with the chosen number of clusters.
This process will help identify the natural groupings in the data and set the stage for further analysis.
# Over a number of different cluster counts...
k = list(range(5, 20, 2)) #list 5-19 odd
def get_kmeans(data, k):
# run k-means clustering on the data and...
kmeans = KMeans(n_clusters=k)
model = kmeans.fit(data)
# compute the within-cluster sum of squared distances (score() returns its negative).
score = np.abs(model.score(data))
return score
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.
scores = []
for i in k:
scores.append(get_kmeans(data_pca, i))
#scree plot
plt.plot(k, scores, linestyle='solid', marker='o', color='g');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=15)
model = kmeans.fit(data_pca)
gen_pop_predict = model.predict(data_pca)
#finding cluster centers
cluster_centers = model.cluster_centers_
#scatter plot of the data points for each cluster
plt.figure(figsize=(8, 6))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=gen_pop_predict, cmap='PRGn')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='o', s=50, label='Cluster Centers')
plt.title('K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend();
According to the scree plot, 15 clusters is roughly the point at which additional clusters cease to provide a meaningful reduction in the within-cluster distance.
Now that the clusters and cluster centers have been identified for the general population, the next step is to map the customer data onto these clusters. This involves applying the transformations and models fitted on the general population to the customer data, without re-fitting them.
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv', sep=';')
customers.head()
customers.isna().sum().sum()
customers.shape[0]
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
clean_customers = clean_data(customers)
clean_customers.head()
clean_customers.shape[0]
#scale
cust_s = s.transform(clean_customers)
cust_s = pd.DataFrame(cust_s, columns=clean_customers.columns)
cust_s.head()
#transforming customers using pca
pca_customers = pca.transform(cust_s)
pca_customers
#predicting clusters
cust_predict = model.predict(pca_customers)
cust_predict
cust_predict.shape
The final task is to compare the cluster distributions between the general population and the customer base to identify which clusters are overrepresented or underrepresented among the customers.
#plots showing clusters
#used for syntax https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))
sns.countplot(gen_pop_predict, ax=ax1)
ax1.set_title('General Clusters')
sns.countplot(cust_predict, ax=ax2)
ax2.set_title('Customer Clusters');
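The count plots compare raw counts, but the two datasets differ in size, so a proportion-based comparison (a sketch using the predictions computed above) makes over- and under-representation easier to read:
#proportion of people assigned to each cluster in each dataset
gen_prop = pd.Series(gen_pop_predict).value_counts(normalize=True).sort_index()
cust_prop = pd.Series(cust_predict).value_counts(normalize=True).sort_index()
cluster_compare = pd.DataFrame({'general': gen_prop, 'customers': cust_prop})
cluster_compare['difference'] = cluster_compare['customers'] - cluster_compare['general']
cluster_compare.sort_values(by='difference', ascending=False)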
#What kinds of people are part of a cluster that is overrepresented in the customer data compared to the general population?
#used for syntax https://stackoverflow.com/questions/49885007/how-to-use-scikit-learn-inverse-transform-with-new-values
cluster_11 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[11]))
over = pd.Series(data = cluster_11, index=clean_customers.columns)
over
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_10 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[10]))
under = pd.Series(data = cluster_10, index=clean_customers.columns)
under
The top features for the cluster over-represented in the customer data (cluster 11) are also the top features for the under-represented cluster (cluster 10):
- ALTERSKATEGORIE_GROB - estimated age based on given name analysis - 46-60 year olds
- ANREDE_KZ - gender - male
- FINANZ_MINIMALIST - MINIMALIST: low financial interest - average to lowest