Project: Identify Customer Segments

In this project, I will use unsupervised learning techniques to identify key customer segments for a German mail-order sales company. The goal is to target marketing campaigns more effectively by focusing on the most promising segments. The data is provided by Bertelsmann Arvato Analytics, representing a real-world data science challenge.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import operator  
from sklearn.cluster import KMeans

# magic word for producing visualizations in notebook
%matplotlib inline

Out[1]:

Step 0: Load the Data

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. This information is used to cluster the general population into groups with similar demographic properties. Then see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.

In [2]:
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')

# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
In [3]:
# Check the structure of the data after it's loaded 
azdias.shape
Out[3]:
(891221, 85)
In [4]:
azdias.describe()
Out[4]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
count 891221.000000 891221.000000 891221.000000 886367.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 ... 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 794005.000000 794005.000000 794005.00000
mean -0.358435 2.777398 1.522098 3.632838 3.074528 2.821039 3.401106 3.033328 2.874167 3.075121 ... 2.253330 2.801858 1.595426 0.699166 1.943913 3.612821 3.381087 3.167854 5.293002 3.07222
std 1.198724 1.068775 0.499512 1.595021 1.321055 1.464749 1.322134 1.529603 1.486731 1.353248 ... 0.972008 0.920309 0.986736 0.727137 1.459654 0.973967 1.111598 1.002376 2.303739 1.36298
min -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.00000
25% -1.000000 2.000000 1.000000 2.000000 2.000000 1.000000 3.000000 2.000000 2.000000 2.000000 ... 1.000000 2.000000 1.000000 0.000000 1.000000 3.000000 3.000000 3.000000 4.000000 2.00000
50% -1.000000 3.000000 2.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ... 2.000000 3.000000 2.000000 1.000000 1.000000 4.000000 3.000000 3.000000 5.000000 3.00000
75% -1.000000 4.000000 2.000000 5.000000 4.000000 4.000000 5.000000 5.000000 4.000000 4.000000 ... 3.000000 3.000000 2.000000 1.000000 3.000000 4.000000 4.000000 4.000000 7.000000 4.00000
max 3.000000 9.000000 2.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 ... 4.000000 4.000000 3.000000 2.000000 5.000000 5.000000 5.000000 9.000000 9.000000 9.00000

8 rows × 81 columns

In [5]:
azdias.head(5)
Out[5]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

In [6]:
feat_info.shape
Out[6]:
(85, 4)
In [7]:
feat_info.describe()
Out[7]:
attribute information_level type missing_or_unknown
count 85 85 85 85
unique 85 9 5 9
top FINANZ_MINIMALIST person ordinal [-1]
freq 1 43 49 26
In [8]:
feat_info.head(40)
Out[8]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
6 FINANZ_VORSORGER person ordinal [-1]
7 FINANZ_ANLEGER person ordinal [-1]
8 FINANZ_UNAUFFAELLIGER person ordinal [-1]
9 FINANZ_HAUSBAUER person ordinal [-1]
10 FINANZTYP person categorical [-1]
11 GEBURTSJAHR person numeric [0]
12 GFK_URLAUBERTYP person categorical []
13 GREEN_AVANTGARDE person categorical []
14 HEALTH_TYP person ordinal [-1,0]
15 LP_LEBENSPHASE_FEIN person mixed [0]
16 LP_LEBENSPHASE_GROB person mixed [0]
17 LP_FAMILIE_FEIN person categorical [0]
18 LP_FAMILIE_GROB person categorical [0]
19 LP_STATUS_FEIN person categorical [0]
20 LP_STATUS_GROB person categorical [0]
21 NATIONALITAET_KZ person categorical [-1,0]
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0]
23 RETOURTYP_BK_S person ordinal [0]
24 SEMIO_SOZ person ordinal [-1,9]
25 SEMIO_FAM person ordinal [-1,9]
26 SEMIO_REL person ordinal [-1,9]
27 SEMIO_MAT person ordinal [-1,9]
28 SEMIO_VERT person ordinal [-1,9]
29 SEMIO_LUST person ordinal [-1,9]
30 SEMIO_ERL person ordinal [-1,9]
31 SEMIO_KULT person ordinal [-1,9]
32 SEMIO_RAT person ordinal [-1,9]
33 SEMIO_KRIT person ordinal [-1,9]
34 SEMIO_DOM person ordinal [-1,9]
35 SEMIO_KAEM person ordinal [-1,9]
36 SEMIO_PFLICHT person ordinal [-1,9]
37 SEMIO_TRADV person ordinal [-1,9]
38 SHOPPER_TYP person categorical [-1]
39 SOHO_KZ person categorical [-1]

Step 1: Preprocessing

Step 1.1:

I started by evaluating the demographics data for missing values. Using the feature summary file, I analyzed each column to guide my cleaning decisions. It's important to document observations and decisions in the provided discussion cells.

Step 1.1.1: Convert Missing Value Codes to NaNs

Next, I converted any 'missing' or 'unknown' value codes from the data dictionary into NaN values for consistency. This required parsing the data since missing codes were read as strings. I also noted how much data was flagged as missing or unknown compared to naturally missing data.

In [9]:
#values missing before conversion
azdias.isna().sum().sum()
Out[9]:
4896838
In [10]:
# Identify missing or unknown data values and convert them to NaNs.

for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
    missing_values = miss.strip('[]').split(',')
    miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
    missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
    if missing_values != ['']:
        azdias[attribute] = azdias[attribute].replace(missing_values, np.nan)
In [11]:
#values missing after conversion
azdias.isna().sum().sum()
Out[11]:
8373929

Step 1.1.2: Assess Missing Data in Each Column

I used Matplotlib's hist() function to visualize the distribution of missing values across columns. This helped me identify outlier columns with a high proportion of missing data. I decided to remove these columns from the dataframe, although I documented any notable findings about them in the discussion section. Additionally, I examined the remaining features for any patterns in missing data, such as columns that share similar missing values

In [12]:
#Perform an assessment of how much missing data there is in each column of the dataset.

#columns with null values
col_null = [col for col in azdias.columns if azdias[col].isna().any()]
col_null
Out[12]:
['AGER_TYP',
 'ALTERSKATEGORIE_GROB',
 'CJT_GESAMTTYP',
 'GEBURTSJAHR',
 'GFK_URLAUBERTYP',
 'HEALTH_TYP',
 'LP_LEBENSPHASE_FEIN',
 'LP_LEBENSPHASE_GROB',
 'LP_FAMILIE_FEIN',
 'LP_FAMILIE_GROB',
 'LP_STATUS_FEIN',
 'LP_STATUS_GROB',
 'NATIONALITAET_KZ',
 'PRAEGENDE_JUGENDJAHRE',
 'RETOURTYP_BK_S',
 'SHOPPER_TYP',
 'SOHO_KZ',
 'TITEL_KZ',
 'VERS_TYP',
 'ALTER_HH',
 'ANZ_PERSONEN',
 'ANZ_TITEL',
 'HH_EINKOMMEN_SCORE',
 'KK_KUNDENTYP',
 'W_KEIT_KIND_HH',
 'WOHNDAUER_2008',
 'ANZ_HAUSHALTE_AKTIV',
 'ANZ_HH_TITEL',
 'GEBAEUDETYP',
 'KONSUMNAEHE',
 'MIN_GEBAEUDEJAHR',
 'OST_WEST_KZ',
 'WOHNLAGE',
 'CAMEO_DEUG_2015',
 'CAMEO_DEU_2015',
 'CAMEO_INTL_2015',
 'KBA05_ANTG1',
 'KBA05_ANTG2',
 'KBA05_ANTG3',
 'KBA05_ANTG4',
 'KBA05_BAUMAX',
 'KBA05_GBZ',
 'BALLRAUM',
 'EWDICHTE',
 'INNENSTADT',
 'GEBAEUDETYP_RASTER',
 'KKK',
 'MOBI_REGIO',
 'ONLINE_AFFINITAET',
 'REGIOTYP',
 'KBA13_ANZAHL_PKW',
 'PLZ8_ANTG1',
 'PLZ8_ANTG2',
 'PLZ8_ANTG3',
 'PLZ8_ANTG4',
 'PLZ8_BAUMAX',
 'PLZ8_HHZ',
 'PLZ8_GBZ',
 'ARBEIT',
 'ORTSGR_KLS9',
 'RELAT_AB']
In [13]:
#getting the number of missing null values for those columns
col_null_counts = {col: azdias[col].isna().sum() for col in col_null}
col_null_counts
Out[13]:
{'AGER_TYP': 685843,
 'ALTERSKATEGORIE_GROB': 2881,
 'CJT_GESAMTTYP': 4854,
 'GEBURTSJAHR': 392318,
 'GFK_URLAUBERTYP': 4854,
 'HEALTH_TYP': 111196,
 'LP_LEBENSPHASE_FEIN': 97632,
 'LP_LEBENSPHASE_GROB': 94572,
 'LP_FAMILIE_FEIN': 77792,
 'LP_FAMILIE_GROB': 77792,
 'LP_STATUS_FEIN': 4854,
 'LP_STATUS_GROB': 4854,
 'NATIONALITAET_KZ': 108315,
 'PRAEGENDE_JUGENDJAHRE': 108164,
 'RETOURTYP_BK_S': 4854,
 'SHOPPER_TYP': 111196,
 'SOHO_KZ': 73499,
 'TITEL_KZ': 889061,
 'VERS_TYP': 111196,
 'ALTER_HH': 310267,
 'ANZ_PERSONEN': 73499,
 'ANZ_TITEL': 73499,
 'HH_EINKOMMEN_SCORE': 18348,
 'KK_KUNDENTYP': 584612,
 'W_KEIT_KIND_HH': 147988,
 'WOHNDAUER_2008': 73499,
 'ANZ_HAUSHALTE_AKTIV': 99611,
 'ANZ_HH_TITEL': 97008,
 'GEBAEUDETYP': 93148,
 'KONSUMNAEHE': 73969,
 'MIN_GEBAEUDEJAHR': 93148,
 'OST_WEST_KZ': 93148,
 'WOHNLAGE': 93148,
 'CAMEO_DEUG_2015': 99352,
 'CAMEO_DEU_2015': 99352,
 'CAMEO_INTL_2015': 99352,
 'KBA05_ANTG1': 133324,
 'KBA05_ANTG2': 133324,
 'KBA05_ANTG3': 133324,
 'KBA05_ANTG4': 133324,
 'KBA05_BAUMAX': 476524,
 'KBA05_GBZ': 133324,
 'BALLRAUM': 93740,
 'EWDICHTE': 93740,
 'INNENSTADT': 93740,
 'GEBAEUDETYP_RASTER': 93155,
 'KKK': 158064,
 'MOBI_REGIO': 133324,
 'ONLINE_AFFINITAET': 4854,
 'REGIOTYP': 158064,
 'KBA13_ANZAHL_PKW': 105800,
 'PLZ8_ANTG1': 116515,
 'PLZ8_ANTG2': 116515,
 'PLZ8_ANTG3': 116515,
 'PLZ8_ANTG4': 116515,
 'PLZ8_BAUMAX': 116515,
 'PLZ8_HHZ': 116515,
 'PLZ8_GBZ': 116515,
 'ARBEIT': 97375,
 'ORTSGR_KLS9': 97274,
 'RELAT_AB': 97375}
In [14]:
# Investigate patterns in the amount of missing data in each column.
#% of missing data for each column 
p_col_null = {col: (azdias[col].isna().mean() * 100) for col in col_null}
p_col_null
Out[14]:
{'AGER_TYP': 76.955435296071343,
 'ALTERSKATEGORIE_GROB': 0.32326437550282144,
 'CJT_GESAMTTYP': 0.54464605299919999,
 'GEBURTSJAHR': 44.02028228688507,
 'GFK_URLAUBERTYP': 0.54464605299919999,
 'HEALTH_TYP': 12.476815514894735,
 'LP_LEBENSPHASE_FEIN': 10.954858559212585,
 'LP_LEBENSPHASE_GROB': 10.611509378706293,
 'LP_FAMILIE_FEIN': 8.7286991666489016,
 'LP_FAMILIE_GROB': 8.7286991666489016,
 'LP_STATUS_FEIN': 0.54464605299919999,
 'LP_STATUS_GROB': 0.54464605299919999,
 'NATIONALITAET_KZ': 12.153551139391913,
 'PRAEGENDE_JUGENDJAHRE': 12.136608091595686,
 'RETOURTYP_BK_S': 0.54464605299919999,
 'SHOPPER_TYP': 12.476815514894735,
 'SOHO_KZ': 8.2470004634091882,
 'TITEL_KZ': 99.757635872583791,
 'VERS_TYP': 12.476815514894735,
 'ALTER_HH': 34.813699407890972,
 'ANZ_PERSONEN': 8.2470004634091882,
 'ANZ_TITEL': 8.2470004634091882,
 'HH_EINKOMMEN_SCORE': 2.0587486156632306,
 'KK_KUNDENTYP': 65.596748730112949,
 'W_KEIT_KIND_HH': 16.605084485217471,
 'WOHNDAUER_2008': 8.2470004634091882,
 'ANZ_HAUSHALTE_AKTIV': 11.176913470396231,
 'ANZ_HH_TITEL': 10.884842255736793,
 'GEBAEUDETYP': 10.451728583594866,
 'KONSUMNAEHE': 8.299737102245123,
 'MIN_GEBAEUDEJAHR': 10.451728583594866,
 'OST_WEST_KZ': 10.451728583594866,
 'WOHNLAGE': 10.451728583594866,
 'CAMEO_DEUG_2015': 11.147852216229195,
 'CAMEO_DEU_2015': 11.147852216229195,
 'CAMEO_INTL_2015': 11.147852216229195,
 'KBA05_ANTG1': 14.959701353536328,
 'KBA05_ANTG2': 14.959701353536328,
 'KBA05_ANTG3': 14.959701353536328,
 'KBA05_ANTG4': 14.959701353536328,
 'KBA05_BAUMAX': 53.468668265222654,
 'KBA05_GBZ': 14.959701353536328,
 'BALLRAUM': 10.518154307405233,
 'EWDICHTE': 10.518154307405233,
 'INNENSTADT': 10.518154307405233,
 'GEBAEUDETYP_RASTER': 10.452514022896677,
 'KKK': 17.73566825736826,
 'MOBI_REGIO': 14.959701353536328,
 'ONLINE_AFFINITAET': 0.54464605299919999,
 'REGIOTYP': 17.73566825736826,
 'KBA13_ANZAHL_PKW': 11.871354018812394,
 'PLZ8_ANTG1': 13.073637178657146,
 'PLZ8_ANTG2': 13.073637178657146,
 'PLZ8_ANTG3': 13.073637178657146,
 'PLZ8_ANTG4': 13.073637178657146,
 'PLZ8_BAUMAX': 13.073637178657146,
 'PLZ8_HHZ': 13.073637178657146,
 'PLZ8_GBZ': 13.073637178657146,
 'ARBEIT': 10.92602171627464,
 'ORTSGR_KLS9': 10.91468894920564,
 'RELAT_AB': 10.92602171627464}
In [15]:
#converting to df to make it easier to merge and plot
df1 = pd.DataFrame(list(col_null_counts.items()), columns=['Attribute', 'Count'])
df2 = pd.DataFrame(list(p_col_null.items()), columns=['Attribute', 'Percentage'])
In [16]:
df2.shape
Out[16]:
(61, 2)
In [17]:
#merging into one null df
null_df = pd.merge(df1, df2, on='Attribute')
null_df.head()
Out[17]:
Attribute Count Percentage
0 AGER_TYP 685843 76.955435
1 ALTERSKATEGORIE_GROB 2881 0.323264
2 CJT_GESAMTTYP 4854 0.544646
3 GEBURTSJAHR 392318 44.020282
4 GFK_URLAUBERTYP 4854 0.544646
In [18]:
#sorting the null values by count
null_df = null_df.sort_values(by='Count', ascending=False)
null_df
Out[18]:
Attribute Count Percentage
17 TITEL_KZ 889061 99.757636
0 AGER_TYP 685843 76.955435
23 KK_KUNDENTYP 584612 65.596749
40 KBA05_BAUMAX 476524 53.468668
3 GEBURTSJAHR 392318 44.020282
19 ALTER_HH 310267 34.813699
46 KKK 158064 17.735668
49 REGIOTYP 158064 17.735668
24 W_KEIT_KIND_HH 147988 16.605084
38 KBA05_ANTG3 133324 14.959701
36 KBA05_ANTG1 133324 14.959701
37 KBA05_ANTG2 133324 14.959701
41 KBA05_GBZ 133324 14.959701
39 KBA05_ANTG4 133324 14.959701
47 MOBI_REGIO 133324 14.959701
52 PLZ8_ANTG2 116515 13.073637
51 PLZ8_ANTG1 116515 13.073637
53 PLZ8_ANTG3 116515 13.073637
54 PLZ8_ANTG4 116515 13.073637
55 PLZ8_BAUMAX 116515 13.073637
56 PLZ8_HHZ 116515 13.073637
57 PLZ8_GBZ 116515 13.073637
18 VERS_TYP 111196 12.476816
15 SHOPPER_TYP 111196 12.476816
5 HEALTH_TYP 111196 12.476816
12 NATIONALITAET_KZ 108315 12.153551
13 PRAEGENDE_JUGENDJAHRE 108164 12.136608
50 KBA13_ANZAHL_PKW 105800 11.871354
26 ANZ_HAUSHALTE_AKTIV 99611 11.176913
35 CAMEO_INTL_2015 99352 11.147852
... ... ... ...
34 CAMEO_DEU_2015 99352 11.147852
6 LP_LEBENSPHASE_FEIN 97632 10.954859
58 ARBEIT 97375 10.926022
60 RELAT_AB 97375 10.926022
59 ORTSGR_KLS9 97274 10.914689
27 ANZ_HH_TITEL 97008 10.884842
7 LP_LEBENSPHASE_GROB 94572 10.611509
44 INNENSTADT 93740 10.518154
42 BALLRAUM 93740 10.518154
43 EWDICHTE 93740 10.518154
45 GEBAEUDETYP_RASTER 93155 10.452514
32 WOHNLAGE 93148 10.451729
31 OST_WEST_KZ 93148 10.451729
30 MIN_GEBAEUDEJAHR 93148 10.451729
28 GEBAEUDETYP 93148 10.451729
9 LP_FAMILIE_GROB 77792 8.728699
8 LP_FAMILIE_FEIN 77792 8.728699
29 KONSUMNAEHE 73969 8.299737
25 WOHNDAUER_2008 73499 8.247000
21 ANZ_TITEL 73499 8.247000
20 ANZ_PERSONEN 73499 8.247000
16 SOHO_KZ 73499 8.247000
22 HH_EINKOMMEN_SCORE 18348 2.058749
14 RETOURTYP_BK_S 4854 0.544646
11 LP_STATUS_GROB 4854 0.544646
10 LP_STATUS_FEIN 4854 0.544646
4 GFK_URLAUBERTYP 4854 0.544646
2 CJT_GESAMTTYP 4854 0.544646
48 ONLINE_AFFINITAET 4854 0.544646
1 ALTERSKATEGORIE_GROB 2881 0.323264

61 rows × 3 columns

In [19]:
#sorting null values by precentage
null_df.sort_values(by='Percentage', ascending=False)
Out[19]:
Attribute Count Percentage
17 TITEL_KZ 889061 99.757636
0 AGER_TYP 685843 76.955435
23 KK_KUNDENTYP 584612 65.596749
40 KBA05_BAUMAX 476524 53.468668
3 GEBURTSJAHR 392318 44.020282
19 ALTER_HH 310267 34.813699
46 KKK 158064 17.735668
49 REGIOTYP 158064 17.735668
24 W_KEIT_KIND_HH 147988 16.605084
41 KBA05_GBZ 133324 14.959701
47 MOBI_REGIO 133324 14.959701
39 KBA05_ANTG4 133324 14.959701
38 KBA05_ANTG3 133324 14.959701
37 KBA05_ANTG2 133324 14.959701
36 KBA05_ANTG1 133324 14.959701
55 PLZ8_BAUMAX 116515 13.073637
57 PLZ8_GBZ 116515 13.073637
56 PLZ8_HHZ 116515 13.073637
52 PLZ8_ANTG2 116515 13.073637
54 PLZ8_ANTG4 116515 13.073637
53 PLZ8_ANTG3 116515 13.073637
51 PLZ8_ANTG1 116515 13.073637
18 VERS_TYP 111196 12.476816
15 SHOPPER_TYP 111196 12.476816
5 HEALTH_TYP 111196 12.476816
12 NATIONALITAET_KZ 108315 12.153551
13 PRAEGENDE_JUGENDJAHRE 108164 12.136608
50 KBA13_ANZAHL_PKW 105800 11.871354
26 ANZ_HAUSHALTE_AKTIV 99611 11.176913
34 CAMEO_DEU_2015 99352 11.147852
... ... ... ...
35 CAMEO_INTL_2015 99352 11.147852
6 LP_LEBENSPHASE_FEIN 97632 10.954859
58 ARBEIT 97375 10.926022
60 RELAT_AB 97375 10.926022
59 ORTSGR_KLS9 97274 10.914689
27 ANZ_HH_TITEL 97008 10.884842
7 LP_LEBENSPHASE_GROB 94572 10.611509
44 INNENSTADT 93740 10.518154
42 BALLRAUM 93740 10.518154
43 EWDICHTE 93740 10.518154
45 GEBAEUDETYP_RASTER 93155 10.452514
32 WOHNLAGE 93148 10.451729
31 OST_WEST_KZ 93148 10.451729
30 MIN_GEBAEUDEJAHR 93148 10.451729
28 GEBAEUDETYP 93148 10.451729
8 LP_FAMILIE_FEIN 77792 8.728699
9 LP_FAMILIE_GROB 77792 8.728699
29 KONSUMNAEHE 73969 8.299737
25 WOHNDAUER_2008 73499 8.247000
21 ANZ_TITEL 73499 8.247000
20 ANZ_PERSONEN 73499 8.247000
16 SOHO_KZ 73499 8.247000
22 HH_EINKOMMEN_SCORE 18348 2.058749
14 RETOURTYP_BK_S 4854 0.544646
11 LP_STATUS_GROB 4854 0.544646
10 LP_STATUS_FEIN 4854 0.544646
4 GFK_URLAUBERTYP 4854 0.544646
2 CJT_GESAMTTYP 4854 0.544646
48 ONLINE_AFFINITAET 4854 0.544646
1 ALTERSKATEGORIE_GROB 2881 0.323264

61 rows × 3 columns

In [20]:
#histogram of count column
null_df['Count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Count');
In [21]:
#histogram of percentage column
null_df['Percentage'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Percentage');
In [22]:
# Investigate patterns in the amount of missing data in each column.
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');
In [23]:
# Removed the outlier columns from the dataset.

outliers = null_df.loc[null_df['Percentage'] > 20, 'Attribute'].tolist()
outliers
Out[23]:
['TITEL_KZ',
 'AGER_TYP',
 'KK_KUNDENTYP',
 'KBA05_BAUMAX',
 'GEBURTSJAHR',
 'ALTER_HH']
In [24]:
#dropping the columns
azdias = azdias.drop(columns=outliers)
azdias.shape #shows that 6 columns were dropped
Out[24]:
(891221, 79)
In [25]:
#dropping outlier rows from null df
null_df = null_df[~null_df['Attribute'].isin(outliers)]

#looking at the plot again of the remaining columns with nulls
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');

Discussion 1.1.2: Assess Missing Data in Each Column

Among the 85 columns in this dataset, 6 columns exhibited null values exceeding 20%. The majority of columns with missing values reported nulls in 8% to 18% of the fields. Additionally, several columns shared identical percentages of null values with each other.

Step 1.1.3: Assess Missing Data in Each Row

Next, I assessed missing data at the row level. I divided the dataset into two subsets: one with rows that exceed a threshold of missing values and another with rows below that threshold.

To determine how to handle outlier rows, I compared the distributions of non-missing data across both subsets. I selected at least five columns with little to no missing data and used Seaborn's countplot() and Matplotlib's subplot() to create side-by-side comparisons.

Based on the comparison results, I decided whether dropping the rows with many missing values would affect the analysis. Regardless, I continued the analysis using the subset with fewer missing values.

In [26]:
# How much data is missing in each row of the dataset?
azdias['row_null_count'] =   azdias.isnull().sum(axis=1)
In [27]:
azdias['row_null_count'].describe()
Out[27]:
count    891221.000000
mean          5.649894
std          13.234687
min           0.000000
25%           0.000000
50%           0.000000
75%           3.000000
max          49.000000
Name: row_null_count, dtype: float64
In [28]:
azdias['row_null_count'].value_counts()
Out[28]:
0     623209
47     45578
2      27926
43     27369
8      24592
5      22515
3      17629
1      15738
6      13771
7      13714
4      12607
34     10816
10      5410
15      4743
35      3911
13      3255
9       3042
33      2985
16      2505
14      2243
19      1180
11      1127
12       766
17       677
37       538
45       494
18       428
38       421
41       356
20       349
32       206
44       155
21       150
40       137
23       132
22       129
36        84
39        77
24        69
26        59
25        55
27        24
42        21
29        12
30         6
28         5
31         3
49         2
48         1
Name: row_null_count, dtype: int64
In [29]:
azdias['row_null_count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Null Counts Per Row');
In [30]:
#divide the data into two subsets based on the number of missing values in each row.

data_low = azdias.query('row_null_count <= 25')
data_high = azdias.query('row_null_count > 25')
In [31]:
#number of rows in the low subset
data_low.shape
Out[31]:
(797961, 80)
In [32]:
#number of rows in the high subset
data_high.shape
Out[32]:
(93260, 80)
In [33]:
#percentage of rows in the high subset
93269/8373929 * 100
Out[33]:
1.11380213517454
In [34]:
#compare the distribution of values for  columns where there are no or few missing values, between the two subsets.
#finding which columns have no null values
col_not_null = [col for col in azdias.columns if azdias[col].notnull().all()]
col_not_null
Out[34]:
['ANREDE_KZ',
 'FINANZ_MINIMALIST',
 'FINANZ_SPARER',
 'FINANZ_VORSORGER',
 'FINANZ_ANLEGER',
 'FINANZ_UNAUFFAELLIGER',
 'FINANZ_HAUSBAUER',
 'FINANZTYP',
 'GREEN_AVANTGARDE',
 'SEMIO_SOZ',
 'SEMIO_FAM',
 'SEMIO_REL',
 'SEMIO_MAT',
 'SEMIO_VERT',
 'SEMIO_LUST',
 'SEMIO_ERL',
 'SEMIO_KULT',
 'SEMIO_RAT',
 'SEMIO_KRIT',
 'SEMIO_DOM',
 'SEMIO_KAEM',
 'SEMIO_PFLICHT',
 'SEMIO_TRADV',
 'ZABEOTYP',
 'row_null_count']
In [35]:
#number of columns with no null
len(col_not_null)
Out[35]:
25
In [36]:
def compare_column_distribution(data1, data2, column):
    '''
    This function creates the same graph for two data sets side by side.
    
    data1 - first data set
    data2 - second data set
    column - name of the column
    '''
    
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    sns.countplot(x=column, data=data1)
    plt.title(f'Distribution in {column} data1')
    
    plt.subplot(1, 2, 2)
    sns.countplot(x=column, data=data2)
    plt.title(f'Distribution in {column} data2')
    
    plt.show();
In [37]:
#running a loop to input all of the columns through the function
for col in col_not_null:
    compare_column_distribution(data_low, data_high, col)

Discussion 1.1.3: Assess Missing Data in Each Row

The dataset exhibiting minimal or no missing values (25 or fewer) starkly differs from the dataset with a substantial number of missing values. The segregation of data induces notable changes in the distribution.

Step 1.2: Select and Re-Encode Features

It's essential to prepare the dataset for analysis by ensuring all features are encoded numerically. Although most values are numbers, not all represent numeric data. Here's how I approached this:

  • Numeric and Interval Data: These features were retained without changes.
  • Ordinal Data: I assumed these could be treated as interval data and kept them unchanged.
  • Categorical and Mixed-Type Data: I carefully examined these features, deciding whether to keep, drop, or re-encode them based on their nature.
  • After making these decisions, I created a new dataframe with the selected and engineered columns, ready for machine learning.

In [38]:
#creating copy of feat_info df

copy_feat = feat_info.copy()
copy_feat.shape
Out[38]:
(85, 4)
In [39]:
#dropping rows in new df to match the previous dropped columns

copy_feat = copy_feat[~copy_feat['attribute'].isin(outliers)]
copy_feat.shape
Out[39]:
(79, 4)
In [40]:
#how many features are there of each data type?

copy_feat.type.value_counts()
Out[40]:
ordinal        49
categorical    18
mixed           6
numeric         6
Name: type, dtype: int64
In [41]:
#creating the list
cat_to_check = []
mix_to_check = []
In [42]:
#appending to the list all the columns that will need to be checked
for index, row in copy_feat.iterrows():
    if row['type'] == 'categorical':
        cat_to_check.append(row['attribute'])
    elif row['type'] == 'mixed':
        mix_to_check.append(row['attribute'])
           
In [43]:
#verifying that the correct total has been added to the list

(len(cat_to_check), len(mix_to_check))
Out[43]:
(18, 6)
In [44]:
#assessing the categorical variables: which are binary, which are multi-level, and which one needs to be re-encoded?

#create empty lists
binary_feat = []
multi_feat = []
In [45]:
#seperating binary and multi-level categorical variables

for i in cat_to_check:
    if (data_low[i].nunique() > 2):
        multi_feat.append(i)
    else:
        binary_feat.append(i)
In [46]:
#checking the binary to find the one with non-numeric values
data_low[binary_feat].head()
Out[46]:
ANREDE_KZ GREEN_AVANTGARDE SOHO_KZ VERS_TYP OST_WEST_KZ
1 2 0 1.0 2.0 W
2 2 1 0.0 1.0 W
3 2 0 0.0 1.0 W
4 1 0 0.0 2.0 W
5 2 0 0.0 2.0 W
In [47]:
#checking the values
data_low.OST_WEST_KZ.unique()
Out[47]:
array(['W', 'O'], dtype=object)
In [48]:
#re-encode the values as numbers
data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py:5890: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
In [49]:
#checking that it worked
data_low.OST_WEST_KZ.unique()
Out[49]:
array([0, 1])
In [50]:
data_low[multi_feat].head()
Out[50]:
CJT_GESAMTTYP FINANZTYP GFK_URLAUBERTYP LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ SHOPPER_TYP ZABEOTYP GEBAEUDETYP CAMEO_DEUG_2015 CAMEO_DEU_2015
1 5.0 1 10.0 5.0 3.0 2.0 1.0 1.0 3.0 5 8.0 8 8A
2 3.0 1 10.0 1.0 1.0 3.0 2.0 1.0 2.0 5 1.0 4 4C
3 2.0 6 1.0 NaN NaN 9.0 4.0 1.0 1.0 3 1.0 2 2A
4 5.0 5 5.0 10.0 5.0 3.0 2.0 1.0 2.0 4 1.0 6 6B
5 2.0 2 1.0 1.0 1.0 4.0 2.0 1.0 0.0 4 1.0 8 8C
In [51]:
# Re-encode categorical variable(s) to be kept in the analysis.

data_low= data_low.drop(multi_feat, axis = 1)
data_low.shape
Out[51]:
(797961, 67)

Discussion 1.2.1: Re-Encode Categorical Features

To streamline the procedure all multi-feature category variables were excluded and dropped. Additionally, a binary variable was transformed into numeric values.

In [52]:
#checking which columns are mixed variables
mix_to_check
Out[52]:
['LP_LEBENSPHASE_FEIN',
 'LP_LEBENSPHASE_GROB',
 'PRAEGENDE_JUGENDJAHRE',
 'WOHNLAGE',
 'CAMEO_INTL_2015',
 'PLZ8_BAUMAX']
In [53]:
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.

data_low.PRAEGENDE_JUGENDJAHRE.value_counts()
Out[53]:
14.0    182968
8.0     141612
10.0     85800
5.0      84687
3.0      53841
15.0     42543
11.0     35751
9.0      33570
6.0      25652
12.0     24446
1.0      20678
4.0      20451
2.0       7479
13.0      5764
7.0       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
In [54]:
'''
From the data dictionary -
-  1: 40s - war years (Mainstream, E+W)
-  2: 40s - reconstruction years (Avantgarde, E+W)
-  3: 50s - economic miracle (Mainstream, E+W)
-  4: 50s - milk bar / Individualisation (Avantgarde, E+W)
-  5: 60s - economic miracle (Mainstream, E+W)
-  6: 60s - generation 68 / student protestors (Avantgarde, W)
-  7: 60s - opponents to the building of the Wall (Avantgarde, E)
-  8: 70s - family orientation (Mainstream, E+W)
-  9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)
'''
# will code Avantgard = 0 and Mainstream = 1
# 40s = 0, 50s = 1, 60s = 2, 70s = 3, 80s = 4, 90s = 5
Out[54]:
'\nFrom the data dictionary -\n-  1: 40s - war years (Mainstream, E+W)\n-  2: 40s - reconstruction years (Avantgarde, E+W)\n-  3: 50s - economic miracle (Mainstream, E+W)\n-  4: 50s - milk bar / Individualisation (Avantgarde, E+W)\n-  5: 60s - economic miracle (Mainstream, E+W)\n-  6: 60s - generation 68 / student protestors (Avantgarde, W)\n-  7: 60s - opponents to the building of the Wall (Avantgarde, E)\n-  8: 70s - family orientation (Mainstream, E+W)\n-  9: 70s - peace movement (Avantgarde, E+W)\n- 10: 80s - Generation Golf (Mainstream, W)\n- 11: 80s - ecological awareness (Avantgarde, W)\n- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)\n- 13: 80s - Swords into ploughshares (Avantgarde, E)\n- 14: 90s - digital media kids (Mainstream, E+W)\n- 15: 90s - ecological awareness (Avantgarde, E+W)\n'
In [55]:
decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
In [56]:
#creating new columns 
data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
In [57]:
#checking that the columns populated correctly
data_low.head()
Out[57]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB row_null_count DECADE MOVEMENT
1 1.0 2 1 5 2 5 4 5 0 3.0 ... 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 14.0 14.0
2 3.0 2 1 4 1 2 3 5 1 3.0 ... 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 15.0 15.0
3 4.0 2 4 2 5 2 1 2 0 2.0 ... 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 8.0 8.0
4 3.0 1 4 3 4 1 3 2 0 3.0 ... 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 8.0 8.0
5 1.0 2 3 1 5 2 2 5 0 3.0 ... 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 3.0 3.0

5 rows × 69 columns

In [58]:
#creating the maps for the lists in the dict
decade_map = {value: key for key, values in decade.items() for value in values}
movement_map= {value: key for key, values in movement.items() for value in values}
In [59]:
#replacing and checking decade column
data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
data_low['DECADE'].head()
Out[59]:
1    5.0
2    5.0
3    3.0
4    3.0
5    1.0
Name: DECADE, dtype: float64
In [60]:
#drop old column
data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
In [61]:
#replacing and checking decade column
data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
data_low['MOVEMENT'].head()
Out[61]:
1    1.0
2    0.0
3    1.0
4    1.0
5    1.0
Name: MOVEMENT, dtype: float64
In [62]:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
data_low['CAMEO_INTL_2015'].head()
Out[62]:
1    51
2    24
3    12
4    43
5    54
Name: CAMEO_INTL_2015, dtype: object
In [63]:
'''
From the data dictionary 
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families &  Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown
'''
Out[63]:
'\nFrom the data dictionary \nGerman CAMEO: Wealth / Life Stage Typology, mapped to international code\n- -1: unknown\n- 11: Wealthy Households - Pre-Family Couples & Singles\n- 12: Wealthy Households - Young Couples With Children\n- 13: Wealthy Households - Families With School Age Children\n- 14: Wealthy Households - Older Families &  Mature Couples\n- 15: Wealthy Households - Elders In Retirement\n- 21: Prosperous Households - Pre-Family Couples & Singles\n- 22: Prosperous Households - Young Couples With Children\n- 23: Prosperous Households - Families With School Age Children\n- 24: Prosperous Households - Older Families & Mature Couples\n- 25: Prosperous Households - Elders In Retirement\n- 31: Comfortable Households - Pre-Family Couples & Singles\n- 32: Comfortable Households - Young Couples With Children\n- 33: Comfortable Households - Families With School Age Children\n- 34: Comfortable Households - Older Families & Mature Couples\n- 35: Comfortable Households - Elders In Retirement\n- 41: Less Affluent Households - Pre-Family Couples & Singles\n- 42: Less Affluent Households - Young Couples With Children\n- 43: Less Affluent Households - Families With School Age Children\n- 44: Less Affluent Households - Older Families & Mature Couples\n- 45: Less Affluent Households - Elders In Retirement\n- 51: Poorer Households - Pre-Family Couples & Singles\n- 52: Poorer Households - Young Couples With Children\n- 53: Poorer Households - Families With School Age Children\n- 54: Poorer Households - Older Families & Mature Couples\n- 55: Poorer Households - Elders In Retirement\n- XX: unknown\n'
In [64]:
#creating new columns
data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
data_low.head()
Out[64]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB row_null_count DECADE MOVEMENT WEALTH LIFE_STAGE
1 1.0 2 1 5 2 5 4 5 0 3.0 ... 5.0 4.0 3.0 5.0 4.0 0 5.0 1.0 51 51
2 3.0 2 1 4 1 2 3 5 1 3.0 ... 4.0 4.0 3.0 5.0 2.0 0 5.0 0.0 24 24
3 4.0 2 4 2 5 2 1 2 0 2.0 ... 3.0 4.0 2.0 3.0 3.0 7 3.0 1.0 12 12
4 3.0 1 4 3 4 1 3 2 0 3.0 ... 3.0 3.0 4.0 6.0 5.0 0 3.0 1.0 43 43
5 1.0 2 3 1 5 2 2 5 0 3.0 ... 5.0 5.0 2.0 3.0 3.0 0 1.0 1.0 54 54

5 rows × 70 columns

In [65]:
#slicing the strings
data_low['WEALTH']= data_low['WEALTH'].astype(str).str[-1]
In [66]:
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
data_low['LIFE_STAGE'].head()
Out[66]:
1    1
2    4
3    2
4    3
5    4
Name: LIFE_STAGE, dtype: object
In [67]:
#checking the values
data_low['LIFE_STAGE'].unique()
Out[67]:
array(['1', '4', '2', '3', '5', 'an'], dtype=object)
In [68]:
#creating masks to fix the null values
m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
In [69]:
#fixing the null values
data_low.loc[m, 'LIFE_STAGE'] = np.nan
data_low.loc[ma, 'WEALTH'] = np.nan
In [70]:
#double checking 
data_low['LIFE_STAGE'].unique()
Out[70]:
array(['1', '4', '2', '3', '5', nan], dtype=object)
In [71]:
data_low['WEALTH'].unique()
Out[71]:
array(['1', '4', '2', '3', '5', nan], dtype=object)
In [72]:
#changing the type
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
data_low['WEALTH']= data_low['WEALTH'].astype(float)
In [73]:
#dropping the orginal column
data_low = data_low.drop(columns = 'CAMEO_INTL_2015')
In [74]:
#dropping mixed columns that have duplicate info
data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB'])
In [75]:
#checking the data type of this column 
data_low['WOHNLAGE'].describe()
Out[75]:
count    797961.000000
mean          4.053006
std           1.949352
min           0.000000
25%           3.000000
50%           3.000000
75%           5.000000
max           8.000000
Name: WOHNLAGE, dtype: float64
In [76]:
data_low['PLZ8_BAUMAX'].describe()
Out[76]:
count    774706.000000
mean          1.943913
std           1.459654
min           1.000000
25%           1.000000
50%           1.000000
75%           3.000000
max           5.000000
Name: PLZ8_BAUMAX, dtype: float64

Discussion 1.2.2: Engineer Mixed-Type Features

We generated two new columns, 'MOVEMENT' and 'DECADE,' from the original mixed feature "PRAEGENDE_JUGENDJAHRE." These new columns were created using data from the original feature and information from a data dictionary. In this process, we devised two dictionaries to specify the new values for these columns. 'MOVEMENT' was assigned a binary variable, 1 or 0, to distinguish between mainstream and avant-garde, while 'DECADE' was assigned an interval-type variable.

After verifying that the data had been correctly transferred, we removed the original "PRAEGENDE_JUGENDJAHRE" column.

Similarly, for the "CAMEO_INTL_2015" column, which contained two-digit integers representing 'WEALTH' and 'LIFE_STAGE,' we separated the tens-place to represent wealth and the ones-place value to represent life stage. Once we ensured the data was accurate, we addressed any null values and subsequently dropped the original "CAMEO_INTL_2015" column.

Per the data dictionary, the columns 'WOHNLAGE' and 'PLZ8_BAUMAX' are already an interval-type variable so these did not need to be changed. Columns 'LP_LEBENSPHASE_FEIN' and 'LP_LEBENSPHASE_GROB' include life stage infomation. These columns were dropped since we already have that information.

In [77]:
 data_low.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 797961 entries, 1 to 891220
Data columns (total 67 columns):
ALTERSKATEGORIE_GROB     795160 non-null float64
ANREDE_KZ                797961 non-null int64
FINANZ_MINIMALIST        797961 non-null int64
FINANZ_SPARER            797961 non-null int64
FINANZ_VORSORGER         797961 non-null int64
FINANZ_ANLEGER           797961 non-null int64
FINANZ_UNAUFFAELLIGER    797961 non-null int64
FINANZ_HAUSBAUER         797961 non-null int64
GREEN_AVANTGARDE         797961 non-null int64
HEALTH_TYP               761281 non-null float64
RETOURTYP_BK_S           793248 non-null float64
SEMIO_SOZ                797961 non-null int64
SEMIO_FAM                797961 non-null int64
SEMIO_REL                797961 non-null int64
SEMIO_MAT                797961 non-null int64
SEMIO_VERT               797961 non-null int64
SEMIO_LUST               797961 non-null int64
SEMIO_ERL                797961 non-null int64
SEMIO_KULT               797961 non-null int64
SEMIO_RAT                797961 non-null int64
SEMIO_KRIT               797961 non-null int64
SEMIO_DOM                797961 non-null int64
SEMIO_KAEM               797961 non-null int64
SEMIO_PFLICHT            797961 non-null int64
SEMIO_TRADV              797961 non-null int64
SOHO_KZ                  797961 non-null float64
VERS_TYP                 761281 non-null float64
ANZ_PERSONEN             797961 non-null float64
ANZ_TITEL                797961 non-null float64
HH_EINKOMMEN_SCORE       797961 non-null float64
W_KEIT_KIND_HH           738718 non-null float64
WOHNDAUER_2008           797961 non-null float64
ANZ_HAUSHALTE_AKTIV      791535 non-null float64
ANZ_HH_TITEL             794138 non-null float64
KONSUMNAEHE              797891 non-null float64
MIN_GEBAEUDEJAHR         797961 non-null float64
OST_WEST_KZ              797961 non-null int64
WOHNLAGE                 797961 non-null float64
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797369 non-null float64
EWDICHTE                 797369 non-null float64
INNENSTADT               797369 non-null float64
GEBAEUDETYP_RASTER       797954 non-null float64
KKK                      733146 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        793248 non-null float64
REGIOTYP                 733146 non-null float64
KBA13_ANZAHL_PKW         785412 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   793734 non-null float64
ORTSGR_KLS9              793835 non-null float64
RELAT_AB                 793734 non-null float64
row_null_count           797961 non-null int64
DECADE                   769252 non-null float64
MOVEMENT                 769252 non-null float64
WEALTH                   791840 non-null float64
LIFE_STAGE               791840 non-null float64
dtypes: float64(43), int64(24)
memory usage: 434.0 MB
In [78]:
#dropping row_null_count since it is no longer needed
data_low = data_low.drop(columns = 'row_null_count')
In [79]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    # convert missing value codes into NaNs, ...
    for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
        missing_values = miss.strip('[]').split(',')
        miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
        missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
        if missing_values != ['']:
            df[attribute] = df[attribute].replace(missing_values, np.nan)
    
    # remove selected columns and rows, ...
    outliers = ['TITEL_KZ',  'AGER_TYP',  'KK_KUNDENTYP',  'KBA05_BAUMAX',  'GEBURTSJAHR',  'ALTER_HH']
    
    df = df.drop(columns=outliers)
    
    df['row_null_count'] =   df.isnull().sum(axis=1)
          
    data_low = df.query('row_null_count <= 25')
        
    # select, re-encode, and engineer column values.
   
    #### correcting column 'OST_WEST_KZ'
    data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
    data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
    
    #### dropping multi-feature columns
    
    multi_feat = ['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN',
                  'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015',
                  'CAMEO_DEU_2015']
    
    data_low= data_low.drop(multi_feat, axis = 1)
    
    ###correcting "PRAEGENDE_JUGENDJAHRE"
    
    decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
    movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
    
    data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
    data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
    
    decade_map = {value: key for key, values in decade.items() for value in values}
    movement_map= {value: key for key, values in movement.items() for value in values}
    data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
    data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
    
    data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
    
    
    ####correcting  'CAMEO_INTL_2015'
    
    data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
    data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
    
    data_low['WEALTH']= data_low['WEALTH'].astype(str).str[-1]
    data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
    
    m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
    ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
    
    data_low.loc[m, 'LIFE_STAGE'] = np.nan
    data_low.loc[ma, 'WEALTH'] = np.nan
    
    data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
    data_low['WEALTH']= data_low['WEALTH'].astype(float)
    
    data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'CAMEO_INTL_2015', 'row_null_count'])
    
    ####fill nulls
    fill_null = Imputer(strategy='median')
    new_df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
    
          
    
    # Return the cleaned dataframe.
    return new_df
    

Step 2: Feature Transformation

Step 2.1: Apply Feature Scaling

Before applying dimensionality reduction, I performed feature scaling to ensure that the principal component vectors are not influenced by differences in feature scales. Using the `StandardScaler` from `sklearn`, I scaled each feature to have a mean of 0 and a standard deviation of 1.

Before scaling, I made sure the data was free of missing values. I considered different approaches, such as removing data points with missing values or applying imputation, and selected the method that best suited the amount of missing data in the dataset.

Finally, I used the `.fit_transform()` method to both fit the scaler and transform the data, ensuring that the scaling parameters are saved for later use on customer demographics data.

In [80]:
(data_low.shape[0], data_low.isna().sum().sum())
Out[80]:
(797961, 784520)
In [81]:
#will fill in the null values with the median value for the column
fill_null = Imputer(strategy='median')
In [82]:
#creating a new df with the null values filled in 
#used https://stackoverflow.com/questions/33660836/impute-entire-dataframe-all-columns-using-scikit-learn-sklearn-without-itera
# for syntax
df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
In [83]:
##checking to make sure the column names transfered over like expected
df.head()
Out[83]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB DECADE MOVEMENT WEALTH LIFE_STAGE
0 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 ... 1.0 5.0 4.0 3.0 5.0 4.0 5.0 1.0 1.0 1.0
1 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 ... 1.0 4.0 4.0 3.0 5.0 2.0 5.0 0.0 4.0 4.0
2 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 ... 1.0 3.0 4.0 2.0 3.0 3.0 3.0 1.0 2.0 2.0
3 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 ... 2.0 3.0 3.0 4.0 6.0 5.0 3.0 1.0 3.0 3.0
4 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 ... 1.0 5.0 5.0 2.0 3.0 3.0 1.0 1.0 4.0 4.0

5 rows × 66 columns

In [84]:
#checking to make sure there are no null values
df.isna().sum().sum()
Out[84]:
0
In [151]:
# Apply feature scaling to the general population demographics data.

s = StandardScaler()
s_df = s.fit(df)
s_df = s.transform(df)
s_df = pd.DataFrame(s_df, columns=df.columns)

Discussion 2.1: Apply Feature Scaling

Several fields within the dataset contained null values, making it impractical to remove all of them. To address this issue, an imputer was employed to replace the null values with the median value from the respective column in which the null value is located. This approach was chosen in an effort to minimize the potential impact on the subsequent analysis.

Step 2.2: Perform Dimensionality Reduction

Now that the data is scaled, I applied Principal Component Analysis (PCA) using `sklearn`'s PCA class. Initially, I computed all components or set the number of components to at least half the number of features to capture the general trend in variability.

I examined the explained variance ratio for each principal component and plotted the cumulative variance using Matplotlib's `plot()` function. Based on the results, I selected the optimal number of components to retain for clustering.

Finally, I re-fitted the PCA instance to transform the data according to the chosen number of components. This reduced dataset will be used for the next steps in the project.

In [86]:
# Apply PCA to the data.

pca = PCA()
fit_pca = pca.fit(s_df)
In [87]:
# Investigate the variance accounted for by each principal component.

fit_pca.explained_variance_ratio_
Out[87]:
array([  1.67540868e-01,   1.30954918e-01,   9.20538111e-02,
         5.42892085e-02,   3.67998700e-02,   3.11986776e-02,
         2.82075284e-02,   2.60515088e-02,   2.41331331e-02,
         2.25180962e-02,   2.04073361e-02,   1.98856745e-02,
         1.93761885e-02,   1.82758532e-02,   1.74221114e-02,
         1.59129820e-02,   1.51718009e-02,   1.51053854e-02,
         1.38727726e-02,   1.26732798e-02,   1.24719146e-02,
         1.19588949e-02,   1.12591296e-02,   1.11836029e-02,
         1.07881928e-02,   9.75875336e-03,   8.34306738e-03,
         7.63627770e-03,   7.51405976e-03,   7.20694726e-03,
         6.83566326e-03,   6.58537886e-03,   6.32265080e-03,
         6.30144501e-03,   6.05676985e-03,   5.76150818e-03,
         5.31724527e-03,   4.85865026e-03,   4.52805073e-03,
         4.34663869e-03,   4.14812609e-03,   3.97846688e-03,
         3.92382104e-03,   3.87133894e-03,   3.73531437e-03,
         3.68641667e-03,   3.36823989e-03,   3.08431672e-03,
         3.06347118e-03,   2.93730070e-03,   2.72532828e-03,
         2.62443466e-03,   2.52705951e-03,   2.47907070e-03,
         2.20607505e-03,   2.15555092e-03,   2.11096047e-03,
         1.87877035e-03,   1.85398874e-03,   1.71349741e-03,
         1.59687887e-03,   1.38681236e-03,   1.24691531e-03,
         8.12000096e-04,   3.61057236e-29,   1.30142117e-30])
In [88]:
#scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(fit_pca.explained_variance_ratio_.shape[0]), np.cumsum(fit_pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.ylabel("Variance Explained")
plt.show();
In [89]:
# Re-apply PCA to the data while selecting for number of components to retain.

pca = PCA(n_components = 30)
data_pca = pca.fit_transform(s_df)
In [90]:
data_pca
Out[90]:
array([[ 3.52742287, -2.9068797 , -2.99129266, ..., -0.59703226,
        -0.91727316, -0.38434441],
       [-0.89428618,  0.35842381, -3.01412002, ..., -0.51740868,
         0.29141884,  1.26841119],
       [-3.65306058,  1.38007524, -0.7385278 , ..., -0.84820399,
         0.08152276,  0.03016591],
       ..., 
       [-1.30792235, -3.40585942, -2.98604858, ..., -0.96157144,
         1.0709555 , -0.72954257],
       [ 6.04254242, -3.50755208,  2.30113938, ..., -0.49298219,
         0.26505233, -0.20169116],
       [-1.01301685,  1.00266843,  3.02767496, ...,  0.03946604,
        -0.10079208,  0.47063208]])
In [91]:
pca.explained_variance_ratio_.sum()
Out[91]:
0.87976410027503449

Discussion 2.2: Perform Dimensionality Reduction

Following the initial fitting of the data and the generation of a scree plot, the decision was made to utilize 30 principal components. This choice was made as additional components beyond 30 did not appear to significantly impact the results.

Step 2.3: Interpret Principal Components

With our transformed principal components, it’s helpful to examine the influence of each original feature on the first few components. Each principal component is a unit vector pointing in the direction of highest variance. Features with weights far from zero have the most influence on that component.

To analyze this, I mapped each weight to its corresponding feature, sorted the features by their weights, and focused on the most prominent features at both ends of the list. Features with large weights of the same sign suggest positive correlations, while opposite signs indicate negative correlations.

I explored the feature associations for the first three principal components by writing a function to print the sorted list of feature weights for any component. This function will be useful later when interpreting cluster tendencies. The data dictionary helped me understand the relationships between these prominent features and what their positive or negative values might signify.

In [92]:
#map weights for the first principal component to corresponding feature names and then print the linked values, sorted by weight.
def weights(pca, i):
    '''
    i = index for desired principle component
    '''
    weight_map = {}
    for j, feature in enumerate(s_df.columns):
        weight_map[feature] = pca.components_[i][j]
    
    sorted_weight = sorted(weight_map.items(), key=operator.itemgetter(1), reverse=True)
    
    return sorted_weight
In [93]:
#using function to map first principal component
first = weights(pca, 0)
first
Out[93]:
[('PLZ8_ANTG3', 0.22908592233336147),
 ('PLZ8_ANTG4', 0.22430357831286174),
 ('PLZ8_BAUMAX', 0.21992832498780263),
 ('ORTSGR_KLS9', 0.20416644554340607),
 ('EWDICHTE', 0.20166856630027075),
 ('HH_EINKOMMEN_SCORE', 0.19360376785230096),
 ('FINANZ_HAUSBAUER', 0.15927249871805191),
 ('KBA05_ANTG4', 0.15520407385406654),
 ('PLZ8_ANTG2', 0.15487533229296474),
 ('ARBEIT', 0.1431528879208778),
 ('RELAT_AB', 0.13606968495827093),
 ('ANZ_HAUSHALTE_AKTIV', 0.13541678544851257),
 ('KBA05_ANTG3', 0.13458849488234492),
 ('FINANZ_SPARER', 0.13386510585103048),
 ('SEMIO_PFLICHT', 0.10359297244155434),
 ('SEMIO_REL', 0.10112237251768988),
 ('MOVEMENT', 0.099453530609906804),
 ('DECADE', 0.089153748313561473),
 ('SEMIO_RAT', 0.084981645092439689),
 ('SEMIO_TRADV', 0.078566800877228643),
 ('SEMIO_FAM', 0.071230085229593176),
 ('SEMIO_MAT', 0.070871718076377418),
 ('FINANZ_UNAUFFAELLIGER', 0.064852514545607823),
 ('SEMIO_KULT', 0.064036160284686727),
 ('REGIOTYP', 0.063746057454362737),
 ('FINANZ_ANLEGER', 0.056196595305644047),
 ('OST_WEST_KZ', 0.053064061137824338),
 ('PLZ8_HHZ', 0.042926286017708468),
 ('SEMIO_KAEM', 0.041544637414166148),
 ('SEMIO_SOZ', 0.041011150183637886),
 ('W_KEIT_KIND_HH', 0.037897853187708969),
 ('ANZ_HH_TITEL', 0.035226048166157438),
 ('VERS_TYP', 0.033442637501989794),
 ('KKK', 0.032870685755780046),
 ('HEALTH_TYP', 0.027512268699893424),
 ('SEMIO_DOM', 0.025945318496043576),
 ('ANREDE_KZ', 0.0078367499807837637),
 ('KBA05_ANTG2', 0.0075041112858347213),
 ('SEMIO_KRIT', 0.0060535264256318519),
 ('SOHO_KZ', -0.0019865543819795714),
 ('ANZ_TITEL', -0.0027828401474620586),
 ('RETOURTYP_BK_S', -0.011302777243340438),
 ('SEMIO_VERT', -0.038988315846222678),
 ('ONLINE_AFFINITAET', -0.046600046591248764),
 ('MIN_GEBAEUDEJAHR', -0.048460561591964463),
 ('WOHNDAUER_2008', -0.054884103246612873),
 ('SEMIO_LUST', -0.063301529641645857),
 ('SEMIO_ERL', -0.067839693916990823),
 ('ANZ_PERSONEN', -0.076763293553209211),
 ('KBA13_ANZAHL_PKW', -0.07995246818126768),
 ('WOHNLAGE', -0.088931645976290233),
 ('GREEN_AVANTGARDE', -0.099453530609906832),
 ('FINANZ_VORSORGER', -0.10302650533608686),
 ('ALTERSKATEGORIE_GROB', -0.10720676129182649),
 ('GEBAEUDETYP_RASTER', -0.12095176307766102),
 ('WEALTH', -0.12950640476987929),
 ('LIFE_STAGE', -0.12950640476987929),
 ('BALLRAUM', -0.13257140142572491),
 ('INNENSTADT', -0.17031285044653363),
 ('PLZ8_GBZ', -0.17061042091490067),
 ('KONSUMNAEHE', -0.1713217763858923),
 ('FINANZ_MINIMALIST', -0.20932455729167729),
 ('KBA05_GBZ', -0.21375667441369556),
 ('KBA05_ANTG1', -0.21856006684624874),
 ('PLZ8_ANTG1', -0.22857048102401209),
 ('MOBI_REGIO', -0.23897465597929127)]
In [94]:
#map weights for the second principal component to corresponding feature names and then print the linked values, sorted by weight.
second = weights(pca, 1)
second
Out[94]:
[('ALTERSKATEGORIE_GROB', 0.26398124819457736),
 ('FINANZ_VORSORGER', 0.23706847930109262),
 ('SEMIO_ERL', 0.23157860619463669),
 ('SEMIO_LUST', 0.18656597244578674),
 ('RETOURTYP_BK_S', 0.16433822110328894),
 ('W_KEIT_KIND_HH', 0.11705869641467315),
 ('SEMIO_KRIT', 0.11327231214960157),
 ('SEMIO_KAEM', 0.1092681328338912),
 ('FINANZ_HAUSBAUER', 0.1080478133006918),
 ('ANREDE_KZ', 0.087906284345808494),
 ('EWDICHTE', 0.086305771467459644),
 ('ORTSGR_KLS9', 0.085258015514567492),
 ('PLZ8_ANTG3', 0.082374072496764414),
 ('PLZ8_ANTG4', 0.079154302418402545),
 ('FINANZ_MINIMALIST', 0.077722670304169517),
 ('PLZ8_BAUMAX', 0.076124483893556943),
 ('SEMIO_DOM', 0.069889950233133255),
 ('KBA05_ANTG4', 0.06378432363914173),
 ('WOHNDAUER_2008', 0.063366495939497364),
 ('ARBEIT', 0.061255174008392116),
 ('RELAT_AB', 0.059165486068992794),
 ('PLZ8_ANTG2', 0.056702504786110122),
 ('ANZ_HAUSHALTE_AKTIV', 0.055860048863868861),
 ('HH_EINKOMMEN_SCORE', 0.041806829731465718),
 ('KBA05_ANTG3', 0.038750746709004358),
 ('VERS_TYP', 0.030643150030845289),
 ('ANZ_HH_TITEL', 0.030005059753134455),
 ('OST_WEST_KZ', 0.022614272622946047),
 ('PLZ8_HHZ', 0.011122230775684786),
 ('WEALTH', 0.011090605122836631),
 ('LIFE_STAGE', 0.011090605122836631),
 ('REGIOTYP', 0.011038567872250236),
 ('ANZ_TITEL', 0.0079108194866564542),
 ('MOVEMENT', 0.0048038907451233466),
 ('SOHO_KZ', -0.0017391599126611557),
 ('GREEN_AVANTGARDE', -0.0048038907451233483),
 ('KKK', -0.0059973373656140817),
 ('KBA05_ANTG2', -0.0094530199544385057),
 ('KBA13_ANZAHL_PKW', -0.032402604663517937),
 ('GEBAEUDETYP_RASTER', -0.039603846636620831),
 ('MIN_GEBAEUDEJAHR', -0.046191948169849652),
 ('WOHNLAGE', -0.05356132942026897),
 ('ANZ_PERSONEN', -0.055903449141959216),
 ('BALLRAUM', -0.057572673359885147),
 ('HEALTH_TYP', -0.058658382123241447),
 ('KBA05_ANTG1', -0.062278266847252878),
 ('KONSUMNAEHE', -0.063423644793314693),
 ('PLZ8_GBZ', -0.063484071410495585),
 ('SEMIO_VERT', -0.06585480397516906),
 ('INNENSTADT', -0.069578033577379364),
 ('KBA05_GBZ', -0.074824639506369126),
 ('MOBI_REGIO', -0.074923911281750452),
 ('PLZ8_ANTG1', -0.079509583926499497),
 ('SEMIO_SOZ', -0.10247130202515178),
 ('ONLINE_AFFINITAET', -0.16074095503245944),
 ('SEMIO_MAT', -0.16558999454827047),
 ('SEMIO_RAT', -0.17447791221172163),
 ('SEMIO_FAM', -0.18547491392053894),
 ('FINANZ_ANLEGER', -0.2093922625861343),
 ('SEMIO_KULT', -0.22068774641907682),
 ('FINANZ_UNAUFFAELLIGER', -0.22900715423366935),
 ('SEMIO_PFLICHT', -0.23189856678028856),
 ('SEMIO_TRADV', -0.23397563819516926),
 ('FINANZ_SPARER', -0.24177359474772939),
 ('DECADE', -0.25541667247665412),
 ('SEMIO_REL', -0.25759039406240308)]
In [95]:
#map weights for the third principal component to corresponding feature names
third = weights(pca, 2)
third
Out[95]:
[('SEMIO_VERT', 0.34603193736030574),
 ('SEMIO_SOZ', 0.26221075542260047),
 ('SEMIO_FAM', 0.24964240292722584),
 ('SEMIO_KULT', 0.23528270980109112),
 ('FINANZ_MINIMALIST', 0.16020212188483562),
 ('RETOURTYP_BK_S', 0.1069412701183394),
 ('FINANZ_VORSORGER', 0.10025983607990267),
 ('W_KEIT_KIND_HH', 0.087891388133503584),
 ('ALTERSKATEGORIE_GROB', 0.078045068648507243),
 ('SEMIO_REL', 0.069649587968750287),
 ('SEMIO_LUST', 0.063973029414174057),
 ('SEMIO_MAT', 0.055925739128069696),
 ('GREEN_AVANTGARDE', 0.054456994865036717),
 ('ORTSGR_KLS9', 0.045805576481800322),
 ('PLZ8_BAUMAX', 0.045381780080264451),
 ('EWDICHTE', 0.044859211258721804),
 ('PLZ8_ANTG4', 0.042532364844569834),
 ('PLZ8_ANTG3', 0.041498385917121169),
 ('WOHNDAUER_2008', 0.033465984391002096),
 ('ARBEIT', 0.032177798324851369),
 ('RELAT_AB', 0.02949374971255158),
 ('PLZ8_ANTG2', 0.027473140050248871),
 ('KBA05_ANTG4', 0.024510079339828284),
 ('ANZ_HAUSHALTE_AKTIV', 0.021618211743664257),
 ('OST_WEST_KZ', 0.013687775986077705),
 ('ANZ_HH_TITEL', 0.01270711431726565),
 ('ANZ_TITEL', 0.010203516098173456),
 ('KBA05_ANTG3', 0.0057225095769092718),
 ('PLZ8_HHZ', 0.0038961912292575151),
 ('VERS_TYP', 0.0015124679071747496),
 ('SOHO_KZ', 0.00030326900707413002),
 ('ANZ_PERSONEN', -0.0055473837499681287),
 ('WEALTH', -0.0060686963295723019),
 ('LIFE_STAGE', -0.0060686963295723019),
 ('REGIOTYP', -0.0093004144740783858),
 ('KBA05_ANTG2', -0.010978895285893147),
 ('KBA05_ANTG1', -0.015602319591368555),
 ('MIN_GEBAEUDEJAHR', -0.015989697578717554),
 ('KKK', -0.019424265524786512),
 ('KBA05_GBZ', -0.020306368883477019),
 ('HEALTH_TYP', -0.02133629345567974),
 ('KBA13_ANZAHL_PKW', -0.022204229912108343),
 ('WOHNLAGE', -0.022575331255141676),
 ('MOBI_REGIO', -0.022931552591587464),
 ('HH_EINKOMMEN_SCORE', -0.026719570593222802),
 ('GEBAEUDETYP_RASTER', -0.029272065277585711),
 ('BALLRAUM', -0.034432908214344991),
 ('KONSUMNAEHE', -0.035714140499739734),
 ('PLZ8_GBZ', -0.035745117484970786),
 ('INNENSTADT', -0.04172654862898259),
 ('PLZ8_ANTG1', -0.04211108898452906),
 ('FINANZ_HAUSBAUER', -0.047569046914246516),
 ('ONLINE_AFFINITAET', -0.050425330774912631),
 ('MOVEMENT', -0.054456994865036724),
 ('SEMIO_TRADV', -0.077676098704074797),
 ('SEMIO_PFLICHT', -0.078435070103864801),
 ('FINANZ_UNAUFFAELLIGER', -0.097802282982882849),
 ('DECADE', -0.10534726144201183),
 ('FINANZ_SPARER', -0.10666619622703408),
 ('SEMIO_ERL', -0.1770974583807628),
 ('FINANZ_ANLEGER', -0.18976424945433035),
 ('SEMIO_RAT', -0.21664439490992132),
 ('SEMIO_KRIT', -0.2766041774758306),
 ('SEMIO_DOM', -0.31311834574445535),
 ('SEMIO_KAEM', -0.33716635393319166),
 ('ANREDE_KZ', -0.36853895173068496)]

Discussion 2.3: Interpret Principal Components

The first principal component, PLZ8_ANTG3, represents the number of 6-10 family (large family) houses in the region and has a positive influence. On the other hand, the feature with the most negative weight, MOBI_REGIO, characterizes movement patterns.

In the second principal component, the most positively weighted feature is ALTERSKATEGORIE_GROB, which estimates age based on an analysis of their name. Conversely, the feature with the most negative weight in this component is SEMIO_REL, which identifies their religious personality typology.

The third principal component is characterized by the feature with the largest positive weight, SEMIO_VERT, representing a dreamful personality typology. The feature with the largest negative weight in this component is ANREDE_KZ, which corresponds to gender.

Step 3: Clustering

Step 3.1: Apply Clustering to General Population

K-Means Clustering Process

Now that the data is scaled and transformed, it's time to apply k-means clustering to see how the data groups in the principal components space. Here's the approach I took:

  • 1. K-Means Clustering - I used sklearn's KMeans class to perform clustering on the PCA-transformed data.
  • 2. Evaluate Cluster Quality - For each cluster count, I calculated the average distance from each point to its assigned cluster center. I used the .score() method, keeping in mind that sklearn's scores are usually better when larger, so I reversed the sign to interpret the results correctly.
  • 3. Determine Optimal Cluster Count - I tested various cluster counts, observing how the average distance decreases with more clusters. The goal was to find the point where adding more clusters yields diminishing returns, and I chose the optimal cluster count based on this analysis.
  • 4. Final Clustering - Once I determined the final number of clusters, I re-fitted the KMeans instance and saved the cluster assignments for use in the next steps.

This process will help identify the natural groupings in the data and set the stage for further analysis.

In [96]:
# Over a number of different cluster counts...
k = list(range(5, 20, 2)) #list 5-19 odd

def get_kmeans(data, k):
    # run k-means clustering on the data and...
    kmeans = KMeans(n_clusters=k)
    model = kmeans.fit(data)
    # compute the average within-cluster distances.
    score = np.abs(model.score(data))
    
    return score
In [97]:
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.

scores = []

for i in k:
    scores.append(get_kmeans(data_pca, i))
In [98]:
#scree plot
plt.plot(k, scores, linestyle='solid', marker='o', color='g');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
In [194]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=15)
model = kmeans.fit(data_pca)
gen_pop_predict = model.predict(data_pca)
In [195]:
#finding cluster centers
cluster_centers = model.cluster_centers_
In [196]:
#scatter plot of the data points for each cluster
plt.figure(figsize=(8, 6))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=gen_pop_predict, cmap='PRGn')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='o', s=50, label='Cluster Centers')
plt.title('K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend();

Discussion 3.1: Apply Clustering to General Population

According to the scree plot, it becomes evident that having 15 clusters is the point at which additional clusters cease to provide any significant benefits.

Step 3.2: Apply All Steps to the Customer Data

Now that the clusters and cluster centers have been identified for the general population, the next step is to map the customer data onto these clusters. This process involves applying the transformations and fits from the general population to the customer data without re-fitting the models

In [197]:
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv', sep=';')

customers.head()
Out[197]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 ... 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0

5 rows × 85 columns

In [198]:
customers.isna().sum().sum()
Out[198]:
2252274
In [199]:
customers.shape[0]
Out[199]:
191652
In [200]:
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.

clean_customers = clean_data(customers)

clean_customers.head()
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py:5890: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
Out[200]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB DECADE MOVEMENT WEALTH LIFE_STAGE
0 4.0 1.0 5.0 1.0 5.0 1.0 2.0 2.0 1.0 1.0 ... 1.0 5.0 5.0 1.0 2.0 1.0 1.0 0.0 3.0 3.0
1 4.0 2.0 5.0 1.0 5.0 1.0 4.0 4.0 1.0 2.0 ... 3.0 3.0 2.0 3.0 5.0 3.0 1.0 0.0 4.0 4.0
2 4.0 1.0 5.0 1.0 5.0 2.0 1.0 2.0 0.0 2.0 ... 1.0 3.0 4.0 1.0 3.0 1.0 0.0 1.0 4.0 4.0
3 3.0 1.0 3.0 1.0 4.0 4.0 5.0 2.0 0.0 3.0 ... 2.0 3.0 3.0 3.0 5.0 1.0 3.0 1.0 1.0 1.0
4 3.0 1.0 5.0 1.0 5.0 1.0 2.0 3.0 1.0 3.0 ... 1.0 5.0 5.0 3.0 7.0 5.0 1.0 0.0 4.0 4.0

5 rows × 66 columns

In [201]:
clean_customers.shape[0]
Out[201]:
141713
In [202]:
#scale
cust_s = s.transform(clean_customers)
cust_s = pd.DataFrame(cust_s, columns=clean_customers.columns)
In [203]:
cust_s.head()
Out[203]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB DECADE MOVEMENT WEALTH LIFE_STAGE
0 1.184102 -1.043914 1.409077 -1.155482 1.138133 -1.249960 -0.470299 -0.791347 1.885219 -1.610602 ... -0.633329 1.430429 1.485680 -2.173492 -1.432853 -1.526014 -1.619800 -1.885219 0.085189 0.085189
1 1.184102 0.957933 1.409077 -1.155482 1.138133 -1.249960 0.958818 0.629041 1.885219 -0.257026 ... 0.748876 -0.648841 -1.248668 -0.166432 -0.127075 -0.052103 -1.619800 -1.885219 0.761271 0.761271
2 1.184102 -1.043914 1.409077 -1.155482 1.138133 -0.570958 -1.184858 -0.791347 -0.530442 -0.257026 ... -0.633329 -0.648841 0.574231 -2.173492 -0.997594 -1.526014 -2.317978 0.530442 0.761271 0.761271
3 0.200529 -1.043914 -0.042764 -1.155482 0.411859 0.787046 1.673376 -0.791347 -0.530442 1.096551 ... 0.057773 -0.648841 -0.337218 -0.166432 -0.127075 -1.526014 -0.223444 0.530442 -1.266976 -1.266976
4 0.200529 -1.043914 1.409077 -1.155482 1.138133 -1.249960 -0.470299 -0.081153 1.885219 1.096551 ... -0.633329 1.430429 1.485680 -0.166432 0.743444 1.421808 -1.619800 -1.885219 0.761271 0.761271

5 rows × 66 columns

In [204]:
#tranforming customers using pca
pca_customers = pca.transform(cust_s)
pca_customers
Out[204]:
array([[-5.64554919,  1.07684711,  4.35430184, ..., -0.34901697,
        -0.0066858 ,  0.11374425],
       [-1.1509697 ,  5.11332146, -0.57458706, ...,  0.26272396,
         0.63083949, -1.01330661],
       [-4.56911491,  0.41370031,  2.71558137, ..., -0.46464984,
        -0.45299421, -0.54732898],
       ..., 
       [-4.28081253,  2.91133888,  2.50369361, ...,  0.99851939,
         0.9955746 ,  0.14542296],
       [-0.92848765,  2.84894987, -1.9673653 , ..., -0.78756383,
         0.65353315,  0.49888096],
       [-3.95904599, -2.6956019 ,  2.32073937, ..., -0.85446533,
         0.38935444,  1.36706512]])
In [205]:
#predicting clusters
cust_predict = model.predict(pca_customers)
cust_predict
Out[205]:
array([11,  3,  2, ..., 11,  4,  2], dtype=int32)
In [206]:
cust_predict.shape
Out[206]:
(141713,)

Step 3.3: Compare Customer Data to Demographics Data

The final task is to compare the cluster distributions between the general population and the customer base to identify which clusters are overrepresented or underrepresented among the customers.

In [207]:
#plots showing clusters
#used for syntax https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))

sns.countplot(gen_pop_predict, ax=ax1)
ax1.set_title('General Clusters')
sns.countplot(cust_predict, ax=ax2)
ax2.set_title('Customer Clusters');
In [218]:
#What kinds of people are part of a cluster that is overrepresented in the customer data compared to the general population?
#used for syntax https://stackoverflow.com/questions/49885007/how-to-use-scikit-learn-inverse-transform-with-new-values
cluster_11 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[11]))
In [219]:
over = pd.Series(data = cluster_11, index=clean_customers.columns)
over
Out[219]:
ALTERSKATEGORIE_GROB       3.285754
ANREDE_KZ                  1.049813
FINANZ_MINIMALIST          4.818104
FINANZ_SPARER              1.639848
FINANZ_VORSORGER           4.172732
FINANZ_ANLEGER             1.472954
FINANZ_UNAUFFAELLIGER      2.171839
FINANZ_HAUSBAUER           1.902001
GREEN_AVANTGARDE           0.972321
HEALTH_TYP                 2.056336
RETOURTYP_BK_S             3.960810
SEMIO_SOZ                  4.861855
SEMIO_FAM                  4.861142
SEMIO_REL                  3.766625
SEMIO_MAT                  3.940147
SEMIO_VERT                 6.235738
SEMIO_LUST                 5.286958
SEMIO_ERL                  4.070679
SEMIO_KULT                 4.898366
SEMIO_RAT                  2.687683
SEMIO_KRIT                 3.126071
SEMIO_DOM                  3.363179
SEMIO_KAEM                 2.650101
SEMIO_PFLICHT              3.246229
SEMIO_TRADV                3.161888
SOHO_KZ                    0.008696
VERS_TYP                   1.530339
ANZ_PERSONEN               2.206120
ANZ_TITEL                  0.000312
HH_EINKOMMEN_SCORE         2.456882
                            ...    
OST_WEST_KZ                0.103911
WOHNLAGE                   3.013820
KBA05_ANTG1                2.666454
KBA05_ANTG2                1.138174
KBA05_ANTG3                0.078994
KBA05_ANTG4                0.005810
KBA05_GBZ                  4.031641
BALLRAUM                   3.972742
EWDICHTE                   4.065986
INNENSTADT                 4.538643
GEBAEUDETYP_RASTER         3.955746
KKK                        2.161797
MOBI_REGIO                 4.085188
ONLINE_AFFINITAET          3.094526
REGIOTYP                   3.490836
KBA13_ANZAHL_PKW         690.237486
PLZ8_ANTG1                 2.757443
PLZ8_ANTG2                 2.616766
PLZ8_ANTG3                 1.221891
PLZ8_ANTG4                 0.400610
PLZ8_BAUMAX                1.260107
PLZ8_HHZ                   3.725018
PLZ8_GBZ                   3.891678
ARBEIT                     3.066084
ORTSGR_KLS9                5.529081
RELAT_AB                   2.950813
DECADE                     2.620571
MOVEMENT                   0.027679
WEALTH                     3.614265
LIFE_STAGE                 3.614265
Length: 66, dtype: float64
In [216]:
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_10 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[10]))
In [217]:
under = pd.Series(data = cluster_10, index=clean_customers.columns)
under
Out[217]:
ALTERSKATEGORIE_GROB       3.316416
ANREDE_KZ                  1.051005
FINANZ_MINIMALIST          3.529760
FINANZ_SPARER              1.975809
FINANZ_VORSORGER           4.199857
FINANZ_ANLEGER             1.815016
FINANZ_UNAUFFAELLIGER      1.888913
FINANZ_HAUSBAUER           3.233434
GREEN_AVANTGARDE           0.033785
HEALTH_TYP                 2.373274
RETOURTYP_BK_S             4.258308
SEMIO_SOZ                  4.818082
SEMIO_FAM                  5.106048
SEMIO_REL                  3.806917
SEMIO_MAT                  4.155899
SEMIO_VERT                 5.825029
SEMIO_LUST                 4.943657
SEMIO_ERL                  4.231193
SEMIO_KULT                 4.962274
SEMIO_RAT                  2.645156
SEMIO_KRIT                 3.673223
SEMIO_DOM                  3.401704
SEMIO_KAEM                 2.818262
SEMIO_PFLICHT              3.364461
SEMIO_TRADV                2.899784
SOHO_KZ                    0.008312
VERS_TYP                   1.648778
ANZ_PERSONEN               1.402718
ANZ_TITEL                  0.000301
HH_EINKOMMEN_SCORE         5.125255
                            ...    
OST_WEST_KZ                0.167080
WOHNLAGE                   3.587525
KBA05_ANTG1                0.844066
KBA05_ANTG2                1.913849
KBA05_ANTG3                1.034797
KBA05_ANTG4                0.224023
KBA05_GBZ                  2.736385
BALLRAUM                   4.265283
EWDICHTE                   4.125782
INNENSTADT                 4.632602
GEBAEUDETYP_RASTER         3.768550
KKK                        2.861523
MOBI_REGIO                 2.423838
ONLINE_AFFINITAET          1.872489
REGIOTYP                   4.752699
KBA13_ANZAHL_PKW         673.140593
PLZ8_ANTG1                 2.136078
PLZ8_ANTG2                 3.172382
PLZ8_ANTG3                 1.781339
PLZ8_ANTG4                 0.666453
PLZ8_BAUMAX                1.760604
PLZ8_HHZ                   3.819944
PLZ8_GBZ                   3.562750
ARBEIT                     3.376590
ORTSGR_KLS9                5.514092
RELAT_AB                   3.374855
DECADE                     2.533608
MOVEMENT                   0.966215
WEALTH                     2.812114
LIFE_STAGE                 2.812114
Length: 66, dtype: float64

Discussion 3.3: Compare Customer Data to Demographics Data

The same groups that seem to over represented in the customer data (cluster 11) are the same top groups in the under represented data (cluster 10):

ALTERSKATEGORIE_GROB - Estimated age based on given name analysis - 46 - 60 year olds

ANREDE_KZ - Gender - Male

FINANZ_MINIMALIST - MINIMALIST: low financial interest Avg - Lowest

In [ ]: