In this project, I will use unsupervised learning techniques to identify key customer segments for a German mail-order sales company. The goal is to target marketing campaigns more effectively by focusing on the most promising segments. The data is provided by Bertelsmann Arvato Analytics, representing a real-world data science challenge.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import operator
from sklearn.cluster import KMeans
# magic word for producing visualizations in notebook
%matplotlib inline
There are four files associated with this project (not including this one):

- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but each row also includes information beyond the individual, such as their household, building, and neighborhood. This information is used to cluster the general population into groups with similar demographic properties, and then to see how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data compared to the general population; those over-represented clusters can be assumed to form part of the core userbase. This information can then be used for further applications, such as targeting a marketing campaign.
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
# Check the structure of the data after it's loaded
azdias.shape
azdias.describe()
azdias.head(5)
feat_info.shape
feat_info.describe()
feat_info.head(40)
I started by evaluating the demographics data for missing values, using the feature summary file to analyze each column and guide my cleaning decisions. Observations and decisions are documented in the discussion cells throughout.
Next, I converted any 'missing' or 'unknown' value codes from the data dictionary into NaN values for consistency. This required parsing the data since missing codes were read as strings. I also noted how much data was flagged as missing or unknown compared to naturally missing data.
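For example, here is a minimal sketch of how a single missing_or_unknown entry is parsed; the entry '[-1,X]' is an assumed example of the bracketed string format in the feature summary:
# parse one missing_or_unknown entry into a list of codes (non-numeric codes stay as strings)
sample = '[-1,X]'
parsed = [int(v) if v not in ('X', 'XX', '') else v for v in sample.strip('[]').split(',')]
parsed  # -> [-1, 'X']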
#values missing before conversion
azdias.isna().sum().sum()
# Identify missing or unknown data values and convert them to NaNs.
for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
missing_values = miss.strip('[]').split(',')
miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
if missing_values != ['']:
azdias[attribute] = azdias[attribute].replace(missing_values, np.nan)
#values missing after conversion
azdias.isna().sum().sum()
I used Matplotlib's hist() function to visualize the distribution of missing values across columns, which helped me identify outlier columns with a high proportion of missing data. I decided to remove these columns from the dataframe, documenting any notable findings about them in the discussion section. Additionally, I examined the remaining features for patterns in missing data, such as groups of columns that share similar missing-value counts.
#Perform an assessment of how much missing data there is in each column of the dataset.
#columns with null values
col_null = [col for col in azdias.columns if azdias[col].isna().any()]
col_null
#getting the number of missing null values for those columns
col_null_counts = {col: azdias[col].isna().sum() for col in col_null}
col_null_counts
# Investigate patterns in the amount of missing data in each column.
#% of missing data for each column
p_col_null = {col: (azdias[col].isna().mean() * 100) for col in col_null}
p_col_null
#converting to df to make it easier to merge and plot
df1 = pd.DataFrame(list(col_null_counts.items()), columns=['Attribute', 'Count'])
df2 = pd.DataFrame(list(p_col_null.items()), columns=['Attribute', 'Percentage'])
df2.shape
#merging into one null df
null_df = pd.merge(df1, df2, on='Attribute')
null_df.head()
#sorting the null values by count
null_df = null_df.sort_values(by='Count', ascending=False)
null_df
#sorting null values by percentage
null_df.sort_values(by='Percentage', ascending=False)
#histogram of count column
null_df['Count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Count');
#histogram of percentage column
null_df['Percentage'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Percentage');
# Investigate patterns in the amount of missing data in each column.
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');
# Removed the outlier columns from the dataset.
outliers = null_df.loc[null_df['Percentage'] > 20, 'Attribute'].tolist()
outliers
#dropping the columns
azdias = azdias.drop(columns=outliers)
azdias.shape #shows that 6 columns were dropped
#dropping outlier rows from null df
null_df = null_df[~null_df['Attribute'].isin(outliers)]
#looking at the plot again of the remaining columns with nulls
null_df.plot.bar(x='Attribute', y='Percentage', figsize=(18, 8))
plt.xlabel('Attribute')
plt.ylabel('Percentage')
plt.title('Bar Graph of Percentage');
Among the 85 columns in this dataset, 6 columns exhibited null values exceeding 20%. The majority of columns with missing values reported nulls in 8% to 18% of the fields. Additionally, several columns shared identical percentages of null values with each other.
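To back up the observation about shared percentages, a quick optional check (a sketch using the null_df built above, not part of the original cleaning pipeline) groups columns by their exact null percentage:
#optional check: columns that share the exact same percentage of nulls
shared_nulls = null_df.groupby('Percentage')['Attribute'].apply(list)
shared_nulls[shared_nulls.apply(len) > 1]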
Next, I assessed missing data at the row level. I divided the dataset into two subsets: one with rows that exceed a threshold of missing values and another with rows below that threshold.
To determine how to handle outlier rows, I compared the distributions of non-missing data across both subsets. I selected at least five columns with little to no missing data and used Seaborn's countplot() and Matplotlib's subplot() to create side-by-side comparisons.
Based on the comparison results, I decided whether dropping the rows with many missing values would affect the analysis. Regardless, I continued the analysis using the subset with fewer missing values.
# How much data is missing in each row of the dataset?
azdias['row_null_count'] = azdias.isnull().sum(axis=1)
azdias['row_null_count'].describe()
azdias['row_null_count'].value_counts()
azdias['row_null_count'].hist()
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram of Null Counts Per Row');
#divide the data into two subsets based on the number of missing values in each row.
data_low = azdias.query('row_null_count <= 25')
data_high = azdias.query('row_null_count > 25')
#number of rows in the low subset
data_low.shape
#number of rows in the high subset
data_high.shape
#percentage of rows in the high subset
data_high.shape[0] / azdias.shape[0] * 100
#compare the distribution of values for columns where there are no or few missing values, between the two subsets.
#finding which columns have no null values
col_not_null = [col for col in azdias.columns if azdias[col].notnull().all()]
col_not_null
#number of columns with no null
len(col_not_null)
def compare_column_distribution(data1, data2, column):
'''
This function creates the same graph for two data sets side by side.
data1 - first data set
data2 - second data set
column - name of the column
'''
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(x=column, data=data1)
plt.title(f'Distribution in {column} data1')
plt.subplot(1, 2, 2)
sns.countplot(x=column, data=data2)
plt.title(f'Distribution in {column} data2')
plt.show();
#running a loop to input all of the columns through the function
for col in col_not_null:
compare_column_distribution(data_low, data_high, col)
The subset with few or no missing values per row (25 or fewer) differs starkly from the subset with many missing values: for several of the columns compared above, the two subsets show noticeably different value distributions.
It's essential to prepare the dataset for analysis by ensuring all features are encoded numerically. Although most values are numbers, not all of them represent numeric data, so I assessed the categorical and mixed-type features and decided which to keep as-is, which to re-encode, and which to drop.
After making these decisions, I created a new dataframe with the selected and engineered columns, ready for machine learning.
#creating copy of feat_info df
copy_feat = feat_info.copy()
copy_feat.shape
#dropping rows in new df to match the previous dropped columns
copy_feat = copy_feat[~copy_feat['attribute'].isin(outliers)]
copy_feat.shape
#how many features are there of each data type?
copy_feat.type.value_counts()
#creating the list
cat_to_check = []
mix_to_check = []
#appending to the list all the columns that will need to be checked
for index, row in copy_feat.iterrows():
if row['type'] == 'categorical':
cat_to_check.append(row['attribute'])
elif row['type'] == 'mixed':
mix_to_check.append(row['attribute'])
#verifying that the correct total has been added to the list
(len(cat_to_check), len(mix_to_check))
#assessing the categorical variables: which are binary, which are multi-level, and which one needs to be re-encoded?
#create empty lists
binary_feat = []
multi_feat = []
#separating binary and multi-level categorical variables
for i in cat_to_check:
if (data_low[i].nunique() > 2):
multi_feat.append(i)
else:
binary_feat.append(i)
#checking the binary to find the one with non-numeric values
data_low[binary_feat].head()
#checking the values
data_low.OST_WEST_KZ.unique()
#re-encode the values as numbers
data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
#checking that it worked
data_low.OST_WEST_KZ.unique()
data_low[multi_feat].head()
# Re-encode categorical variable(s) to be kept in the analysis.
data_low= data_low.drop(multi_feat, axis = 1)
data_low.shape
To streamline the procedure, all multi-level categorical variables were dropped. Additionally, the binary variable OST_WEST_KZ was re-encoded with numeric values.
#checking which columns are mixed variables
mix_to_check
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
data_low.PRAEGENDE_JUGENDJAHRE.value_counts()
'''
From the data dictionary -
- 1: 40s - war years (Mainstream, E+W)
- 2: 40s - reconstruction years (Avantgarde, E+W)
- 3: 50s - economic miracle (Mainstream, E+W)
- 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
- 5: 60s - economic miracle (Mainstream, E+W)
- 6: 60s - generation 68 / student protestors (Avantgarde, W)
- 7: 60s - opponents to the building of the Wall (Avantgarde, E)
- 8: 70s - family orientation (Mainstream, E+W)
- 9: 70s - peace movement (Avantgarde, E+W)
- 10: 80s - Generation Golf (Mainstream, W)
- 11: 80s - ecological awareness (Avantgarde, W)
- 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
- 13: 80s - Swords into ploughshares (Avantgarde, E)
- 14: 90s - digital media kids (Mainstream, E+W)
- 15: 90s - ecological awareness (Avantgarde, E+W)
'''
# will code Avantgarde = 0 and Mainstream = 1
# 40s = 0, 50s = 1, 60s = 2, 70s = 3, 80s = 4, 90s = 5
decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
#creating new columns
data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
#checking that the columns populated correctly
data_low.head()
#creating the maps for the lists in the dict
decade_map = {value: key for key, values in decade.items() for value in values}
movement_map= {value: key for key, values in movement.items() for value in values}
#replacing and checking decade column
data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
data_low['DECADE'].head()
#drop old column
data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
#replacing and checking decade column
data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
data_low['MOVEMENT'].head()
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
data_low['CAMEO_INTL_2015'].head()
'''
From the data dictionary
German CAMEO: Wealth / Life Stage Typology, mapped to international code
- -1: unknown
- 11: Wealthy Households - Pre-Family Couples & Singles
- 12: Wealthy Households - Young Couples With Children
- 13: Wealthy Households - Families With School Age Children
- 14: Wealthy Households - Older Families & Mature Couples
- 15: Wealthy Households - Elders In Retirement
- 21: Prosperous Households - Pre-Family Couples & Singles
- 22: Prosperous Households - Young Couples With Children
- 23: Prosperous Households - Families With School Age Children
- 24: Prosperous Households - Older Families & Mature Couples
- 25: Prosperous Households - Elders In Retirement
- 31: Comfortable Households - Pre-Family Couples & Singles
- 32: Comfortable Households - Young Couples With Children
- 33: Comfortable Households - Families With School Age Children
- 34: Comfortable Households - Older Families & Mature Couples
- 35: Comfortable Households - Elders In Retirement
- 41: Less Affluent Households - Pre-Family Couples & Singles
- 42: Less Affluent Households - Young Couples With Children
- 43: Less Affluent Households - Families With School Age Children
- 44: Less Affluent Households - Older Families & Mature Couples
- 45: Less Affluent Households - Elders In Retirement
- 51: Poorer Households - Pre-Family Couples & Singles
- 52: Poorer Households - Young Couples With Children
- 53: Poorer Households - Families With School Age Children
- 54: Poorer Households - Older Families & Mature Couples
- 55: Poorer Households - Elders In Retirement
- XX: unknown
'''
#creating new columns
data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
data_low.head()
#slicing the strings: tens digit = wealth, ones digit = life stage
data_low['WEALTH'] = data_low['WEALTH'].astype(str).str[0]
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
data_low['LIFE_STAGE'].head()
#checking the values
data_low['LIFE_STAGE'].unique()
#creating masks to fix the null values
m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
#fixing the null values
data_low.loc[m, 'LIFE_STAGE'] = np.nan
data_low.loc[ma, 'WEALTH'] = np.nan
#double checking
data_low['LIFE_STAGE'].unique()
data_low['WEALTH'].unique()
#changing the type
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
data_low['WEALTH']= data_low['WEALTH'].astype(float)
#dropping the original column
data_low = data_low.drop(columns = 'CAMEO_INTL_2015')
#dropping mixed columns that have duplicate info
data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB'])
#checking the data type of this column
data_low['WOHNLAGE'].describe()
data_low['PLZ8_BAUMAX'].describe()
I generated two new columns, 'DECADE' and 'MOVEMENT', from the original mixed feature "PRAEGENDE_JUGENDJAHRE", using the value descriptions in the data dictionary. Two dictionaries specify the new values: 'MOVEMENT' is a binary variable (1 for Mainstream, 0 for Avantgarde), while 'DECADE' is an interval-type variable (0 for the 40s through 5 for the 90s).
After verifying that the data had been transferred correctly, I removed the original "PRAEGENDE_JUGENDJAHRE" column.
Similarly, the "CAMEO_INTL_2015" column contains two-digit codes in which the tens digit represents wealth and the ones digit represents life stage, so I split it into 'WEALTH' and 'LIFE_STAGE'. Once the new columns were verified, I converted the leftover 'nan' strings back to proper null values and dropped the original "CAMEO_INTL_2015" column.
The columns 'WOHNLAGE' and 'PLZ8_BAUMAX' already use ordered numeric codes, so they were left unchanged. The columns 'LP_LEBENSPHASE_FEIN' and 'LP_LEBENSPHASE_GROB' contain life stage information that is already captured elsewhere, so they were dropped.
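As a quick sanity check of the split (a minimal sketch using a code taken from the data dictionary excerpt above):
#sanity check: '51' should map to WEALTH 5 (Poorer Households) and LIFE_STAGE 1 (Pre-Family Couples & Singles)
sample_code = '51'
(int(sample_code[0]), int(sample_code[-1]))  # -> (5, 1)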
data_low.info()
#dropping row_null_count since it is no longer needed
data_low = data_low.drop(columns = 'row_null_count')
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
# convert missing value codes into NaNs, ...
for attribute, miss in zip(feat_info['attribute'], feat_info['missing_or_unknown']):
missing_values = miss.strip('[]').split(',')
miss_list = ['X', 'XX', ''] #including X and XX due to errors when running the following int conversion
missing_values = [int(value) if (value not in miss_list) else value for value in missing_values]
if missing_values != ['']:
df[attribute] = df[attribute].replace(missing_values, np.nan)
# remove selected columns and rows, ...
outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
df = df.drop(columns=outliers)
df['row_null_count'] = df.isnull().sum(axis=1)
data_low = df.query('row_null_count <= 25')
# select, re-encode, and engineer column values.
#### correcting column 'OST_WEST_KZ'
data_low['OST_WEST_KZ'].replace('O', 1, inplace = True)
data_low['OST_WEST_KZ'].replace('W', 0, inplace = True)
#### dropping multi-feature columns
multi_feat = ['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN',
'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015',
'CAMEO_DEU_2015']
data_low= data_low.drop(multi_feat, axis = 1)
###correcting "PRAEGENDE_JUGENDJAHRE"
decade = {0: [1, 2], 1: [3, 4], 2: [5, 6, 7], 3: [8, 9], 4: [10, 11, 12, 13], 5: [14, 15]}
movement = {0: [2, 4, 6, 7, 9, 11, 13, 15], 1: [1, 3, 5, 8, 10, 12, 14]}
data_low['DECADE'] = data_low['PRAEGENDE_JUGENDJAHRE']
data_low['MOVEMENT'] = data_low['PRAEGENDE_JUGENDJAHRE']
decade_map = {value: key for key, values in decade.items() for value in values}
movement_map= {value: key for key, values in movement.items() for value in values}
data_low['DECADE'] = data_low['DECADE'].replace(decade_map)
data_low['MOVEMENT'] = data_low['MOVEMENT'].replace(movement_map)
data_low = data_low.drop(columns = 'PRAEGENDE_JUGENDJAHRE')
####correcting 'CAMEO_INTL_2015'
data_low['WEALTH'] = data_low['CAMEO_INTL_2015']
data_low['LIFE_STAGE'] = data_low['CAMEO_INTL_2015']
data_low['WEALTH'] = data_low['WEALTH'].astype(str).str[0]
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(str).str[1:]
m = data_low['LIFE_STAGE'].str.contains('an', case=False, regex=True)
ma = data_low['WEALTH'].str.contains('n', case=False, regex=True)
data_low.loc[m, 'LIFE_STAGE'] = np.nan
data_low.loc[ma, 'WEALTH'] = np.nan
data_low['LIFE_STAGE'] = data_low['LIFE_STAGE'].astype(float)
data_low['WEALTH']= data_low['WEALTH'].astype(float)
data_low = data_low.drop(columns = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'CAMEO_INTL_2015', 'row_null_count'])
####fill nulls
fill_null = SimpleImputer(strategy='median')
new_df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
# Return the cleaned dataframe.
return new_df
Before applying dimensionality reduction, I performed feature scaling to ensure that the principal component vectors are not influenced by differences in feature scales. Using the `StandardScaler` from `sklearn`, I scaled each feature to have a mean of 0 and a standard deviation of 1.
Before scaling, I made sure the data was free of missing values. I considered different approaches, such as removing data points with missing values or applying imputation, and selected the method that best suited the amount of missing data in the dataset.
Finally, I used the `.fit_transform()` method to both fit the scaler and transform the data, ensuring that the scaling parameters are saved for later use on customer demographics data.
(data_low.shape[0], data_low.isna().sum().sum())
#will fill in the null values with the median value for the column
fill_null = SimpleImputer(strategy='median')
#creating a new df with the null values filled in
#used https://stackoverflow.com/questions/33660836/impute-entire-dataframe-all-columns-using-scikit-learn-sklearn-without-itera
# for syntax
df = pd.DataFrame(fill_null.fit_transform(data_low), columns=data_low.columns)
##checking to make sure the column names transferred over as expected
df.head()
#checking to make sure there are no null values
df.isna().sum().sum()
# Apply feature scaling to the general population demographics data.
s = StandardScaler()
s.fit(df)
s_df = s.transform(df)
s_df = pd.DataFrame(s_df, columns=df.columns)
Several fields within the dataset contained null values, and dropping every affected row was impractical. Instead, an imputer was used to replace each null with the median of its column, an approach chosen to minimize the impact on the subsequent analysis.
Now that the data is scaled, I applied Principal Component Analysis (PCA) using `sklearn`'s PCA class. Initially, I computed all of the components in order to capture the general trend in variability.
I examined the explained variance ratio for each principal component and plotted the cumulative variance using Matplotlib's `plot()` function. Based on the results, I selected the optimal number of components to retain for clustering.
Finally, I re-fitted the PCA instance to transform the data according to the chosen number of components. This reduced dataset will be used for the next steps in the project.
# Apply PCA to the data.
pca = PCA()
fit_pca = pca.fit(s_df)
# Investigate the variance accounted for by each principal component.
fit_pca.explained_variance_ratio_
#scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(fit_pca.explained_variance_ratio_.shape[0]), np.cumsum(fit_pca.explained_variance_ratio_))
plt.xlabel("Number of Components")
plt.ylabel("Variance Explained")
plt.show();
# Re-apply PCA to the data while selecting for number of components to retain.
pca = PCA(n_components = 30)
data_pca = pca.fit_transform(s_df)
data_pca
pca.explained_variance_ratio_.sum()
Following the initial fit and the resulting scree plot, I chose to retain 30 principal components, since components beyond that point each explain only a small additional share of the variance.
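As a quick way to double-check that choice (a sketch, not part of the original selection; the 1% cutoff is an assumed rule of thumb for "no longer significant"):
#cumulative variance captured by the first 30 components
np.cumsum(fit_pca.explained_variance_ratio_)[29]
#index of the first component whose marginal explained variance falls below 1%
np.argmax(fit_pca.explained_variance_ratio_ < 0.01)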
With our transformed principal components, it’s helpful to examine the influence of each original feature on the first few components. Each principal component is a unit vector pointing in the direction of highest variance. Features with weights far from zero have the most influence on that component.
To analyze this, I mapped each weight to its corresponding feature, sorted the features by their weights, and focused on the most prominent features at both ends of the list. Features with large weights of the same sign suggest positive correlations, while opposite signs indicate negative correlations.
I explored the feature associations for the first three principal components by writing a function to print the sorted list of feature weights for any component. This function will be useful later when interpreting cluster tendencies. The data dictionary helped me understand the relationships between these prominent features and what their positive or negative values might signify.
#map weights for the first principal component to corresponding feature names and then print the linked values, sorted by weight.
def weights(pca, i):
'''
i = index of the desired principal component (0-based)
'''
weight_map = {}
for j, feature in enumerate(s_df.columns):
weight_map[feature] = pca.components_[i][j]
sorted_weight = sorted(weight_map.items(), key=operator.itemgetter(1), reverse=True)
return sorted_weight
#using function to map first principal component
first = weights(pca, 0)
first
#map weights for the second principal component to corresponding feature names and then print the linked values, sorted by weight.
second = weights(pca, 1)
second
#map weights for the third principal component to corresponding feature names
third = weights(pca, 2)
third
In the first principal component, the feature with the largest positive weight is PLZ8_ANTG3, the number of 6-10 family (large family) houses in the region. The feature with the most negative weight is MOBI_REGIO, which characterizes movement patterns.
In the second principal component, the most positively weighted feature is ALTERSKATEGORIE_GROB, an estimated age based on analysis of the person's given name. The feature with the most negative weight is SEMIO_REL, the religious personality typology.
The third principal component is dominated on the positive side by SEMIO_VERT, the dreamful personality typology, and on the negative side by ANREDE_KZ, which corresponds to gender.
Now that the data is scaled and transformed, it's time to apply k-means clustering to see how the data groups in the principal-component space. My approach: run k-means over a range of cluster counts, compute the within-cluster sum of squared distances for each count, plot those scores to look for an elbow, and then re-fit the model with the chosen number of clusters.
This process will help identify the natural groupings in the data and set the stage for further analysis.
# Over a number of different cluster counts...
k = list(range(5, 20, 2)) #list 5-19 odd
def get_kmeans(data, k):
# run k-means clustering on the data and...
kmeans = KMeans(n_clusters=k)
model = kmeans.fit(data)
# compute the within-cluster sum of squared distances (score() returns its negative).
score = np.abs(model.score(data))
return score
# Investigate the change in within-cluster distance across number of clusters.
# HINT: Use matplotlib's plot function to visualize this relationship.
scores = []
for i in k:
scores.append(get_kmeans(data_pca, i))
#scree plot
plt.plot(k, scores, linestyle='solid', marker='o', color='g');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans = KMeans(n_clusters=15)
model = kmeans.fit(data_pca)
gen_pop_predict = model.predict(data_pca)
#finding cluster centers
cluster_centers = model.cluster_centers_
#scatter plot of the data points for each cluster
plt.figure(figsize=(8, 6))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=gen_pop_predict, cmap='PRGn')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='o', s=50, label='Cluster Centers')
plt.title('K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend();
According to the scree plot, 15 clusters is roughly the point at which additional clusters cease to provide a meaningful reduction in the within-cluster distance.
Now that the clusters and cluster centers have been identified for the general population, the next step is to map the customer data onto these clusters. This involves applying the transformations and models fitted on the general population to the customer data, without re-fitting them.
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv', sep=';')
customers.head()
customers.isna().sum().sum()
customers.shape[0]
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
clean_customers = clean_data(customers)
clean_customers.head()
clean_customers.shape[0]
#scale
cust_s = s.transform(clean_customers)
cust_s = pd.DataFrame(cust_s, columns=clean_customers.columns)
cust_s.head()
#transforming customers using pca
pca_customers = pca.transform(cust_s)
pca_customers
#predicting clusters
cust_predict = model.predict(pca_customers)
cust_predict
cust_predict.shape
The final task is to compare the cluster distributions between the general population and the customer base to identify which clusters are overrepresented or underrepresented among the customers.
#plots showing clusters
#used for syntax https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,5))
sns.countplot(gen_pop_predict, ax=ax1)
ax1.set_title('General Clusters')
sns.countplot(cust_predict, ax=ax2)
ax2.set_title('Customer Clusters');
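The count plots compare raw counts, but the two datasets differ in size, so a proportion-based comparison (a sketch using the predictions computed above) makes over- and under-representation easier to read:
#proportion of people assigned to each cluster in each dataset
gen_prop = pd.Series(gen_pop_predict).value_counts(normalize=True).sort_index()
cust_prop = pd.Series(cust_predict).value_counts(normalize=True).sort_index()
cluster_compare = pd.DataFrame({'general': gen_prop, 'customers': cust_prop})
cluster_compare['difference'] = cluster_compare['customers'] - cluster_compare['general']
cluster_compare.sort_values(by='difference', ascending=False)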
#What kinds of people are part of a cluster that is overrepresented in the customer data compared to the general population?
#used for syntax https://stackoverflow.com/questions/49885007/how-to-use-scikit-learn-inverse-transform-with-new-values
cluster_11 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[11]))
over = pd.Series(data = cluster_11, index=clean_customers.columns)
over
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_10 = s.inverse_transform(pca.inverse_transform(kmeans.cluster_centers_[10]))
under = pd.Series(data = cluster_10, index=clean_customers.columns)
under
The top features for the cluster over-represented in the customer data (cluster 11) are also the top features for the under-represented cluster (cluster 10):
- ALTERSKATEGORIE_GROB - estimated age based on given name analysis - 46-60 year olds
- ANREDE_KZ - gender - male
- FINANZ_MINIMALIST - MINIMALIST: low financial interest - average to lowest