# Machine Learning - Dummy Variable Conversion

Security Engineer/Consultant at AllMed Healthcare Management
Sep 14, 2017


• 1. Regression Methods in Machine Learning: Categorical Variable Conversion. Portland Data Science Group. Andrew Ferlitsch, Community Outreach Officer. July 2017.
• 2. Linear Regression
  • All the features (independent variables) need to be real numbers.
  • A feature CANNOT be a categorical value, i.e., a named or enumerated value.
  • Examples of categorical values: Male vs. Female; Red, Blue, Green; Apple, Banana, Pear, Orange.
• 3. Categorical Variables

| Age | Gender | Income |
|-----|--------|--------|
| 25  | Male   | 25000  |
| 26  | Female | 22000  |
| 30  | Male   | 45000  |
| 24  | Female | 26000  |

Independent variables (features): Age (real values) and Gender (categorical values).
Dependent variable (label): Income, the value to predict.
• 4. Dummy Variable Conversion

Known in Python (scikit-learn) as OneHotEncoder. For each categorical feature:

1. Scan the dataset and determine all the unique values of the feature.
2. Create a new feature (i.e., a dummy variable) in the dataset, one per unique value.
3. Remove the categorical feature from the dataset.
4. For each sample (row), set a 1 in the dummy variable that corresponds to that sample's categorical value.
5. Set a 0 in the remaining dummy variables for that categorical feature.
6. Remove one dummy variable.
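
The six steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name `dummy_encode` and the choice to drop the last category seen are assumptions made for the example, and the data is the Age/Gender/Income table from slide 3.

```python
def dummy_encode(rows, feature):
    """Replace a categorical feature with 0/1 dummy variables, dropping one.

    rows: list of dicts, one per sample. feature: key of the categorical field.
    (Hypothetical helper written for this example.)
    """
    # Step 1: collect the unique values, in order of first appearance.
    categories = []
    for row in rows:
        if row[feature] not in categories:
            categories.append(row[feature])

    # Step 6: drop one dummy variable (here, arbitrarily, the last category).
    kept = categories[:-1]

    encoded = []
    for row in rows:
        # Step 3: copy the sample without the categorical feature.
        new_row = {key: value for key, value in row.items() if key != feature}
        # Steps 2, 4 and 5: one 0/1 feature per kept category.
        for category in kept:
            new_row[category] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded


people = [
    {"Age": 25, "Gender": "Male", "Income": 25000},
    {"Age": 26, "Gender": "Female", "Income": 22000},
    {"Age": 30, "Gender": "Male", "Income": 45000},
    {"Age": 24, "Gender": "Female", "Income": 26000},
]

encoded_people = dummy_encode(people, "Gender")
for row in encoded_people:
    print(row)
```

With this data, Gender is replaced by a single Male column, which matches the result shown on slide 6.
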
• 5. Dummy Variable Trap

| Gender | Male | Female |
|--------|------|--------|
| Male   | 1    | 0      |
| Female | 0    | 1      |
| Male   | 1    | 0      |
| Female | 0    | 1      |

Need to drop one dummy variable! Multicollinearity occurs when one variable predicts another. Labeling the columns x1, x2, x3, the dummy variables x2 (Male) and x3 (Female) satisfy x2 = 1 - x3. As a result, a regression analysis cannot distinguish between the contributions of x2 and x3.
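
A small sketch of why the regression cannot separate the two contributions, assuming a linear model with an intercept. The coefficient values and the shift `c` are made up for illustration: moving any amount `c` from the two dummy coefficients into the intercept leaves every prediction unchanged, so the data cannot identify a unique set of coefficients.

```python
# Dummy variables from the table above: x2 = Male, x3 = Female.
male = [1, 0, 1, 0]
female = [0, 1, 0, 1]

# The trap: on every row, x2 = 1 - x3.
collinear = all(m == 1 - f for m, f in zip(male, female))


def predict(intercept, b_male, b_female):
    """Linear model: intercept + b_male * x2 + b_female * x3, per sample."""
    return [intercept + b_male * m + b_female * f for m, f in zip(male, female)]


# Two different coefficient sets (illustrative values only).
c = 5000
preds_a = predict(20000, 15000, 4000)
preds_b = predict(20000 + c, 15000 - c, 4000 - c)

print(collinear)           # True
print(preds_a == preds_b)  # True: identical predictions, different coefficients
```

Dropping one of the two dummy variables removes this ambiguity, because the remaining coefficient is then measured relative to the dropped category.
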
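• 6. Drop One of the Dummy Variables

Drop one of the dummy variables. Here, Gender is replaced with Male:

| Age | Gender | Income |
|-----|--------|--------|
| 25  | Male   | 25000  |
| 26  | Female | 22000  |
| 30  | Male   | 45000  |
| 24  | Female | 26000  |

| Age | Male | Income |
|-----|------|--------|
| 25  | 1    | 25000  |
| 26  | 0    | 22000  |
| 30  | 1    | 45000  |
| 24  | 0    | 26000  |

The same applies to a feature with more than two values:

| Age | Race     | Income |
|-----|----------|--------|
| 20  | White    | Apple  |
| 26  | Hispanic | 22000  |
| 30  | Asian    | 45000  |
| 24  | Asian    | 26000  |

| Age | White | Asian | Income |
|-----|-------|-------|--------|
| 20  | 1     | 0     | Apple  |
| 26  | 0     | 0     | 22000  |
| 30  | 0     | 1     | 45000  |
| 24  | 0     | 1     | 26000  |

Dropped Hispanic (i.e., a Hispanic sample is encoded as White: 0, Asian: 0).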
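
The Race example can be reproduced with a short sketch in which the dropped category is chosen explicitly. The helper name `encode_with_baseline` and its `baseline` parameter are hypothetical, introduced only for this illustration; the dropped category is encoded as all zeros.

```python
def encode_with_baseline(values, baseline):
    """One-hot encode `values`, leaving out the column for `baseline`.

    (Hypothetical helper written for this example.)
    """
    if baseline not in values:
        raise ValueError(f"baseline {baseline!r} does not occur in the data")
    # One dummy variable per remaining category, in sorted order.
    categories = sorted(set(values) - {baseline})
    return [{c: int(v == c) for c in categories} for v in values]


races = ["White", "Hispanic", "Asian", "Asian"]
encoded_races = encode_with_baseline(races, baseline="Hispanic")
for race, row in zip(races, encoded_races):
    print(race, row)
```

Library encoders offer a similar option, although they pick the dropped category for you: `pandas.get_dummies(..., drop_first=True)` and scikit-learn's `OneHotEncoder(drop="first")` both drop the first category of each feature.
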