
Artificial Intelligence Technology and Application



Machine Learning Lab Guide Student Version



Huawei Technologies CO., LTD.



Contents
1 Feature Engineering on Banks' Private Credit Data
  1.1 Introduction
    1.1.1 About This Lab
    1.1.2 Objectives
    1.1.3 Case Background
  1.2 Data Preprocessing
    1.2.1 Background
    1.2.2 Procedure
  1.3 Feature Selection
    1.3.1 Background
    1.3.2 Feature Selection Methods
    1.3.3 Filter
    1.3.4 Wrapper
    1.3.5 Embedded
    1.3.6 Variable Removal
  1.4 Feature Construction
    1.4.1 Background
    1.4.2 Polynomial Feature Construction
2 Real-Time Recommendation Practice for Retail Products
  2.1 Introduction
    2.1.1 About This Lab
  2.2 Procedure
    2.2.1 Preparing E-commerce Platform Data
    2.2.2 Recommending Products Based on the Product Popularity
    2.2.3 Recommending Products Based on Collaborative Filtering
3 Private Credit Default Prediction
  3.1 Introduction
    3.1.1 Objectives
    3.1.2 Background
  3.2 Procedure
    3.2.1 Reading Data
    3.2.2 Viewing Missing Values
    3.2.3 Splitting the Dataset
    3.2.4 Standardizing Data (Preprocessing Data)
    3.2.5 Handling the Class Imbalance Issue (Preprocessing)
    3.2.6 Performing Grid Search (Modeling)
    3.2.7 Verifying Performance (Evaluation)
    3.2.8 Saving the Model
4 Survival Prediction of the Titanic
  4.1 Introduction
    4.1.1 About This Lab
    4.1.2 Objectives
    4.1.3 Datasets and Frameworks
  4.2 Procedure
    4.2.1 Importing Related Libraries
    4.2.2 Importing Datasets
    4.2.3 Preprocessing Data
    4.2.4 Building a Model
5 Linear Regression
  5.1 Introduction
    5.1.1 About This Lab
    5.1.2 Objectives
  5.2 Procedure
    5.2.1 Preparing Data
    5.2.2 Defining Related Functions
    5.2.3 Starting Iteration
  5.3 Thinking and Practices
    5.3.1 Question 1
    5.3.2 Question 2
6 Flower Category Analysis
  6.1 Introduction
    6.1.1 About This Lab
  6.2 Experiment Code
    6.2.1 Importing Related Libraries
    6.2.2 Importing a Dataset
    6.2.3 Splitting the Dataset
    6.2.4 Performing Modeling
    6.2.5 Effect After Data Preprocessing
7 Emotion Recognition of Customer Evaluations in the Retail Industry
  7.1 Introduction
    7.1.1 Objectives
    7.1.2 Case Background
  7.2 Procedure
    7.2.1 Data Management
    7.2.2 Data Reading
    7.2.3 Data Processing
    7.2.4 Model Training
8 Boston Housing Price Forecast
  8.1 Introduction
    8.1.1 About This Lab
    8.1.2 Objectives
    8.1.3 Experiment Dataset and Framework
  8.2 Procedure
    8.2.1 Introducing the Dependency
    8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data
    8.2.3 Splitting and Preprocessing the Dataset
    8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
    8.2.5 Adjusting Grid Search Hyperparameters
9 E-commerce Website User Group Analysis
  9.1 Introduction
    9.1.1 About This Lab
  9.2 Experiment Code
    9.2.1 Using sklearn for Modeling
    9.2.2 Selecting the Optimal k Value



1 Feature Engineering on Banks' Private Credit Data



1.1 Introduction
1.1.1 About This Lab
Feature engineering is the process of extracting features from raw data. Data and features determine the upper limit of machine learning, while models and algorithms only help approach that upper limit. Feature engineering and construction aim to make the extracted features represent the essential characteristics of the data to the greatest extent, so that a model constructed based on these features predicts well on unknown datasets.



1.1.2 Objectives
Upon completion of this task, you will be able to:
• Master the Python-based feature selection method.
• Master the Python-based feature extraction method.
• Master the Python-based feature construction method.



1.1.3 Case Background
With the development of online financial services, bank H plans to evaluate customer risks through online approval to reduce labor costs and improve approval efficiency. Online approval requires a stricter and more accurate risk control model to control corporate financial risks. Therefore, algorithm engineer A needs to complete feature engineering and construction for a credit risk model, based on historical customer credit data, before constructing the model. Engineer A needs to complete the following operations:
• Data preprocessing
• Feature selection
• Feature construction






1.2 Data Preprocessing
1.2.1 Background
The raw data collected by the back-end server of the bank may have problems such as missing values, garbled characters, redundant fields, and inconsistent data formats. To improve the data quality, engineer A needs to cleanse the data first.



1.2.2 Procedure
1.2.2.1 Importing Data
Code:
import pandas as pd
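The read call itself is left for the student to complete; a minimal sketch, assuming the credit data is stored as credit.csv in the current directory:

import pandas as pd
data = pd.read_csv('./credit.csv')  # File name assumed.
data.head()  # Preview the first five rows.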



Output:



1.2.2.2 Processing Missing Values
Step 1 View the missing values.
The missing values in the data may be caused by machine faults, manual input errors, or service attributes. The method for processing the missing values varies with the cause. missingno is a tool for visualizing missing values. You can run the following command to view the missing-value distribution in the data:
Code:
import missingno # Import the missingno package.
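A minimal sketch of the visualization call, assuming the DataFrame from the import step is named data:

import missingno
missingno.matrix(data)  # White gaps in the matrix plot mark missing values.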



Output:






As shown in the figure above, the Nation, Marriage_State, Highest Education, House_State, Industry, Title, and Duty fields contain a large number of missing values. In Pandas, isnull() flags the missing values in the data, and isnull().sum() counts them, which further allows checking the missing rate of each field.
Code:
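A sketch of the missing-rate check, assuming the DataFrame is named data:

missing_rate = data.isnull().sum() / len(data)  # Missing rate per field.
print(missing_rate.sort_values(ascending=False))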



Output:



Step 2



Fill the missing values with the mode. Pandas provides fillna() to fill missing values and mode() to compute the mode. You need to construct a for loop to process the multiple fields that contain missing values and fill each with its mode.
# Define the list of fields with missing values.



# Use the for loop to process the missing values in the multiple fields.



After the processing is complete, check the missing rate of each field.
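A sketch of the fill loop and the recheck, assuming the field list matches the missingno plot above:

miss_cols = ['Nation', 'Marriage_State', 'Highest Education', 'House_State', 'Industry', 'Title', 'Duty']
for col in miss_cols:
    data[col] = data[col].fillna(data[col].mode()[0])  # Fill each field with its mode.
print(data[miss_cols].isnull().sum() / len(data))  # The missing rates should now be 0.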



The following are methods for handling missing values:
1. Direct deletion: deletes non-important service fields with a missing rate greater than 80%.
2. Data filling: fills the missing values with determined values, statistical indicator values, or algorithm-based missing-value predictions.
3. Separate processing: processes samples with missing values as one category.






----End



1.3 Feature Selection
1.3.1 Background
If excessive features are obtained after data preprocessing, the model may be unstable and have poor generalization capability, and the computing complexity increases sharply. Therefore, engineer A needs to preliminarily filter out features that are not important to the prediction result.



1.3.2 Feature Selection Methods
The following are methods for feature selection:
• Filter: filters features based on statistical indicators of the feature vectors themselves and of the feature vectors against the target variables.
• Wrapper: attempts different feature subsets for modeling and uses the model precision as an evaluation indicator for the feature subsets.
• Embedded: evaluates feature weights during model training and scores the importance of the features.



1.3.3 Filter
Step 1 Analyze the crosstab.
Apply the crosstab() method to draw a crosstab, using the variable House_State and the target variable Target as an example.
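A sketch of the crosstab, assuming the DataFrame is named data:

import pandas as pd
# Cross-tabulate House_State against Target; normalizing by row shows the default rate per value.
print(pd.crosstab(data['House_State'], data['Target'], normalize='index'))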






In the output, the default rate is 0.019 when House_State is set to 1, and 0.045 when House_State is set to 2. If the default rates were the same, the variable House_State would not affect the default prediction. The crosstab analysis supports only a preliminary judgment; a chi-square test is further needed to determine whether the numerical difference is statistically significant.



Step 2 Perform the chi-square test.
Separate the independent variables and the dependent variable in the raw data, and select the categorical variables from the independent variables. The Target field is the target variable and is assigned to y. The columns with the target variable removed are assigned to X as independent variables. X_category indicates the categorical variables.



Import the chi-square test package chi2 from sklearn.feature_selection and use chi2() to calculate the chi-square value between each categorical variable and the target variable.
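A sketch of the computation, assuming X_category holds the categorical columns and y the target:

from sklearn.feature_selection import chi2
chi2_stat, p_values = chi2(X_category, y)  # Chi-square statistic and p-value per feature.
for name, stat, p in zip(X_category.columns, chi2_stat, p_values):
    print(name, round(stat, 2), round(p, 4))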



Step 3 Test the correlation between continuous variables.
If two continuous independent variables are highly correlated, delete one of them or extract the common information from the two.
The method parameter of the corr() function indicates the method for calculating the correlation coefficient. The options are as follows:










• pearson: Pearson correlation coefficient.
• kendall: correlation coefficient for unordered categorical variables.
• spearman: Spearman correlation coefficient, which is mainly used for correlation analysis of non-linearly and non-normally distributed data.

import matplotlib.pyplot as plt
import seaborn as sns



Calculate the correlation coefficients between continuous independent variables and select the combinations of independent variables whose correlation coefficient is greater than 0.8.
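A sketch of the correlation screen, assuming X_continuous holds the continuous columns:

import matplotlib.pyplot as plt
import seaborn as sns

corr = X_continuous.corr(method='pearson')  # Pairwise Pearson correlation coefficients.
sns.heatmap(corr, cmap='coolwarm')
plt.show()

# List variable pairs whose absolute correlation exceeds 0.8.
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > 0.8:
            print(cols[i], cols[j], round(corr.iloc[i, j], 3))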



----End






1.3.4 Wrapper
In the wrapper selection method, different feature subsets are used for modeling, the model precision is used as the evaluation indicator for the feature subsets, and a base model is selected to perform multi-round training. After each round of training, the features with the smallest weight coefficients are removed, and the next round of training is performed on the new feature set. The RFE() method of the feature_selection submodule in sklearn is invoked. The logistic regression model LogisticRegression() is used as the base model, and the following parameters are passed to RFE():
estimator: base training model, which is a logistic regression model in this example.
n_features_to_select: number of retained features.
fit(X, y): invokes and trains the model.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
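A sketch of the call that produces the output below, assuming X and y from the previous steps:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
rfe.fit(X, y)
print(rfe.n_features_)  # Number of selected features.
print(rfe.support_)     # Mask of retained features.
print(rfe.ranking_)     # Feature ranking; 1 marks a selected feature.
print(rfe.estimator_)   # The fitted base model and its parameters.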



Output:
20
[ True True False True True True False True True True True False False True False True True False True True True True True True True False False False True False]
[ 1 1 9 1 1 1 10 1 1 1 1 6 3 1 11 1 1 8 1 1 1 1 1 1 1 5 4 7 1 2]
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)



The return values of the RFE() method are as follows:
• n_features_: number of selected features, that is, the value of the n_features_to_select parameter passed to RFE().
• support_: indicates, at each position, whether the corresponding feature is selected. True indicates that the feature is retained, and False indicates that the feature is removed.
• ranking_: indicates the feature ranking. ranking_[i] corresponds to the ranking of the i-th feature, and the value 1 indicates an optimal feature. The selected features are the 20 features corresponding to the value 1, namely, the optimal features.
• estimator_: returns the parameters of the base model.



1.3.5 Embedded
The embedded method uses a machine learning model for training to obtain the weight coefficients of features, and selects features in descending order of the weight coefficients.






Common embedded methods are based on either of the following:
• Linear models and regularization
• Feature selection of a tree model

In a tree model, the importance of a feature is related to the depth of the nodes at which the feature is used for splitting. In this experiment, a random forest is used to calculate the importance of each feature. The random forest classification method in the sklearn.ensemble submodule is invoked, and the model is trained by using the fit(X, y) method.
from sklearn.ensemble import RandomForestClassifier



After the model training is complete, the weight evaluation value of each feature is printed.
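A sketch that produces a ranking like the output below, assuming X is a DataFrame and y the target:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
# Pair each importance score with its feature name and sort in descending order.
print(sorted(zip((round(s, 4) for s in rf.feature_importances_), X.columns), reverse=True))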



Output: [(0.1315, 'Ast_Curr_Bal'), (0.1286, 'Age'), (0.0862, 'Year_Income'), (0.0649, 'Std_Cred_Limit'), (0.043, 'ZX_Max_Account_Number'), (0.0427, 'Highest Education'), (0.0416, 'ZX_Link_Max_Overdue_Amount'), (0.0374, 'ZX_Max_Link_Banks'), (0.0355, 'Industry'), (0.0354, 'ZX_Max_Overdue_Duration'), (0.0311, 'ZX_Total_Overdu_Months'), (0.0305, 'Marriage_State'), (0.0305, 'Duty'), (0.0292, 'Couple_Year_Income'), (0.0279, 'ZX_Credit_Max_Overdu_Amount'), (0.0246, 'ZX_Max_Overdue_Account'), (0.0241, 'ZX_Max_Credit_Banks'), (0.0221, 'ZX_Max_Credits'), (0.0205, 'Birth_Place'), (0.0195, 'Loan_Curr_Bal'), (0.0173, 'L12_Month_Pay_Amount'), (0.015, 'ZX_Credit_Max_Overdue_Duration'), (0.013, 'Title'), (0.0097, 'ZX_Credit_Total_Overdue_Months'), (0.0096, 'Nation'), (0.0084, 'Gender'), (0.0079, 'Work_Years'), (0.0064, 'ZX_Max_Overdue_Credits'), (0.0059, 'House_State'), (0.0, 'Couple_L12_Month_Pay_Amount')]






1.3.6 Variable Removal
Based on the results of the preceding three variable selection methods, the variables that have little impact weight on the model are removed.



1.4 Feature Construction
1.4.1 Background
Feature selection is used to find how important each feature is to model construction and to remove the variables that have little impact on model construction, thereby reducing dimensionality. Feature construction is to manually derive significant model features from the raw data. On the one hand, a new variable can be constructed by combining several different features based on service understanding; on the other hand, variables of different time windows can be divided according to a time attribute. Engineer A has completed the preliminary filtering of features, removed some variables that have little impact on the model, and now attempts to construct new features to improve the model precision.



1.4.2 Polynomial Feature Construction
Polynomial feature construction explores the impact of compound variables on the target variable through product transformations of existing features. The PolynomialFeatures() method in the sklearn.preprocessing submodule is used for feature interaction. Variables with higher scores in the model, that is, Ast_Curr_Bal, Age, Year_Income, and Std_Cred_Limit, are selected to construct polynomial features.
PolynomialFeatures(degree=3): specifies degree 3 for the interaction between variables, that is, constructs polynomial terms whose powers sum to at most 3.
from sklearn.preprocessing import PolynomialFeatures
# Select fields used to construct the polynomial feature.






# Invoke the fit() method to construct the polynomial feature.
# Convert data.



To check the correlation between the newly generated variable and the target variable, construct a dataset containing the target variable and the newly generated variable first.



The corr() function is used to calculate the correlation coefficient between the newly generated variable and the target variable.
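A sketch of the construction and the correlation check, assuming data holds the selected fields and Target (note that scikit-learn 1.0 and later rename get_feature_names to get_feature_names_out):

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

cols = ['Ast_Curr_Bal', 'Age', 'Year_Income', 'Std_Cred_Limit']
poly = PolynomialFeatures(degree=3)
poly_data = poly.fit_transform(data[cols])  # Fit, then convert the data.
poly_df = pd.DataFrame(poly_data, columns=poly.get_feature_names(cols))
poly_df['Target'] = data['Target'].values   # Attach the target variable.

corr_with_target = poly_df.corr()['Target'].sort_values()
print(corr_with_target.head(5))  # Five smallest correlation coefficients.
print(corr_with_target.tail(5))  # Five largest correlation coefficients.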



Output:
Five features with the smallest correlation coefficients:
Age^3                  -0.010601
Age^2                  -0.009275
Age^2 Std_Cred_Limit   -0.008064
Age                    -0.007356
Age Std_Cred_Limit     -0.006834
Name: Target, dtype: float64
Five features with the largest correlation coefficients:
Year_Income^3          -0.001910
Ast_Curr_Bal Age       -0.001114
Ast_Curr_Bal            0.002849
Target                  1.000000
1                            NaN
Name: Target, dtype: float64



2 Real-Time Recommendation Practice for Retail Products



2.1 Introduction
2.1.1 About This Lab
Mr. Zhao works in the AI algorithm department of e-commerce platform company A and is responsible for product recommendation for online businesses. In the modern world of the Internet and e-commerce, people are overwhelmed by data, and it is impossible for users to extract the information they are interested in from all of it. To help users find product information, a recommendation system can model the similarities between users and products and provide suggestions to customers based on those similarities. The recommendation system is beneficial in:



• Helping users find the right products.
• Increasing user engagement by providing recommendations. For example, Google News saw a 40% increase in hits due to recommendations.
• Helping item providers deliver items to the right users. At Amazon, 35% of products are sold through recommendations.
• Helping personalize the recommended content. On Netflix, most rented movies are recommended ones.



2.2 Procedure
2.2.1 Preparing E-commerce Platform Data
Step 1 Import the required packages.
Functions in the NumPy library are used to perform basic operations on arrays. Pandas provides many data processing methods and time-series operations.
# Import module packages required by the project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



Step 2 Read data.
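A sketch of the read, assuming the rating file is named ratings.csv with columns userId, productId, Rating, and timestamp (file name assumed):

import pandas as pd
data = pd.read_csv('./ratings.csv')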



Step 3 Check the data overview.
View the format of the read data. You can use the head() function to check the first five rows of the data to get a rough understanding of the data content.



Step 4 View the data size.
You can further view the data size (the number of samples and the number of features in the data) by using the shape attribute.



Step 5 View the data type.
After learning the data size, you still need to view the data types by using the dtypes attribute to facilitate subsequent data calculation.



According to the result, only Rating and timestamp fall into the numeric type and can be used for mathematical calculation. If userId and productId need to be used for mathematical calculation, convert their types. In addition, you can use the info() function to view the general information about the data.






The result contains the number of data samples, feature type, data type, and data storage size. The info function can display the preceding information by default, but you can set an item to False to hide the item. For example, you can run the following command to hide the data storage size:
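A sketch of that command, assuming the DataFrame is named data:

data.info(memory_usage=False)  # Hide the memory usage line.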



Step 6 View the product ratings from users.
Product ratings are important data that reflect users' preferences, and such data is critical to an efficient recommendation system. You can use the describe() function to check the overview of the numeric data. To view only the preliminary analysis of Rating, add the corresponding column name in square brackets to the end of the command.



The result contains the average value, maximum value, minimum value, standard deviation, and quartiles of the data, and the product rating is generally about 4. You can use the min() and max() functions to print the maximum and minimum values of the rating.






You can also use the print() function to print the result or the value of a parameter. According to the result, the highest rating is 5, indicating that users' ratings on the product are generally high.



Step 7 View the missing values in the data.
The most important factors that affect data quality are missing values and abnormal values. As the ratings all fall within the normal range, you need to use the isnull() function to check whether values are null, and then use the sum() function to count the number of missing values.



Step 8 Check whether the users and products are unique.
A user can rate multiple products. Similarly, a product can be rated by different users. To determine the number of product types and the number of users, check the numbers of unique users and products.



Step 9 Delete time information.
You can use the drop() function to delete the product rating time.
axis: deletes a column when set to 1, and a row when set to 0.
inplace: applies the operation in place when set to True.
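A sketch, assuming the time column is named timestamp:

data.drop('timestamp', axis=1, inplace=True)  # Remove the rating time column.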



Step 10



Analyze the rating data. Sort the users and products by rating and view the sorting result.
groupby(): groups rows by the specified column.
sort_values(): sorts a group of data.
ascending: sorts in ascending order when set to True.

After obtaining the product data corresponding to the sorted user ratings, return the quantiles by using the quantile() function and plot them.

# Quantiles with 0.05 difference.
# Quantiles with 0.25 difference.
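A sketch of the per-user rating counts and their quantiles, assuming data from the previous steps:

import numpy as np
import matplotlib.pyplot as plt

# Number of ratings per user, in descending order.
ratings_per_user = data.groupby('userId')['Rating'].count().sort_values(ascending=False)

quantiles = ratings_per_user.quantile(np.arange(0, 1.01, 0.05))  # Quantiles with 0.05 difference.
quantiles.plot()
plt.show()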






----End



2.2.2 Recommending Products Based on the Product Popularity
Sorting the products by rating to analyze the product popularity during data preparation helps implement recommendation based on the product popularity.



Step 1 Sort products.
Similar to user sorting, products can be sorted based on the rating data to obtain the products that have been rated more than 50 times.



Step 2



Sort the products by the average rating. Calculate the average rating of each product, and then sort the products based on the average rating.
# Calculate the average rating of each product.



# Obtain the rankings of the products sorted by the number of rating times.
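A sketch of both rankings, assuming data from the previous steps:

# Average rating and number of rating times per product.
product_stats = data.groupby('productId')['Rating'].agg(['mean', 'count'])

popular = product_stats[product_stats['count'] > 50]  # Products rated more than 50 times.
print(popular.sort_values('mean', ascending=False).head(10))  # Top products by average rating.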






The result shows that the product with the highest average rating is rated by 1051 users.



Step 3 Visualize the result.
Analyze the product rankings and display the result in a chart. Specifically, use a histogram first to display the distribution of the number of users who rate each type of product.
hist(): draws a histogram.
bins: number of buckets in the histogram.






Sort the products by the number of users who rate the products, to obtain the product popularity.






----End



2.2.3 Recommending Products Based on Collaborative Filtering
Recommending products based on the product popularity is not enough to meet actual requirements. Therefore, the currently popular collaborative filtering method is used to implement recommendation.



Step 1 Create a table of relationships between products and users.
Select 10,000 samples and use pivot_table() to create a table of relationships between products and users.



You can use the shape attribute to view the table size, and then transpose the table. The data in the table is the product ratings from users.






# View the data size.
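A sketch of the relationship table, assuming data from the previous steps:

sample = data.head(10000)  # Select 10,000 samples.
rating_matrix = sample.pivot_table(values='Rating', index='userId', columns='productId', fill_value=0)
rating_matrix = rating_matrix.T  # Transpose so that each row is a product.
print(rating_matrix.shape)       # View the data size.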



Step 2 Decompose the table.
You can use the SVD algorithm to reduce the dimensions of the table to obtain 10 important product-based features.
from sklearn.decomposition import TruncatedSVD # Import the SVD algorithm.
# Construct an SVD model to combine the number of features (that is, the number of columns) into 10 important combined features.
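A sketch of the decomposition, assuming rating_matrix from the previous step:

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=10)            # Combine the columns into 10 features.
decomposed = svd.fit_transform(rating_matrix)  # Shape: (number of products, 10).
print(decomposed.shape)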



Step 3 Build a correlation coefficient matrix.
Calculate product similarities to implement a product-based recommendation system.
corrcoef(): calculates the correlation coefficients between data.



Step 4 Recommend products based on the product similarities.






Randomly select a product, select products whose coefficient of correlation with the selected one is greater than 0.65, and recommend these products to users who like the selected one.



# Determine whether the product is unique.



# View the number of similar products in the similarity table.



# Select products whose coefficient of correlation with the 20th product is greater than 0.65.
# Delete the 20th product.
# Recommend products ranked ahead to the users who like the 20th product.
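A sketch of steps 3 and 4 together, assuming decomposed and rating_matrix from the previous steps (the product's position in the table is assumed):

import numpy as np

corr_matrix = np.corrcoef(decomposed)  # Product-by-product correlation coefficients.

product_ids = list(rating_matrix.index)
idx = 19                               # Position of the 20th product.
target_corr = corr_matrix[idx]

# Products whose correlation with the 20th product is greater than 0.65, excluding itself.
similar = [pid for i, pid in enumerate(product_ids) if target_corr[i] > 0.65 and i != idx]
print(len(similar), similar)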



As shown in the result, there are eight products whose coefficient of correlation with the 20th product (9984984354) is greater than 0.65. You can also select other products to view their similar products.



----End



3 Private Credit Default Prediction



3.1 Introduction
Under the impact of the Internet, financial institutions are suffering from internal and external troubles. On the one hand, financial institutions face fierce competition and performance pressure from large financial and technology enterprises; on the other hand, more and more criminal groups use artificial intelligence (AI) technologies to increase their crime efficiency. These risks are hidden in every transaction phase, and if they are not prevented, the losses will be irreparable. Therefore, financial institutions pose increasingly high requirements on risk management accuracy and approval efficiency. This experiment discusses the problem and performs practice step by step from the perspectives of problem statement, breakdown, priority ranking, solution design, key point analysis, and summary and suggestions, cultivating project implementation thinking and implementing private credit default prediction analysis from scratch.



3.1.1 Objectives
Upon completion of this task, you will be able to:
• Understand the significance of credit default prediction.
• Master the development process of big data mining projects.
• Master the common algorithms for private credit default prediction.
• Understand the importance of data processing and feature engineering.
• Master the common methods for data preprocessing and feature engineering.
• Master the algorithm principles of logistic regression and XGBoost, and understand the key parameters.



3.1.2 Background
The case in this document is for reference only. The actual procedure may vary. For details, see the corresponding product documents.
The company has just set up a project team for private credit default prediction. Engineer A was appointed as the offline development PM of the project. This project aims to:
• Identify high-risk customers efficiently and accurately using new technologies.
• Make risk modes data-based by using scientific methods.
• Provide objective risk measurement.
• Reduce subjective judgments.
• Improve risk management efficiency.
• Save labor costs.



The ultimate goal is to productize the results, so that front-end operating departments can identify transactions with credit default risks in a timely manner to avoid corporate losses.



3.2 Procedure
3.2.1 Reading Data
First, import the dataset. This document uses the third-party module Pandas to import the dataset.
import pandas as pd
# Use pd.read_csv to read the dataset. (The dataset is stored in the current directory so that it can be read directly.)
# ./credit.csv indicates the current directory. The forward slash (/) is the direction used in Linux operating system (OS) paths.
# In the Windows OS, the backslash (\) is used, but the file path here must still use the forward slash, as in the Linux OS.
# The forward slash is on the same key as the question mark (?).



# An auxiliary module warnings can be imported.



# This module can help filter many redundant and annoying warnings.
# After data reading, some simple operations can be performed, for example:
# Run the following command to view all data.
data
# Run the following command to view the first 10 rows of data.
# Run the following command to view the length and width of data in the matrix format.
data.shape
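A sketch tying these comments together, assuming credit.csv sits in the current directory:

import warnings
warnings.filterwarnings('ignore')  # Filter redundant warnings.

import pandas as pd
data = pd.read_csv('./credit.csv')
print(data.head(10))  # First 10 rows.
print(data.shape)     # (length, width) of the data matrix.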



3.2.2 Viewing Missing Values
# Check the data missing status in a visualized manner.
# The third-party library missingno is used.






# Many values are missing and need to be filled.
# There are many filling methods: the missing values can be filled with the mean, the median, or the mode.
# The numeric types include discrete and continuous.
# If the mean were used, new values absent from a discrete field might be generated. Therefore, the mode is used.



# isnull() determines whether a value is null: True is returned if yes, False if not.
# In Python, True equals 1 and False equals 0.
# Therefore, sum() is used for judgment: a result greater than 0 means the field has missing values.
# The features with missing values are placed in the missname list.
# fillna() is used to fill the empty values with the mode.
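A sketch of the whole check-and-fill, assuming data from the reading step; missname matches the comment above:

import missingno
missingno.matrix(data)  # Visualize the missing values.

missname = [col for col in data.columns if data[col].isnull().sum() > 0]
for col in missname:
    data[col] = data[col].fillna(data[col].mode()[0])  # Fill with the mode.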



3.2.3 Splitting the Dataset
Then, split the dataset. Before splitting the dataset, remove the index column, as it is unimportant and interferes with the model judgment, and remove Target (the result), which cannot be used as model input.



X is equivalent to an independent variable in mathematics, and y to a dependent variable. Import the dataset splitting function to split the dataset.
from sklearn.model_selection import train_test_split



X_train is the training set, and y_train is the answer to the training set. X_test is the test set, and y_test is the answer to the test set. test_size=0.1 indicates that the ratio of the training set to the test set is 9:1. shuffle indicates that the samples are shuffled before the split.
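A sketch of the split, assuming data from the previous steps and an index column named Unnamed: 0 (name assumed):

from sklearn.model_selection import train_test_split

X = data.drop(['Unnamed: 0', 'Target'], axis=1, errors='ignore')
y = data['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=0)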



3.2.4 Standardizing Data (Preprocessing Data)
After the dataset is split, standardize the data.
from sklearn.preprocessing import StandardScaler
The standardization class StandardScaler() is first declared. The fit function is used to obtain the standard deviation and mean of the training set. Then, transform is used to transform the data.
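A sketch, assuming X_train and X_test from the split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                # Obtain mean and standard deviation from the training set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)  # Reuse the training-set statistics on the test set.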



3.2.5 Handling the Class Imbalance Issue (Preprocessing)
Next, handle the class imbalance issue, that is, the large difference between the numbers of positive and negative samples. In this dataset, the number of defaulters (represented by 1) is very small, and the number of non-defaulters (represented by 0) is very large. As a result, the model tends to classify people as non-defaulters. Check the current class ratio first.
from collections import Counter # Use collections in the standard library to query the results.
# Import the third-party library imblearn.
from imblearn import over_sampling



# X_train_fix and y_train_fix are the corrected data.
# Next, check the number of samples.
# Check the corrected y_train_fix instead of the original y_train.
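A sketch of the correction, assuming SMOTE as the oversampling method (the guide only names the over_sampling module; fit_resample is the call in recent imblearn versions):

from collections import Counter
from imblearn.over_sampling import SMOTE

print(Counter(y_train))  # Class ratio before correction.
X_train_fix, y_train_fix = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train_fix))  # Classes are now balanced.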



3.2.6 Performing Grid Search (Modeling)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Declare the logistic regression algorithm and set max_iter (the maximum number of training iterations) to 500.
# Perform judgment based on cross-validation thinking to help split the dataset.
# cv=5 indicates that the dataset is split into five equal parts.

# Apply grid search to find the optimal parameters through traversal.
# Import the grid search module.
from sklearn.model_selection import GridSearchCV
# C indicates the regularization coefficient.
# solvers indicates the optimization methods.






# Combine the regularization coefficient with the optimization method using a dictionary.
# Declare the logistic regression algorithm.
# Declare the grid search algorithm and specify the cross-validation method.
# Perform training.
# Check the optimal accuracy.
# Check which parameters are optimal.
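A sketch of the search, assuming the corrected training data from the previous step (grid values assumed):

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=500)
print(cross_val_score(lr, X_train_fix, y_train_fix, cv=5).mean())  # 5-fold baseline.

param_grid = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=5)
grid.fit(X_train_fix, y_train_fix)
print(grid.best_score_)   # Optimal accuracy.
print(grid.best_params_)  # Optimal parameters.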



3.2.7 Verifying Performance (Evaluation)
Then, verify the result.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
# Use the obtained optimal parameters for modeling.



# Construct a function to return the verification result.



# Output the performance data of the training set.
# Output the performance data of the test set.
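A sketch of the verification helper, assuming the fitted grid from the previous step:

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

best_model = grid.best_estimator_  # Refit with the optimal parameters.

def report(model, X, y):
    pred = model.predict(X)
    return (accuracy_score(y, pred), precision_score(y, pred),
            recall_score(y, pred), roc_auc_score(y, pred))

print(report(best_model, X_train_fix, y_train_fix))  # Training-set performance.
print(report(best_model, X_test, y_test))            # Test-set performance.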






3.2.8 Saving the Model
Save the model.
# Import the joblib library.
import joblib
# The dump function is used to save models. Pass the trained model and the model file name to the dump function.
# The model file must be suffixed with .pkl.
# Load the saved model again.
# After the model is loaded, use it for prediction directly.
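A sketch, assuming best_model from the evaluation step (file name assumed):

import joblib

joblib.dump(best_model, 'credit_model.pkl')  # Save the trained model.
loaded = joblib.load('credit_model.pkl')     # Load the saved model again.
print(loaded.predict(X_test[:5]))            # Use the loaded model for prediction directly.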



4 Survival Prediction of the Titanic



4.1 Introduction
4.1.1 About This Lab
This experiment predicts whether passengers on the Titanic survived, based on the Titanic datasets.



4.1.2 Objectives
Upon completion of this task, you will be able to:
• Use the Titanic datasets open on the Internet as the model input data.
• Build, train, and evaluate machine learning models.
• Understand the overall process of building a machine learning model.



4.1.3 Datasets and Frameworks
This experiment is based on train.csv and test.csv. train.csv contains the result indicating whether each passenger survived. test.csv has no target, that is, no result, and can be used as a real-world dataset. Involved parameters are as follows:
• PassengerId: passenger ID
• Pclass: cabin class (class 1/2/3)
• Name: passenger name
• Sex: gender
• Age: age
• SibSp: number of siblings/spouses aboard
• Parch: number of parents/children aboard
• Ticket: ticket No.
• Fare: ticket price
• Cabin: cabin No.
• Embarked: port of boarding



4.2 Procedure
4.2.1 Importing Related Libraries
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier



4.2.2 Importing Datasets
Step 1 Read data.



Step 2 View data.
The first five rows of data are displayed.
The last five rows of data are displayed.






The data overview helps check whether some data is missing and what the data type is.



The related numeric-type information of the data helps check the average value and other statistics.






The character-type information helps check the number of types, the type with the maximum value, and the frequency.
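A sketch of these inspection calls, assuming the two files sit in the current directory:

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print(train_df.head())                   # First five rows.
print(train_df.tail())                   # Last five rows.
train_df.info()                          # Missing values and data types.
print(train_df.describe())               # Numeric statistics.
print(train_df.describe(include=['O']))  # Character-type (object) statistics.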



Step 3 Check the survival probability corresponding to each feature based on statistics.
The intuitive data shows that passengers in class 1 cabins are more likely to survive.
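A sketch of the per-feature survival rates, assuming train_df from the previous step:

# Average survival rate per cabin class, per number of siblings/spouses, and per gender.
print(train_df[['Pclass', 'Survived']].groupby('Pclass', as_index=False).mean())
print(train_df[['SibSp', 'Survived']].groupby('SibSp', as_index=False).mean())
print(train_df[['Sex', 'Survived']].groupby('Sex', as_index=False).mean())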






The survival probability can also be examined by the number of siblings.
When the survival probability is broken down by gender, an obvious imbalance appears.



As shown in the preceding figure, most young passengers died.



The following figure shows the survival probability determined based on the cabin and age.






----End



4.2.3 Preprocessing Data
As the datasets have missing values, combine the datasets, and fill the missing values with data.



Step 1 Combine the datasets.



Step 2 Check for missing values.



Step 3



Fill the missing values with data. Process the datasets by using different methods as required. For example, fill the Fare and Embarked parameters having few missing values with the mode.



Fill the missing Age values with the average age.



Delete less significant data. Before this, assign a value to Target first.



Check whether missing values still exist.
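A sketch of the fill-and-drop sequence, assuming combined is the concatenation of both datasets and the dropped columns are the less significant ones:

combined = pd.concat([train_df, test_df], sort=False)

combined['Fare'] = combined['Fare'].fillna(combined['Fare'].mode()[0])
combined['Embarked'] = combined['Embarked'].fillna(combined['Embarked'].mode()[0])
combined['Age'] = combined['Age'].fillna(combined['Age'].mean())

target = combined['Survived']  # Assign the target before dropping it.
combined = combined.drop(['Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
print(combined.isnull().sum())  # Check whether missing values still exist.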



Step 4 Convert data.
Convert some character-type data into numeric-type data for model input. To do so, check the number of types first.
Obtain each character-type value and replace it with a numeric-type value.






test.csv cannot be used as a training test set as it does not contain Target. train.csv contains 891 pieces of data (with Target), which need to be extracted.
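A sketch of the conversion and the extraction of the 891 training rows (mappings assumed):

combined['Sex'] = combined['Sex'].map({'male': 0, 'female': 1})
combined['Embarked'] = combined['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

X = combined.iloc[:891]  # The first 891 rows come from train.csv.
y = target.iloc[:891]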



----End



4.2.4 Building a Model
This section describes how to build a model. To build a model, split the training set and test set first.



Step 1 Split the dataset.
from sklearn.model_selection import train_test_split



Step 2 Train a model.
The logistic regression algorithm, random forest algorithm, and AdaBoost are used for training.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import ensemble
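A sketch of the three models, assuming X and y from the preprocessing step:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=500), RandomForestClassifier(n_estimators=100), AdaBoostClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # Held-out accuracy.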



As shown above, the random forest algorithm has a good effect.



Step 3 Predict data.



----End



5 Linear Regression



5.1 Introduction
5.1.1 About This Lab
This experiment uses basic Python code and the simplest data to reproduce how a linear regression algorithm iterates and fits the existing data distribution. The NumPy and Matplotlib modules are used: NumPy for calculation and Matplotlib for drawing.



5.1.2 Objectives
Upon completion of this task, you will be able to:
• Be familiar with basic Python statements.
• Master the procedure for implementing linear regression.



5.2 Procedure
5.2.1 Preparing Data
Randomly set ten pieces of data in a linear relationship, and convert the data into an array format so that it can be calculated directly with multiplication and addition.
Code:
# Import the required modules: NumPy for calculation and Matplotlib for drawing.
import numpy as np
import matplotlib.pyplot as plt
# This code is used only for Jupyter Notebook.
%matplotlib inline
# Define data and convert the list into an array.



# Display the data through the scatter chart.
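A sketch of the data preparation, assuming ten roughly linear points (values assumed):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([3, 5, 7, 9, 11, 13, 15, 17, 19, 21], dtype=float)  # Roughly y = 2x + 1.

plt.scatter(x, y)  # Display the data through the scatter chart.
plt.show()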






Output:



Figure 5-1 Scatter chart



5.2.2 Defining Related Functions
Model function: defines the linear regression model wx+b.
Loss function: calculates the mean square error.
Optimization function: calculates the partial derivatives of w and b by using the gradient descent method.
Code:
# The basic linear regression model is wx+b. In this example, the model is ax+b as a two-dimensional space is used.



# The mean square error loss function is the most commonly used loss function in the linear regression model.



# The optimization function mainly uses the partial derivatives to update a and b.



# Perform function iteration to return a and b.
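A sketch of the four functions; the gradients follow from the halved mean square error L = (1/2N) Σ (ax + b − y)²:

def model(a, b, x):
    return a * x + b  # Linear model ax + b.

def loss_function(a, b, x, y):
    num = len(x)
    return (0.5 / num) * np.sum((model(a, b, x) - y) ** 2)  # Mean square error.

def optimize(a, b, x, y, lr=0.01):
    num = len(x)
    prediction = model(a, b, x)
    da = (1.0 / num) * np.sum((prediction - y) * x)  # Partial derivative w.r.t. a.
    db = (1.0 / num) * np.sum(prediction - y)        # Partial derivative w.r.t. b.
    return a - lr * da, b - lr * db                  # Gradient descent update.

def iterate(a, b, x, y, times):
    for _ in range(times):
        a, b = optimize(a, b, x, y)
    return a, b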






5.2.3 Starting Iteration
Step 1 Initialize the iterative optimization model.
Code:
# Initialize and display parameters.
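A sketch of the initialization and first iteration, using the functions defined above:

a, b = 0.0, 0.0                         # Initial parameters.
a, b = iterate(a, b, x, y, 1)
print(a, b, loss_function(a, b, x, y))  # Parameter values and loss.
plt.scatter(x, y)
plt.plot(x, model(a, b, x), 'r')        # Current fitted line.
plt.show()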



Output:



Figure 5-2 First iteration



Step 2



Perform the second iteration and display the parameter values, loss values, and visualization effect. Code:



Output:






Figure 5-3 Second iteration



Step 3



Perform the third iteration and display the parameter values, loss values, and visualization effect. Code:



Output:






Figure 5-4 Third iteration



Step 4



Perform the fourth iteration and display the parameter values, loss values, and visualization effect. Code:



Output:






Figure 5-5 Fourth iteration



Step 5



Perform the fifth iteration and display the parameter values, loss values, and visualization effect. Code:



Output:



Figure 5-6 Fifth iteration



Step 6



Perform the 10000th iteration and display the parameter values, loss values, and visualization effect. Code:



Output:






Figure 5-7 10000th iteration



----End



5.3 Thinking and Practices
5.3.1 Question 1
Must the loss value return to zero when the raw data is modified?
5.3.2 Question 2
What is the effect of the learning rate lr when it is modified?



6 Flower Category Analysis



6.1 Introduction
6.1.1 About This Lab
This experiment uses a dataset with a small sample quantity: the open-source Iris dataset provided by scikit-learn. The Iris prediction project is a simple classification model, through which you can understand the basic usage and data processing methods of the machine learning library sklearn.



6.2 Experiment Code
6.2.1 Importing Related Libraries
import numpy as np
import matplotlib.pyplot as plt



6.2.2 Importing a Dataset
The dataset is built into sklearn, so no external dataset needs to be imported.
from sklearn.datasets import load_iris



According to the preceding code, x is specified as a feature, and y as a label. The dataset includes a total of 150 samples and four features: sepal length, sepal width, petal length, and petal width.



6.2.3 Splitting the Dataset
Split the data into a training set and a test set.
from sklearn.model_selection import train_test_split
View the data size after the splitting.
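A sketch of the load and split, assuming an 80/20 split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target  # 150 samples, 4 features.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x.shape, x_train.shape, x_test.shape)  # Sizes before and after the splitting.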






Data size before the splitting:



Data size after the splitting:



6.2.4 Performing Modeling
6.2.4.1 Logistic Regression
Import the algorithm models to be used.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier



Logistic regression is used for modeling first. The one-vs-rest (OvR) multiclass method is used for logistic regression by default.
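A sketch of the fit and score, assuming the split data from above:

lr = LogisticRegression(max_iter=200)
lr.fit(x_train, y_train)
print(lr.score(x_test, y_test))  # Test accuracy.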



As shown above, logistic regression performs well.



6.2.4.2 SVM
Use the support vector machine (SVM) for classification. sklearn's SVC implements multiclass classification with a one-vs-one (OvO) scheme internally by default.
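A minimal sketch of the SVM step (default kernel and hyperparameters assumed):

clf = svm.SVC()
clf.fit(x_train, y_train)
print('SVM accuracy:', clf.score(x_test, y_test))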






6.2.4.3 Decision Tree
Use the decision tree algorithm.
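A minimal sketch of the decision tree step (default hyperparameters assumed):

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
print('Decision tree accuracy:', dt.score(x_test, y_test))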



6.2.4.4 K-Nearest Neighbors Algorithm
Use the k-nearest neighbors algorithm.



Three neighbors are set for the k-nearest neighbors algorithm, but another number of neighbors may give better accuracy. Therefore, a loop is used to search for the optimal number of neighbors.
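One way to run that search (the range of 1 to 10 neighbors is an assumption):

scores = []
k_range = range(1, 11)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    scores.append(knn.score(x_test, y_test))

plt.plot(k_range, scores)
plt.xlabel('number of neighbors')
plt.ylabel('test accuracy')
plt.show()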






As shown in the figure above, the k-nearest neighbors algorithm performs best with one nearest neighbor.



6.2.5 Effect After Data Preprocessing
Consider data standardization before modeling.
from sklearn.preprocessing import StandardScaler
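A minimal sketch of the standardization step (the _std variable names are assumptions):

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # fit on the training set only
x_test_std = scaler.transform(x_test)        # reuse the training-set statistics
print(x_train_std.std(axis=0), x_train_std.mean(axis=0))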



After standardization, the standard deviation of each feature is 1, and the mean value is approximately 0.



Then, use the SVM to perform modeling after the standardization, assigning new names to the standardized training and test sets.



As shown above, the SVM accuracy also improves after the standardization.



7 Emotion Recognition of Customer Evaluations in the Retail Industry



7.1 Introduction
Emotion analysis is a classification technology based on natural language processing (NLP) and is usually applied to classify the emotional content of texts. Compared with related recommendations and precision marketing, users prefer to view or listen to the personal experience and feedback of users of the same type. For example, evaluations from users who have purchased similar products, and comparison results from users who have used similar products, can bring bidirectional value to users and enterprises. This experiment discusses the problem and walks through the practice step by step from the perspectives of problem statement, breakdown, priority ranking, solution design, key point analysis, and summary and suggestions, cultivating project implementation thinking for building an evaluation emotion analysis project from scratch.



7.1.1 Objectives
Upon completion of this task, you will be able to:
 Clarify the function and business value of emotion analysis.
 Understand the differences between conventional machine learning and deep learning in emotion analysis methods.
 Clarify label extraction methods for emotion analysis.
 Master deep learning-based emotion analysis methods.
 Understand future applications of emotion analysis.



7.1.2 Case Background
The case in this document is for reference only. The actual procedure may vary. For details, see the corresponding product documents.
Data engineer A works in the market data analysis department of a hi-tech company. The company plans to develop home appliance services, such as a smart TV and smart reader, but it does not know the current state of the market or how users evaluate such products. Therefore, the company wants the data department to deliver a market data survey report as soon as possible. Engineer A considers using the NLP technology to analyze users' evaluation tendencies and evaluation keywords for competitors' products of the same type, and to build an emotion prediction model that predicts users' emotional tendency from texts.






7.2 Procedure

7.2.1 Data Management
The following fields are involved:
 Id: ID
 reviews.rating: score
 reviews.text: text evaluation
 reviews.title: evaluation keywords
 reviews.username: name of the evaluator



This dataset contains 21 attribute fields and 34,657 data samples. The experiment aims to analyze customer evaluation data. Therefore, this document describes only the data attributes required in this experiment.



7.2.2 Data Reading
After obtaining the provided data files, read and view the data in Python by performing the following steps:



Step 1



Import common library files such as sklearn, pandas, and numpy. sklearn is a powerful third-party machine learning library of Python that covers everything from data preprocessing to model training. Most functions in the sklearn library are classified into estimators and transformers. An estimator is equivalent to a modeling tool and is used to predict data. Common estimator functions include fit(x, y) and predict(x). A transformer is used to process data, such as reducing dimensions and standardizing data. Common transformer functions include transform(x) and fit_transform(x, y).
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import nltk.classify.util
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk
%matplotlib inline



Step 2



Read data from a local disk.
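A minimal sketch of the reading step; the file name is a hypothetical placeholder for the data file provided with the lab:

# 'reviews.csv' is a placeholder; use the path of the provided data file.
data = pd.read_csv('reviews.csv')
data.head()  # preview the first five rows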






Step 3






Visualize the data. View the first five rows and check the data attribute columns.



Output:



This experiment uses only the reviews.rating, reviews.text, reviews.username, and reviews.title attribute columns. Therefore, extract these four columns from the dataset a second time and name the extracted data permanent to facilitate the subsequent experiment procedure.
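One way to perform that extraction (assuming the column names above):

# Keep only the four columns used in this experiment.
permanent = data[['reviews.rating', 'reviews.text', 'reviews.username', 'reviews.title']].copy()
permanent.head()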



View the missing values.
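A minimal sketch of the missing-value check:

permanent.isnull().sum()  # count of missing values per column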



Output:



The reviews.rating attribute column is indispensable to emotion analysis. The dataset contains 34,657 data samples, which is a large volume, so the samples whose reviews.rating value is missing can be dropped from the labeled data. Specifically, extract the data with the reviews.rating value and name it senti, and extract the data without the reviews.rating value and name it check.
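A minimal sketch of that split (senti keeps the rated rows, check keeps the unrated ones):

senti = permanent[permanent['reviews.rating'].notnull()].copy()  # rows with a rating
check = permanent[permanent['reviews.rating'].isnull()].copy()   # rows without a rating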



Machine Learning Lab Guide-Student Version



Page 53



With respect to score processing, this experiment defines data samples with a reviews.rating value greater than or equal to 4 as positive (pos) and those with a reviews.rating value less than 4 as negative (neg), and renames the reviews.rating attribute column senti.
replace(x, y): replaces x with y.
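One way to apply that rule; using apply() here instead of chained replace() calls is an illustrative choice:

senti['senti'] = senti['reviews.rating'].apply(lambda r: 'pos' if r >= 4 else 'neg')
senti = senti.drop(columns=['reviews.rating'])
senti['senti'].value_counts().plot(kind='bar')  # visualize the class balance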



Visualize the data after identifying the samples as positive or negative.



Output:



The output shows that the data is unbalanced.



7.2.3 Data Processing
A regular expression is used to check whether a string matches a pattern. The re module is available in Python 1.5 and later versions and gives Python full regular expression functionality:
 re.sub(): replaces the matched items in a string.
re.sub(pattern, repl, string, count=0, flags=0)
 pattern: pattern string of the regular expression.
 repl: string to replace the matches with; it can also be a function.
 string: original string to be searched and replaced.
 count: maximum number of replacements after pattern matching. The default value is 0, indicating that all matches are replaced.



Machine Learning Lab Guide-Student Version



Page 54



Data slicing uses pandas.DataFrame.sample to randomly select several rows of data. Its main parameters are as follows:
 n: number of rows to be extracted.
 frac: proportion of rows to be extracted. If frac is set to 0.8, 80% of the rows are extracted.
 replace: indicates whether sampling is performed with replacement. The value True indicates sampling with replacement.
 random_state: seed of the random number generator. If random_state is set to None, the sampled data differs each time.

Step 1

Import related packages.
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
import numpy as np
import re
import string
import nltk



Step 2



Preprocess data. Text data includes spaces, punctuation marks, and digits. This experiment focuses on text (English) analysis. Therefore, delete all information other than letters: define a cleanup() function that uses a regular expression to delete non-letter characters, uses the lower() function to convert uppercase letters into lowercase ones, and removes whitespace characters, including '\n', '\r', '\t', and ' '. After apply() is used, the cleaned reviews.text attribute is saved as the Summary_Clean column.
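A possible implementation of that cleanup (the exact regular expression in the lab may differ):

def cleanup(text):
    # Keep letters only, lowercase everything, and collapse whitespace.
    text = re.sub('[^a-zA-Z]', ' ', str(text))
    text = text.lower()
    return ' '.join(text.split())  # removes '\n', '\r', '\t', and extra spaces

senti['Summary_Clean'] = senti['reviews.text'].apply(cleanup)
check['Summary_Clean'] = check['reviews.text'].apply(cleanup)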



Step 3



Generate a training set and a test set. Obtain ["Summary_Clean","senti"] from the senti dataset and save it as the split dataset.



Output:






Use 80% of the data in split as the training set train through split.sample(), remove the rows already used in train from split through drop(), and use the remaining data as the test set test.
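A minimal sketch of that split (the random_state value is an assumption):

split = senti[['Summary_Clean', 'senti']]
train = split.sample(frac=0.8, random_state=200)  # 80% for training
test = split.drop(train.index)                    # the remaining 20% for testing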



Output:



7.2.4 Model Training

7.2.4.1 Model Selection
Emotion analysis of customer evaluations is essentially a classification problem, which can be solved by using a classification model. Practice has proved that a Naive Bayes model based on all words performs well on some problems, while a model using a word subset performs well on others. Logistic regression (LR), multinomial NB, and Bernoulli NB are selected based on comprehensive consideration.






7.2.4.2 Model Calculation and Evaluation
The Naive Bayes classifier usually uses one of three models: the Gaussian model, the multinomial model, and the Bernoulli model. The three models correspond to the functions GaussianNB(), MultinomialNB(), and BernoulliNB() in sklearn.
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
 GaussianNB() is used when a feature is a continuous variable.
 MultinomialNB() is used when a feature is a discrete variable.
 BernoulliNB() is used when a feature is a discrete variable that can be set only to 1 or 0.



The model calculation phase consists of the following steps:



Step 1



Import libraries.
from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer



Step 2



Build and evaluate a model. Convert the data in the training set, test set, and verification set into a list and create indexes.



Set all words in train["words"] to True and attach the neg or pos label to each sentence based on the scoring criteria, as sketched below.



Use a trained classifier to attach emotion labels to the test set and verification set to predict whether words in the test set and verification set are positive or negative.
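A minimal sketch of the NLTK feature construction and training; the helper name word_features and the column names are assumptions:

def word_features(sentence):
    # NLTK expects a {feature: value} dict; mark every word as present.
    return {word: True for word in sentence.split()}

train_naive = [(word_features(s), label)
               for s, label in zip(train['Summary_Clean'], train['senti'])]
test_naive = [(word_features(s), label)
              for s, label in zip(test['Summary_Clean'], test['senti'])]

classifier = NaiveBayesClassifier.train(train_naive)
print('NLTK Naive Bayes accuracy:', nltk.classify.util.accuracy(classifier, test_naive))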






Output:



Output:



The original dataset check does not contain reviews.rating data. As shown in the preceding figure, after the classifier is created based on the training set, it predicts whether each review is negative or positive.
from sklearn.naive_bayes import MultinomialNB



Use the CountVectorizer class to perform vectorization, invoke the TfidfTransformer class to perform preprocessing, construct the term frequency (TF) vector, and calculate the importance of words. The training set, test set, and verification set are converted into X_train_tfidf, X_test_tfidf, and checktfidf, respectively. The main idea of TF-IDF is as follows: if a word or phrase has a high TF in one article but a low frequency in other articles, the word or phrase is considered to have a good class-distinguishing capability. TF-IDF tends to filter out commonly used words and retain important words.






The CountVectorizer class converts the words in the text into a TF matrix, and its fit_transform() function counts the occurrences of each word. In general, use CountVectorizer to extract features and then TfidfTransformer to calculate the weight of each feature.
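A possible implementation of that pipeline (the min_df value and the use of the wordcloud STOPWORDS list are assumptions):

count_vect = CountVectorizer(min_df=2, stop_words=list(STOPWORDS))
tfidf_transformer = TfidfTransformer()

X_train_counts = count_vect.fit_transform(train['Summary_Clean'])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.transform(test['Summary_Clean'])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

checkcounts = count_vect.transform(check['Summary_Clean'])
checktfidf = tfidf_transformer.transform(checkcounts)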



Retain important words based on tfidf_transformer to construct the MultinomialNB model.



Output:



Retain important words based on tfidf_transformer to construct the BernoulliNB model.
from sklearn.naive_bayes import BernoulliNB



Output:



Retain important words based on tfidf_transformer to construct the LR model.
from sklearn import linear_model



Output:



In comparison, the LR model has higher accuracy than the other two models.
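A compact sketch of fitting and comparing the three models on the TF-IDF features (the C value for logistic regression is an assumption):

y_train = train['senti']
y_test = test['senti']

for name, clf in [('MultinomialNB', MultinomialNB()),
                  ('BernoulliNB', BernoulliNB()),
                  ('LogisticRegression', linear_model.LogisticRegression(C=1e5))]:
    clf.fit(X_train_tfidf, y_train)
    print(name, 'accuracy:', clf.score(X_test_tfidf, y_test))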



Step 3



Verify the model. Select the LR model for verification.
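A minimal sketch of the verification step, scoring the unlabeled check set with the LR model:

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X_train_tfidf, y_train)

# Probability of neg/pos for each sentence in the check set.
probabilities = logreg.predict_proba(checktfidf)
print(logreg.classes_)    # column order of the probabilities
print(probabilities[:5])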






Output:



The classifier accurately provides the positive probability and negative probability of each sentence.



Step 4



Build a word cloud.
from wordcloud import WordCloud, STOPWORDS
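A minimal sketch for the positive reviews (the negative word cloud is built the same way; the WordCloud parameters are assumptions):

pos_text = ' '.join(senti.loc[senti['senti'] == 'pos', 'Summary_Clean'])
wc = WordCloud(stopwords=STOPWORDS, background_color='white', max_words=200)
wc.generate(pos_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()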



Output:



----End



8 Boston Housing Price Forecast



8.1 Introduction

8.1.1 About This Lab
This experiment uses a dataset with a small sample quantity: the open-source Boston housing price dataset provided by scikit-learn. The Boston housing price forecast project is a simple regression model. By using this model, you can understand the basic usage and data processing methods of the machine learning library sklearn.



8.1.2 Objectives
Upon completion of this task, you will be able to:
 Use the Boston housing price dataset open to the Internet as the model input data.
 Build, train, and evaluate machine learning models.
 Understand the overall process of building a machine learning model.
 Master the application of machine learning model training, grid search, and evaluation indicators.
 Master the application of related APIs.



8.1.3 Experiment Dataset and Framework
This experiment is based on the Boston housing price dataset, which contains 506 samples with 13 features. Each data record contains detailed information about the house and its surroundings. To be specific, the dataset includes the following features:
 CRIM: per capita crime rate by town
 ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
 INDUS: proportion of non-retail business acres per town
 CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 NOX: nitric oxide concentration
 RM: average number of rooms per dwelling
 AGE: proportion of owner-occupied units built prior to 1940
 DIS: weighted distances to five Boston employment centers
 RAD: index of accessibility to radial highways
 TAX: full-value property-tax rate per $10,000
 PTRATIO: pupil-teacher ratio by town
 B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
 LSTAT: % lower status of the population



The target is the median value of owner-occupied homes in units of $1000. The sklearn framework provides the Boston housing price data and functions such as dataset splitting, standardization, and evaluation, and integrates various common machine learning algorithms. In addition, XGBoost, which is optimized from the gradient boosted decision tree (GBDT), is used as an ensemble algorithm.



8.2 Procedure

8.2.1 Introducing the Dependency
Code:
# Prevent unnecessary warnings.
import warnings
warnings.filterwarnings("ignore")
# Introduce the basic packages of data science.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import seaborn as sns
# Set attributes to prevent garbled characters in Chinese.

# Introduce machine learning: preprocessing, model selection, and evaluation indicators.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
# Import the Boston dataset used this time.
from sklearn.datasets import load_boston
# Introduce algorithms.
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression, ElasticNet
# Compared with SVC, SVR is the regression form of the SVM.
from sklearn.svm import SVR
# Ensemble algorithms.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor






8.2.2 Loading the Dataset, Viewing Data Attributes, and Visualizing Data

Step 1



Load the Boston housing price dataset and display related attributes.
Code:
# Load the Boston housing price dataset.
# x: features; y: labels.
# Display related attributes.
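A minimal sketch of the elided code:

boston = load_boston()
x = boston.data     # features
y = boston.target   # labels: median home value in $1000s
print('Feature column names:', boston.feature_names)
print('sample quantity: %d, feature quantity: %d, target sample quantity: %d'
      % (x.shape[0], x.shape[1], y.shape[0]))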



Output:
Feature column names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'], sample quantity: 506, feature quantity: 13, target sample quantity: 506



Step 2



Convert the data into the data frame format.
Code:
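A minimal sketch (the DataFrame and price column names are assumptions):

boston_df = pd.DataFrame(x, columns=boston.feature_names)
boston_df['PRICE'] = y
boston_df.head()  # first five samples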



Output:



Figure 8-1 Information about the first five samples



Step 3



Visualize the label distribution.
Code:
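A minimal sketch using seaborn (sns.distplot matches the seaborn versions of this lab's era; newer versions use sns.histplot instead):

sns.distplot(y, bins=30)
plt.xlabel('Median home value ($1000s)')
plt.show()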






Output:



Figure 8-2 Target data distribution
----End



8.2.3 Splitting and Preprocessing the Dataset
Code:
# Segment the data.
# Standardize the dataset.
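A minimal sketch of both steps (the test_size and random_state values are assumptions):

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=28)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training set
x_test = scaler.transform(x_test)        # reuse the training-set statistics
x_train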



Output:
array([[-0.35451414, -0.49503678, -0.15692398, ..., -0.01188637, 0.42050162, -0.29153411],
       [-0.38886418, -0.49503678, -0.02431196, ..., 0.35398749, 0.37314392, -0.97290358],
       [0.50315442, -0.49503678, 1.03804143, ..., 0.81132983, 0.4391143, 1.18523567],
       ...,
       [-0.34444751, -0.49503678, -0.15692398, ..., -0.01188637, 0.4391143, -1.11086682],
       [-0.39513036, 2.80452783, -0.87827504, ..., 0.35398749, 0.4391143, -1.28120919],
       [-0.38081287, 0.41234349, -0.74566303, ..., 0.30825326, 0.19472652, -0.40978832]])



8.2.4 Performing Modeling on the Dataset by Using Various Regression Models
Code:
# Set the model names.



#Define the model.






# cv is the cross-validation idea here.



# Output the R2 scores of all regression models.
# Define the R2 scoring function.



#Traverse all models to score.
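A possible implementation of this section; the model list and all hyperparameter values are assumptions for illustration:

names = ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet', 'SVR',
         'RandomForest', 'GBDT', 'XGBoost']
models = [LinearRegression(),
          RidgeCV(alphas=(0.001, 0.1, 1), cv=3),       # cv applies cross-validation
          LassoCV(alphas=(0.001, 0.1, 1), cv=5),
          ElasticNet(alpha=0.001, max_iter=10000),
          SVR(),
          RandomForestRegressor(n_estimators=100),
          GradientBoostingRegressor(n_estimators=30),
          XGBRegressor(n_estimators=50)]

def r2(model, x_train, y_train, x_test, y_test):
    # Fit the model and return its R2 score on the test set.
    model.fit(x_train, y_train)
    return r2_score(y_test, model.predict(x_test))

for name, model in zip(names, models):
    print('%s: R2 = %.4f' % (name, r2(model, x_train, y_train, x_test, y_test)))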



Output:



8.2.5 Adjusting Grid Search Hyperparameters

Step 1

Build a model.
Code:
'''
'kernel': kernel function
'C': SVR regularization factor
'gamma': kernel coefficient for 'rbf', 'poly' and 'sigmoid', which affects the model performance
'''



#Use grid search and perform cross validation.
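A minimal sketch of the grid search; the candidate values are assumptions chosen so that the optimum reported below (C=5, gamma=0.1, kernel='rbf') lies inside the grid:

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 0.5, 0.9, 1, 5],
    'gamma': [0.001, 0.01, 0.1, 1]
}
model = GridSearchCV(SVR(), param_grid=parameters, cv=3)
model.fit(x_train, y_train)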



Output:






Step 2



Obtain the optimal parameters.
Code:
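A minimal sketch using GridSearchCV's standard attributes:

print('Optimal parameter list:', model.best_params_)
print('Optimal model:', model.best_estimator_)
print('Optimal R2 value:', model.best_score_)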



Output:
Optimal parameter list: {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}
Optimal model: SVR(C=5, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1, kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
Optimal R2 value: 0.797481706635164



Step 3



Visualize the output.
Code:
# Perform visualization.



#Set the canvas.



#Draw with a red solid line.



#Draw with a green solid line.



#Display in a diagram.
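A possible implementation following the comments above (figure size and labels are assumptions):

ln_x_test = range(len(x_test))
y_predict = model.predict(x_test)

plt.figure(figsize=(16, 8), facecolor='w')            # set the canvas
plt.plot(ln_x_test, y_test, 'r-', lw=2, label='true values')          # red solid line
plt.plot(ln_x_test, y_predict, 'g-', lw=2, label='SVR predictions')   # green solid line
plt.legend(loc='upper left')
plt.grid(True)
plt.title('Boston housing price forecast (SVR)')
plt.show()                                            # display in a diagram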



Output:






Figure 8-3 Visualized result
----End



9 E-commerce Website User Group Analysis



9.1 Introduction

9.1.1 About This Lab
This experiment performs modeling based on the k-means algorithm by using a virtual dataset automatically generated by sklearn to obtain user categories. It is a clustering experiment that demonstrates how to select the optimal k value and observe the effect in a visualized manner.



9.2 Experiment Code

9.2.1 Using sklearn for Modeling

Step 1



Import libraries.
import numpy as np
import matplotlib.pyplot as plt



Step 2



Create a dataset. Create virtual data for the algorithm model.
from sklearn.datasets import make_blobs



The built-in tool of sklearn is used to create the virtual data, which is scientific and conforms to a normal distribution. Parameter settings are as follows:
 n_samples: set to 2000, indicating that 2000 sample points are generated.
 centers: set to 2, indicating that the data actually has two centers.
 n_features: set to 2, indicating the number of features.

For ease of illustration in the coordinate system, only two features are used.
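A minimal sketch of the data generation; random_state=3 follows the reproducibility note further below:

x, y = make_blobs(n_samples=2000, centers=2, n_features=2, random_state=3)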



Step 3



Apply the k-means algorithm.
from sklearn.cluster import KMeans






n_clusters=5: indicates that five data clusters are expected. However, there are only two data categories.
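A minimal sketch of the clustering step:

kmeans = KMeans(n_clusters=5)     # ask for five clusters although the data has two centers
y_pred = kmeans.fit_predict(x)    # cluster label for every sample point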



Step 4



Visualize the output.
import matplotlib.pyplot as plt
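A minimal sketch of the visualization, coloring the points by their predicted cluster:

plt.scatter(x[:, 0], x[:, 1], c=y_pred)
plt.show()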



Output:



Different data is generated each time. Therefore, the output diagram may be different from that in the lab. To generate the same data, add the random_state parameter during data generation.



In this example, random_state is set to 3 so that the same data is generated on every run.






Step 5






Use more features for comparison.



In this example, ten features are used to generate data, random_state is set to 30, and there are three categories in theory.



----End






9.2.2 Selecting the Optimal k Value
In the preceding steps, the k value is manually set. In actual environments, the number of centers is unknown. Therefore, you need to find the optimal k value.
import random



First, generate two random numbers ranging from 1 to 30 (indicating that the number of true centers in the data is unknown), and use a random number of features.



Then, perform k-means clustering in a loop over candidate k values. The .inertia_ attribute returns the sum of squared distances from the sample points to their closest cluster center.
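A possible implementation of this elbow-method search; the candidate range of 1 to 30 and the variable names are assumptions:

k_true = random.randint(1, 30)    # unknown "true" number of centers
n_feat = random.randint(2, 10)    # random number of features
x, _ = make_blobs(n_samples=2000, centers=k_true, n_features=n_feat)

distances = []
k_range = range(1, 31)
for k in k_range:
    km = KMeans(n_clusters=k).fit(x)
    distances.append(km.inertia_)  # sum of squared distances to the closest center

plt.plot(k_range, distances)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()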



Visualize the result by using a visualization tool.



The result varies each time due to the random numbers. As shown in the preceding figure, the turning point (elbow) appears at the value 21. Therefore, 21 is the optimal k value.