In this exercise, we use an LSTM to recognize the language of origin of character sequences. Exploration of the data reveals an imbalanced structure over the three classes: French and Spanish samples dominate the set, whereas English samples make up only a small fraction. We resample the data so that the new training set is balanced over the three classes. The data are then tokenized with the Keras tokenization module. The model reaches good performance with both LSTM and GRU units, with less than 1% misclassification. We observed roughly 4% overfitting, and adding a dropout layer decreased rather than improved performance. A novel approach is proposed for making confident predictions based on a "bootstrapped" modelling process; this is illustrated and discussed in Section 3.5. Finally, alternative approaches for dealing with imbalanced data are discussed in Section 3.6. A copy of the basic code for this exercise can be found in the file main_file.py.
Two data sets are provided: a labelled training set and an unlabelled set. We build models on the labelled data and predict the language of the unlabelled samples. The training data contain one feature column and one categorical target variable. Among the 10000 observations in the training set, the samples are unevenly distributed: French and Spanish dominate with 4338 and 4542 samples respectively, while English has only 1120. Given this nearly 1:4 imbalance, we decided to preprocess the data. A simple solution is taken in this exercise: 1120 French and 1120 Spanish samples are randomly drawn without replacement from the original data set. The final training set is the combination of these two subsamples plus all English samples, giving 3360 observations in total and a class ratio of 1:1:1.

A second important requirement for the LSTM is that every input sequence has uniform length. The data are well collected in this regard, with each segment containing exactly 40 characters. A further investigation reveals that 104 distinct characters appear in our sample. Tokenization produces the following character-to-index mapping:

{'X': 74, 'a': 3, '6': 103, 'C': 45, 's': 6, '-': 29, '2': 87, ']': 79, 'U': 65, 'l': 10, 'ï': 95, '!': 57, 'q': 22, ' ': 1, 'g': 18, 'O': 50, "'": 27, 'b': 19, '1': 80, '«': 86, 'È': 99, '.': 26, 'Ó': 104, 'É': 88, '\x9c': 90, 'à': 41, 'H': 47, 'Á': 97, ')': 84, 'E': 35, '¡': 78, 'ë': 98, '[': 85, 'h': 13, 'V': 64, '0': 89, '»': 81, 'o': 4, 'K': 91, 'û': 76, 'J': 69, '?': 59, 'ô': 70, 'ñ': 55, 'T': 36, 'Y': 68, '_': 72, 'c': 14, 'ú': 71, 'x': 34, 'ê': 56, 'F': 58, 'Ê': 100, 'ù': 73, 'S': 30, 'é': 25, 'B': 63, 'e': 2, 'á': 46, 'í': 32, 'm': 15, ':': 54, 'd': 11, 'W': 60, 'î': 66, 'ç': 75, 'j': 28, 'n': 5, 'u': 12, 'i': 9, '3': 92, '(': 83, 'è': 48, 'R': 49, '9': 102, 'p': 17, ';': 37, '7': 101, 'v': 20, 'G': 67, 'I': 43, 'Q': 38, 'z': 40, 't': 7, 'k': 33, 'L': 51, 'y': 23, 'N': 44, 'M': 53, '5': 93, 'â': 62, '¿': 77, 'w': 24, 'ó': 42, 'A': 39, 'P': 52, '"': 61, '8': 96, '4': 94, ',': 16, 'Z': 82, 'f': 21, 'r': 8, 'D': 31}

Note that this dictionary keeps both upper- and lower-case letters, which may be considered redundant information; however, running the LSTM on both the full dictionary and a case-folded one shows no change in performance.

Since we must label an unlabelled data set, whose prediction accuracy can never be fully verified, we divide our labelled sample into three parts: training (70%), validation (15%), and test (15%). This allows us to train and tune the model in an unbiased way on the training and validation sets; once a model is selected, its accuracy can be estimated on the held-out labelled test set.

An important stage in this exercise is tokenizing and reshaping the data so that the input meets the LSTM's requirements. This is done with the Keras tokenizer followed by conversion to a NumPy array. A single observation becomes an array of shape [40, 105]: each of the 40 characters is represented by a one-hot vector of 105 entries (104 unique characters plus one, because Keras's texts_to_matrix reserves an extra entry for every vector). For example, the segment <<é à choisir les armes; ici c'est le défi>> starts with the characters 'é' and ' ', which map to indices 25 and 1 in the character-to-index dictionary; accordingly, the first row has a 1 at entry 25, the second row has a 1 at entry 1, and all remaining entries are zero. The final input tensor has shape [batch_size, sequence_length, one_hot_dimension], i.e. [3360, 40, 105]. As for the target variable, language, we use pandas get_dummies to convert the three categories into three one-hot columns.
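As a concrete illustration, the following is a minimal sketch of this preprocessing pipeline. The column names ("text", "language"), the input file name, and the use of pandas sample for resampling are assumptions made for illustration and are not taken from main_file.py.

# Minimal sketch of the preprocessing described above (assumed column/file names).
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer

df = pd.read_csv("train_data_language.csv")            # assumed file name

# Balance the classes: 1120 samples per language, drawn without replacement.
balanced = pd.concat([
    df[df["language"] == lang].sample(n=1120, replace=False, random_state=0)
    for lang in ("French", "Spanish", "English")
])

# Character-level tokenization: one dictionary over all 104 distinct characters.
tok = Tokenizer(char_level=True, filters="", lower=False)
tok.fit_on_texts(balanced["text"].tolist())

# Each 40-character segment becomes a [40, 105] one-hot matrix
# (texts_to_matrix reserves index 0, hence 104 + 1 columns).
X = np.stack([tok.texts_to_matrix(list(seq), mode="binary")
              for seq in balanced["text"]])             # shape: (3360, 40, 105)

# One-hot encode the target with pandas dummies.
y = pd.get_dummies(balanced["language"]).values         # shape: (3360, 3)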
Building on the input layer, we use an LSTM layer with 10 units, followed by a Dense layer of 10 units with softplus activation and, finally, a Dense layer of three units with softmax activation to output class probabilities. Next, we swap the LSTM units for GRU units; the results are discussed in the next section. For both models, results are compared with and without a dropout layer.
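A minimal Keras sketch of this architecture is given below. The optimizer and loss choices are assumptions, and the commented-out Dropout line only marks where the dropout variant would be inserted.

# Sketch of the architecture described above: 10 LSTM units, a 10-unit softplus
# Dense layer, and a 3-unit softmax output.  Swap LSTM for GRU to compare.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(10, input_shape=(40, 105)))      # 40 characters, 105-dim one-hot
# model.add(Dropout(0.2))                       # optional dropout variant
model.add(Dense(10, activation="softplus"))
model.add(Dense(3, activation="softmax"))       # one probability per language
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])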
In our LSTM model, we train on the training set and validate on the validation set. Without dropout, the algorithm converges within the first 120 epochs, yielding less than 1% misclassification. On the other hand, we observe roughly 4% overfitting; the validation accuracy reaches 95.24%. The model output looks like the following:

Using TensorFlow backend.
If the features in training and testing set have the same length: True
Epoch 1/1000
40/2353 [..............................] - ETA: 40s - loss: 1.1336 - acc: 0.3250
…….
2320/2353 [============================>.] - ETA: 0s - loss: 2.2749e-07 - acc: 1.0000
2353/2353 [==============================] - 1s - loss: 2.2597e-07 - acc: 1.0000
Validation accuracy: 95.24%
Test accuracy: 97.81%

Switching to GRU units shows no significant difference in performance. Adding a dropout layer, on the other hand, does not increase the test accuracy: even setting the probability to 0.8 yields an accuracy of 94.43%, lower than the model above. We discuss this in the last section.
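A hypothetical training and evaluation call consistent with the log above might look as follows; the batch size of 40 and the names of the 70/15/15 split arrays are assumptions.

# Train with validation monitoring, then evaluate on the held-out test set.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=1000, batch_size=40, verbose=1)

val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print("Validation accuracy: %.2f%%" % (100 * val_acc))
print("Test accuracy: %.2f%%" % (100 * test_acc))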
On the unlabelled test set, we provide two predictions based on our models. The first comes from a model with test accuracy 94.83%, and the second from the model presented in the previous section, with expected accuracy 97.81%. Readers may refer to the file prediction_data_test_language.csv for the complete results. Multiple predictions provide extra information, as discussed in the next section.
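For reference, a sketch of how such a prediction file could be produced, reusing the tokenizer tok and a trained model from the earlier sketches; the unlabelled-data file name and the alphabetical label order (matching the column order of pandas get_dummies) are assumptions.

import numpy as np
import pandas as pd

unlabelled = pd.read_csv("data_test_language.csv")      # assumed file name
X_new = np.stack([tok.texts_to_matrix(list(seq), mode="binary")
                  for seq in unlabelled["text"]])

probs = model.predict(X_new)                            # shape: (n_samples, 3)
labels = np.array(["English", "French", "Spanish"])     # get_dummies column order
unlabelled["prediction"] = labels[probs.argmax(axis=1)]
unlabelled.to_csv("prediction_data_test_language.csv", index=False)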
The approach we take resembles the bootstrap procedure for computing confidence intervals of a mean in statistics. Here we "bootstrap" our model by varying the randomized training data. In practice, since our algorithm uses randomly resampled data as model input, the weights and biases of the LSTM change from one training run to the next: a different set of input sequences gives rise to a different LSTM model. Running the procedure N times generates N models and, in turn, N sets of predictions for the unlabelled test set. We can then select the most frequent class for each test instance as the final prediction, which reduces the effect of overfitting and increases confidence relative to a single model. To illustrate, consider Figure 1, which summarizes our results and at the same time demonstrates the pipeline of this approach.
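A sketch of this bootstrapped process is shown below. Here build_model() and resample_training_data() are hypothetical helpers wrapping the code sketched earlier, the number of runs N is arbitrary, and X_new is the tokenized unlabelled set.

import numpy as np

N = 10                                     # number of bootstrap runs (assumed)
all_preds = []
for seed in range(N):
    X_train, y_train = resample_training_data(random_state=seed)  # fresh resample
    m = build_model()                      # same LSTM architecture each run
    m.fit(X_train, y_train, epochs=120, batch_size=40, verbose=0)
    all_preds.append(m.predict(X_new).argmax(axis=1))

preds = np.stack(all_preds)                # shape: (N, n_test_instances)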
Figure 1. A novel approach to confident prediction based on model bootstrapping. The first part of the figure is our unlabelled test set, the second part contains the predictions from n models, and the third part counts the occurrences of each class across all n models, i.e. their frequencies. Within the limited time available, we ran only two models and summarized their prediction counts. Entries with a count of 2 indicate that both models agree, so the corresponding class for that instance is confident; entries with a count of 1 indicate that the two models disagree and at least one of them must be wrong. This notion of confidence is fully quantifiable: with, say, 100 models producing 100 predictions, the level of confidence is the ratio of the most frequent class count to 100. In future analysis, this technique should help reduce overfitting and increase predictive confidence.
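The aggregation step described above can be expressed in a few lines; preds is the (N, n_instances) array of class indices from the previous sketch.

import numpy as np

counts = np.stack([(preds == c).sum(axis=0) for c in range(3)])   # shape: (3, n)
final_label = counts.argmax(axis=0)               # most frequent class per instance
confidence = counts.max(axis=0) / float(preds.shape[0])  # e.g. 2/2 = 1.0, 1/2 = 0.5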
Owing to the limited time available, the strategy we used in the preprocessing stage was to randomly subsample the data so that the three categories are equally represented in the training set. However, this produces a very small training set of only 3360 observations. Several alternatives are worth mentioning. First, acquiring more English data would be beneficial: an enormous amount of English text is available online, and sampling 40-character sequences from it would be straightforward. The new samples would complement our original data set, forming a balanced training set without discarding French or Spanish observations. In the setting of this exercise, an appropriate approach would take the text of Don Quixote and sample sequences from it; an entire copy can be accessed at http://www.donquixote.com/english.html. There are also situations where sampling online sources is not feasible, for example because of legal issues or simply because no labelled data are available. One approach is then to create data by augmenting the existing samples: we could take the 1120 English sequences as our population, randomly reorder them, and each time sample a 40-character sequence of words. This forms an augmented data set from which a balanced training set can be achieved, as sketched below.
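A rough sketch of this augmentation idea, under the assumption that the English segments are available as a plain list of strings; the helper name and its parameters are hypothetical.

import numpy as np

def augment_english(english_segments, n_needed, seq_len=40, seed=0):
    """Shuffle the English segments into one long text and cut random
    seq_len-character windows until n_needed new samples are produced."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(english_segments))
    corpus = " ".join(english_segments[i] for i in order)   # concatenated text
    starts = rng.randint(0, len(corpus) - seq_len, size=n_needed)
    return [corpus[s:s + seq_len] for s in starts]

# e.g. raise English from 1120 samples toward the size of the other classes:
# extra = augment_english(english_list, n_needed=4338 - 1120)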