Wednesday, June 29, 2016

Udacity - Assignment 1 - notMNIST - update

My bad - used Linear Regression instead of Logistic Regression.

Updated answer to Problem 6

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
from sklearn import linear_model

# train_dataset and train_labels come from the merge/randomize steps
# in the original June 28 post below; flatten each 28x28 image into a
# vector of length 784 so sklearn gets a 2D array of samples.
data = train_dataset.reshape((train_dataset.shape[0], -1))

data_5000 = data[:5000]
label_5000 = train_labels[:5000]

model = linear_model.LogisticRegression()
model.fit(data_5000, label_5000)


\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
model.score(data_5000, label_5000)
0.94879999999999998


valid_data = valid_dataset.reshape((valid_dataset.shape[0], -1))
model.score(valid_data, valid_labels)

0.77329999999999999


test_data = test_dataset.reshape((test_dataset.shape[0], -1))
model.score(test_data, test_labels)

0.85729999999999995
 
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
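The problem actually asks for 50, 100, 1000, and 5000 training samples; the same fit/score in a loop looks something like this (a sketch, reusing the reshaped data, valid_data, and label arrays from above):
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
from sklearn import linear_model

# Train on progressively more samples and check accuracy on the validation set.
for n in [50, 100, 1000, 5000]:
    model = linear_model.LogisticRegression()
    model.fit(data[:n], train_labels[:n])
    print(n, 'samples -> validation accuracy:', model.score(valid_data, valid_labels))
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\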


References:
http://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976

Tuesday, June 28, 2016

Udacity - Assignment 1 - notMNIST

I have been slogging my way through this for about a week.
It has been a crash course in Python as well as Tensor Flow.
Here are my thoughts so far and attempts at solving the problems.

****************************
First, we'll download the dataset to our local machine. The data consists of characters rendered in a variety of fonts on a 28x28 image. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k labelled examples and the test set about 19,000. Given these sizes, it should be possible to train models quickly on any machine.
Extract the dataset from the compressed .tar.gz file. This should give you a set of directories, labelled A through J.
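The notebook's download/extract cells handle this step; roughly, the extract part looks like the sketch below (the archive filenames are my assumption, and the real code does more error checking):
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import os
import tarfile

def extract(filename):
    # Strip ".tar.gz" to get the directory the archive unpacks into.
    root = os.path.splitext(os.path.splitext(filename)[0])[0]
    if not os.path.isdir(root):
        with tarfile.open(filename) as tar:
            tar.extractall()
    # One sub-folder per class, labelled A through J.
    return sorted(os.path.join(root, d) for d in os.listdir(root)
                  if os.path.isdir(os.path.join(root, d)))

train_folders = extract('notMNIST_large.tar.gz')   # assumed filename
test_folders = extract('notMNIST_small.tar.gz')    # assumed filename
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\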
Problem 1:
Display a sample of the images downloaded.
My Solution:
The hint was to use IPython.display.  Got it.  
Through some Python trickery I can access the OS - cool.  
\\\\\\\\\\\\\\\\\\\\\\\\\\\
import os
from IPython.display import Image

# Display one image from the first training folder (the "A"s).
Image(filename=os.path.join(train_folders[0], os.listdir(train_folders[0])[8]))

\\\\\\\\\\\\\\\\\\\\\\\\\\\
This displays the image at index 8 (the ninth file listed) in the first folder (the "A"s).
******************************
"Now let's load the data in a more manageable format. Since, depending on your computer setup you might not be able to fit it all in memory, we'll load each class into a separate dataset, store them on disk and curate them independently. Later we'll merge them into a single dataset of manageable size.
We'll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road.
A few images might not be readable, we'll just skip them."
Figuring out the provided code was a BEAST.  Figuring out how to add my own comments was even worse.  What's the difference between # and triple quotes?  Funny how the simplest things trip you up.
Here's the provided code with my comments and questions added:

# These modules are imported in an earlier cell of the notebook; repeated here
# so this snippet stands alone.
import os
import pickle
import numpy as np
from scipy import ndimage

image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.

def load_letter(folder, min_num_images):
  """Load the data for a single letter label."""
  """image_files is a list of all the filenames in the folder"""
  image_files = os.listdir(folder)
  """dataset is an array whose length is the total number of images, each image being 28x28"""
  dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                       dtype=np.float32)
  print(folder)
  num_images = 0
  for image in image_files:
    image_file = os.path.join(folder, image)
    try:
      """this is the normalization step - the formula is [value - (255/2)] / 255"""
      # QUESTION - does this normalize to have approximately zero mean and standard deviation ~.5?
      image_data = (ndimage.imread(image_file).astype(float) -
                    pixel_depth / 2) / pixel_depth
      if image_data.shape != (image_size, image_size):
        raise Exception('Unexpected image shape: %s' % str(image_data.shape))
      """after the normalization, stick the normalized image into the dataset array at the nth position"""
      dataset[num_images, :, :] = image_data
      num_images = num_images + 1
    except IOError as e:
      print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')

  dataset = dataset[0:num_images, :, :]
  if num_images < min_num_images:
    raise Exception('Many fewer images than expected: %d < %d' %
                    (num_images, min_num_images))

  print('Full dataset tensor:', dataset.shape)
  print('Mean:', np.mean(dataset))
  print('Standard deviation:', np.std(dataset))
  return dataset


def maybe_pickle(data_folders, min_num_images_per_class, force=False):
  dataset_names = []
  for folder in data_folders:
    set_filename = folder + '.pickle'
    dataset_names.append(set_filename)
    if os.path.exists(set_filename) and not force:
      # You may override by setting force=True.
      print('%s already present - Skipping pickling.' % set_filename)
    else:
      print('Pickling %s.' % set_filename)
      dataset = load_letter(folder, min_num_images_per_class)
      try:
        with open(set_filename, 'wb') as f:
          """dumps it all into 1 file; it has to be read back in before it can be used"""
          pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
      except Exception as e:
        print('Unable to save data to', set_filename, ':', e)
  """dataset_names is a list of names of pickled files; each pickled dataset is named folder_name.pickle"""
  return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)
test_datasets = maybe_pickle(test_folders, 1800)
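
On the normalization question in my comments above: the formula maps pixel values from [0, 255] into [-0.5, 0.5], so every value is bounded by half, but the actual mean and standard deviation depend on how the pixels are distributed - which is exactly what the Mean/Standard deviation prints report. A quick manual check on one pickled class (a sketch, reusing train_datasets):
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import pickle
import numpy as np

with open(train_datasets[0], 'rb') as f:
    letter_a = pickle.load(f)

# After (x - 255/2) / 255 every value sits in [-0.5, 0.5].
print('min:', letter_a.min(), 'max:', letter_a.max())
print('mean:', np.mean(letter_a), 'std:', np.std(letter_a))
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\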

Problem 2:
Display a sample of the labels and images from the ndarray.
My Solution:
The hint was to use matplotlib.pyplot.  Got it - except for the "pickle" part: the pickled dataset must be LOADED back in before it can be accessed.  That was the trick.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import pickle
import matplotlib.pyplot as plt

# Load the first pickled class file back into memory, then display its first image.
test_dataset_p = pickle.load(open(test_datasets[0], "rb"))
imgplot = plt.imshow(test_dataset_p[0])

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ 
This displays the first image of the first folder ("A"s) for the test_dataset. 
***********************************************
Problem 3
Check that the data is balanced across classes.
Question - is it that the length of the array for each dataset is approximately the same?
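My guess at how to check it, reusing the train_datasets list of pickle filenames from maybe_pickle above (a sketch):
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import pickle

# If the classes are balanced, each per-class .pickle file should hold
# roughly the same number of images.
for pickle_file in train_datasets:
    with open(pickle_file, 'rb') as f:
        letter_set = pickle.load(f)
    print(pickle_file, ':', letter_set.shape[0], 'images')
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\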
**********************************************
Merge and prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune train_size as needed. The labels will be stored into a separate array of integers 0 through 9.
Also create a validation dataset for hyperparameter tuning.
# creates 2 empty arrays - one 3D array for the image data and one 1D array for the labels
# (called once for the validation set and once for the training set, not per class)

def make_arrays(nb_rows, img_size):
  if nb_rows:
    dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
    labels = np.ndarray(nb_rows, dtype=np.int32)
  else:
    dataset, labels = None, None
  return dataset, labels

# merge samples from each class into 1 dataset
def merge_datasets(pickle_files, train_size, valid_size=0):
  num_classes = len(pickle_files)
  valid_dataset, valid_labels = make_arrays(valid_size, image_size)
  train_dataset, train_labels = make_arrays(train_size, image_size)
  # break down how many samples to take from each class
  vsize_per_class = valid_size // num_classes
  tsize_per_class = train_size // num_classes
   
  start_v, start_t = 0, 0
  end_v, end_t = vsize_per_class, tsize_per_class
  end_l = vsize_per_class+tsize_per_class
  # for each pickled class file: open and load it, shuffle it, then pull a fixed
  # number of images from that class into the merged dataset and fill in the
  # corresponding label array

  for label, pickle_file in enumerate(pickle_files):      
    try:
      with open(pickle_file, 'rb') as f:
        letter_set = pickle.load(f)
        # let's shuffle the letters to have random validation and training set
        np.random.shuffle(letter_set)
        # first populate the validation dataset
        if valid_dataset is not None:
          # pull out a certain number of images from the randomized set
          valid_letter = letter_set[:vsize_per_class, :, :]
          # stick them into the dataset which will have data from all classes
          valid_dataset[start_v:end_v, :, :] = valid_letter
          # fill in the array of labels - all the same since they come from the same class
          valid_labels[start_v:end_v] = label
          # increment the pointers to prepare for the next class
          start_v += vsize_per_class
          end_v += vsize_per_class

        # now populate the training dataset: start where the validation set left off,
        # pull out another fixed number of images from the shuffled set, and continue as before
        train_letter = letter_set[vsize_per_class:end_l, :, :]
        train_dataset[start_t:end_t, :, :] = train_letter
        train_labels[start_t:end_t] = label
        start_t += tsize_per_class
        end_t += tsize_per_class
    except Exception as e:
      print('Unable to process data from', pickle_file, ':', e)
      raise
   
  return valid_dataset, valid_labels, train_dataset, train_labels
           
           
train_size = 200000
valid_size = 10000
test_size = 10000

# create 1 dataset and 1 label array for each: validation, training, and testing
valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(
  train_datasets, train_size, valid_size)
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)

print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
print('Testing:', test_dataset.shape, test_labels.shape)


def randomize(dataset, labels):
  # permutes the order of the dataset and corresponding labels
  permutation = np.random.permutation(labels.shape[0])
  shuffled_dataset = dataset[permutation,:,:]
  shuffled_labels = labels[permutation]
  return shuffled_dataset, shuffled_labels

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)

Problem 4:
Show the data is still good after shuffling.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import matplotlib.pyplot as plt

# Show one image from the merged/shuffled training set along with its label.
imgplot = plt.imshow(train_dataset[20])

print(train_labels[20])
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
 Displays the merged/randomized data and the corresponding label (0 = A, etc)
********************************************
By construction, this dataset might contain a lot of overlapping samples, including training data that's also contained in the validation and test set! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but are actually ok if you expect to see training samples recur when you use it. Measure how much overlap there is between training, validation and test samples.
Optional questions:
  • What about near duplicates between datasets? (images that are almost identical)
  • Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.
Problem 5:
Question: the training/validation data comes from notMNIST_large and the testing data from notMNIST_small, so why would there be overlap between those two sets?  And for training vs. validation, each class is shuffled and then validation takes letter_set[0:1000] and training takes letter_set[1000:21000], so again, why would there be overlap?
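
One possible answer is that the archives themselves contain duplicate renderings of the same glyph, so identical images can land on both sides of any split. To actually measure it, one approach is to hash each image and intersect the hash sets (a sketch; hashing raw bytes only catches exact duplicates, not near-duplicates):
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import hashlib

def image_hashes(dataset):
    # Hash the raw bytes of each 28x28 image so exact duplicates collide.
    return set(hashlib.sha1(img.tobytes()).hexdigest() for img in dataset)

train_hashes = image_hashes(train_dataset)
valid_hashes = image_hashes(valid_dataset)
test_hashes = image_hashes(test_dataset)

print('train/valid overlap:', len(train_hashes & valid_hashes))
print('train/test overlap:', len(train_hashes & test_hashes))
print('valid/test overlap:', len(valid_hashes & test_hashes))
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\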

 ***************************************************

Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.
Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.
Optional question: train an off-the-shelf model on all the data!
Problem 6:
My first attempt was this:
lm = linear_model.LinearRegression()  
lm.fit(train_dataset[:50], train_labels[:50])
But I got an error - fit() expects a 2D array of samples, and I handed it a 3D one (each image is itself a 2D array, while the labels are 1D).
Off to search the Internet... returned with victory!
From an example working with images:
"To use this dataset with the scikit, we transform each 8x8 image into a vector of length 64"
I must therefore transform my 28x28 image into a vector of length 784.
QUESTION - what is this doing exactly?
data = train_dataset.reshape((train_dataset.shape[0], -1))
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

data = train_dataset.reshape((train_dataset.shape[0], -1))

lm = linear_model.LinearRegression()

lm.fit(data[:-50], train_labels[:-50])

QUESTION - what does the -50 notation mean?
After I run this, then I get:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
QUESTION - what is this? I get the coef but not the intercept
print (lm.intercept_)
print (len(lm.coef_))

5.49250064066
784
 
lm.predict(data)[0:5] 
array([ 3.48318346,  9.21114963,  4.33268172,  2.78903009,  5.03966479])
print(train_labels[0:5]) 
[4 9 6 2 7]
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

I checked the model predictions against the actual labels - some are close.
Increasing the samples to 5000:
array([ 3.5022907 ,  9.20476316,  4.30405357,  2.79587956,  5.07279521])

Not much better...
Also tried it on the validation and test datasets - similar results.
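
To answer my own questions above (as best I can tell): reshape((n, -1)) keeps the first dimension (the image index) and lets NumPy infer the second, so each 28x28 image becomes a flat vector of length 784. A negative index in a slice counts from the end, so data[:-50] is everything except the last 50 samples - for "50 training samples" I probably wanted data[:50], the first 50 (which is what the updated answer above does with [:5000]). And the LinearRegression(copy_X=True, ...) line is just the repr of the fitted model that fit() returns; lm.intercept_ is the single number printed (5.49...) and lm.coef_ holds one weight per pixel, hence length 784. A tiny sketch of the two notations:
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import numpy as np

# Three fake 2x2 "images" stand in for the 28x28 ones.
x = np.arange(12).reshape(3, 2, 2)

# reshape((n, -1)): keep the image index, let NumPy infer the rest,
# so each 2x2 image flattens to a vector of length 4.
flat = x.reshape((x.shape[0], -1))
print(flat.shape)   # (3, 4)

a = np.arange(8)
print(a[:5])        # the first 5 elements       -> [0 1 2 3 4]
print(a[:-5])       # everything but the last 5  -> [0 1 2]
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\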

References:
http://www.scipy-lectures.org/advanced/scikit-learn/
http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/