It has been a crash course in Python as well as TensorFlow.
Here are my thoughts so far and attempts at solving the problems.
****************************
First, we'll download the dataset to our local machine. The data consists of
characters rendered in a variety of fonts on a 28x28 image. The labels
are limited to 'A' through 'J' (10 classes). The training set has about
500k and the test set 19000 labelled examples. Given these sizes, it
should be possible to train models quickly on any machine.
Extract the dataset from the compressed .tar.gz file. This should give you a set of directories, labelled A through J.
Display a sample of the images downloaded.
My Solution:
Hint was to use IPython.display. Got it.
Through some Python trickery I can access the OS - cool.
\\\\\\\\\\\\\\\\\\\\\\\\\\\
import os
from IPython.display import Image
Image(filename=os.path.join(train_folders[0], os.listdir(train_folders[0])[8]))
\\\\\\\\\\\\\\\\\\\\\\\\\\\
This displays the image at index 8 (the 9th file listed) in the first folder ("A"s)
******************************
"Now let's
load the data in a more manageable format. Since, depending on your
computer setup you might not be able to fit it all in memory, we'll load
each class into a separate dataset, store them on disk and curate them
independently. Later we'll merge them into a single dataset of
manageable size.
We'll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road.
A few images might not be readable, we'll just skip them."
Figuring out the provided code was a BEAST. Figuring out how to add my own comments was even worse. What's the difference between # and triple quotes???? Funny how the simplest things trip you up.
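On the # versus triple quotes question, here is a minimal sketch (the function name `describe` is made up for illustration). A `#` comment is discarded by the interpreter. A triple-quoted string is an ordinary string literal: as the first statement in a function it becomes the docstring, and anywhere else it is evaluated and thrown away, which is why it can pass for a comment.

```python
def describe(folder):
    """Return a short description of a folder name."""
    # A hash comment: the interpreter ignores this line entirely.
    """A bare string later in the body: evaluated and discarded, not a docstring."""
    return "folder: " + folder

# Only the first string in the function body is stored as documentation.
print(describe.__doc__)
print(describe("A"))
```

So both styles work as markups, but only `#` is a true comment; a stray triple-quoted string still has to be indented like any other statement.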
Here's the provided code with my markups/questions (won't run):
image_size = 28  # Pixel width and height.
pixel_depth = 255.0  # Number of levels per pixel.

def load_letter(folder, min_num_images):
    """Load the data for a single letter label."""
    """image_files is an array of all the filenames"""
    image_files = os.listdir(folder)
    """dataset is an array of length being the total number of images, and each
    image is 28x28"""
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                         dtype=np.float32)
    print(folder)
    num_images = 0
    for image in image_files:
        image_file = os.path.join(folder, image)
        try:
            """this is the normalization step - the formula is [value-(255/2)]/255"""
            # QUESTION - does this normalize to have approximately zero mean and standard deviation ~.5?
            image_data = (ndimage.imread(image_file).astype(float) -
                          pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise Exception('Unexpected image shape: %s' % str(image_data.shape))
            """after the normalization, stick the normalized image into the dataset array
            at the nth position"""
            dataset[num_images, :, :] = image_data
            num_images = num_images + 1
        except IOError as e:
            print('Could not read:', image_file, ':', e, '- it\'s ok, skipping.')
    dataset = dataset[0:num_images, :, :]
    if num_images < min_num_images:
        raise Exception('Many fewer images than expected: %d < %d' %
                        (num_images, min_num_images))
    print('Full dataset tensor:', dataset.shape)
    print('Mean:', np.mean(dataset))
    print('Standard deviation:', np.std(dataset))
    return dataset

def maybe_pickle(data_folders, min_num_images_per_class, force=False):
    dataset_names = []
    for folder in data_folders:
        set_filename = folder + '.pickle'
        dataset_names.append(set_filename)
        if os.path.exists(set_filename) and not force:
            # You may override by setting force=True.
            print('%s already present - Skipping pickling.' % set_filename)
        else:
            print('Pickling %s.' % set_filename)
            dataset = load_letter(folder, min_num_images_per_class)
            try:
                with open(set_filename, 'wb') as f:
                    """dumps it all into 1 file, need to read back to use it"""
                    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
            except Exception as e:
                print('Unable to save data to', set_filename, ':', e)
    """dataset_names is an array of names of pickled files, the name of the pickled dataset is the
    folder_name.pickle"""
    return dataset_names

train_datasets = maybe_pickle(train_folders, 45000)
test_datasets = maybe_pickle(test_folders, 1800)
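On the normalization QUESTION, here is a small check with NumPy, using a made-up image rather than the real data. The formula maps the pixel range 0..255 onto -0.5..0.5, so the standard deviation can never exceed 0.5, and the mean comes out near zero only when the average pixel value is near 127.5.

```python
import numpy as np

pixel_depth = 255.0

def normalize(pixels):
    # same formula as in load_letter: (value - 255/2) / 255
    return (pixels.astype(float) - pixel_depth / 2) / pixel_depth

# hypothetical 28x28 image: left half black (0), right half white (255)
image = np.zeros((28, 28), dtype=np.uint8)
image[:, 14:] = 255

normalized = normalize(image)
print(normalized.min(), normalized.max())   # extremes of the new range
print(normalized.mean(), normalized.std())  # statistics for this made-up image
```

For the real images, the `Mean:` and `Standard deviation:` lines printed by `load_letter` show the actual values for each class.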
Problem 2:
Display a sample of the labels and images from the ndarray.
My Solution:
Hint was to use matplotlib.pyplot. Got it. Except for the "pickle" part - it must be LOADED before it can be accessed. That was the trick.
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import pickle
import matplotlib.pyplot as plt
# the pickled array must be loaded from disk before it can be indexed
with open(test_datasets[0], "rb") as f:
    test_dataset_p = pickle.load(f)
imgplot = plt.imshow(test_dataset_p[0])
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
This displays the first image of the first folder ("A"s) for the test_dataset.
***********************************************
Problem 3
Check that the data is balanced across classes.
Question - is it that the length of the array for each dataset is approximately the same?
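One way to read "balanced" is exactly that: each class contributes roughly the same number of images. A minimal sketch, assuming made-up arrays in place of the real pickled ones (with the real data, each array would come from `pickle.load` on a file in `train_datasets`):

```python
import numpy as np

# hypothetical stand-ins for the per-letter arrays loaded from the pickle files
letter_sets = {
    "A": np.zeros((1000, 28, 28), dtype=np.float32),
    "B": np.zeros((990, 28, 28), dtype=np.float32),
    "C": np.zeros((1010, 28, 28), dtype=np.float32),
}

# the first axis is the image index, so len() gives the number of images
counts = {letter: len(images) for letter, images in letter_sets.items()}
mean_count = np.mean(list(counts.values()))

for letter, count in counts.items():
    # relative deviation of this class from the average class size
    print(letter, count, "%+.1f%%" % (100.0 * (count - mean_count) / mean_count))

# largest class relative to smallest; a value close to 1 means balanced
imbalance = max(counts.values()) / min(counts.values())
print("max/min ratio:", imbalance)
```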
**********************************************
Merge and prune the training data as needed. Depending on your computer setup, you
might not be able to fit it all in memory, and you can tune train_size as needed.
The labels will be stored into a separate array of integers 0 through 9.
Also create a validation dataset for hyperparameter tuning.
# creates 2 blank arrays for each class - 1 3D array for the data, and 1 array for the labels
def make_arrays(nb_rows, img_size):
    if nb_rows:
        dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
        labels = np.ndarray(nb_rows, dtype=np.int32)
    else:
        dataset, labels = None, None
    return dataset, labels

# merge samples from each class into 1 dataset
def merge_datasets(pickle_files, train_size, valid_size=0):
    num_classes = len(pickle_files)
    valid_dataset, valid_labels = make_arrays(valid_size, image_size)
    train_dataset, train_labels = make_arrays(train_size, image_size)
    # breakdown how many samples to take from each class
    vsize_per_class = valid_size // num_classes
    tsize_per_class = train_size // num_classes
    start_v, start_t = 0, 0
    end_v, end_t = vsize_per_class, tsize_per_class
    end_l = vsize_per_class+tsize_per_class
    # for each pickled file in a group (training or validation), open and load each pickled file and randomize it,
    # then pull out a fixed number from each class and put it in the merged dataset, and populate the corresponding
    # label array
    for label, pickle_file in enumerate(pickle_files):
        try:
            with open(pickle_file, 'rb') as f:
                letter_set = pickle.load(f)
                # let's shuffle the letters to have random validation and training set
                np.random.shuffle(letter_set)
                # first populate the validation dataset
                if valid_dataset is not None:
                    # pull out a certain number of images from the randomized set
                    valid_letter = letter_set[:vsize_per_class, :, :]
                    # stick them into the dataset which will have data from all classes
                    valid_dataset[start_v:end_v, :, :] = valid_letter
                    # fill in the array of labels - all the same since from the same class
                    valid_labels[start_v:end_v] = label
                    # increment the pointers to prepare for the next class
                    start_v += vsize_per_class
                    end_v += vsize_per_class
                # now populate the training dataset
                # start where the v_set left off, and pull out another set number of images from the randomized set, cont. as before
                train_letter = letter_set[vsize_per_class:end_l, :, :]
                train_dataset[start_t:end_t, :, :] = train_letter
                train_labels[start_t:end_t] = label
                start_t += tsize_per_class
                end_t += tsize_per_class
        except Exception as e:
            print('Unable to process data from', pickle_file, ':', e)
            raise
    return valid_dataset, valid_labels, train_dataset, train_labels

train_size = 200000
valid_size = 10000
test_size = 10000
# create 1 dataset and 1 label array for each: validation, training, and testing
valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(
    train_datasets, train_size, valid_size)
_, _, test_dataset, test_labels = merge_datasets(test_datasets, test_size)
print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)
print('Testing:', test_dataset.shape, test_labels.shape)

def randomize(dataset, labels):
    # permutes the order of the dataset and corresponding labels
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation,:,:]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)
valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)
Problem 4:
Show data is still good after shuffling
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
imgplot = plt.imshow(train_dataset[20])
print(train_labels[20])
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
Displays the merged/randomized data and the corresponding label (0 = A, etc)
********************************************
By construction, this dataset might contain a lot of overlapping samples, including training data that's also contained in the validation and test set! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but is actually ok if you expect to see training samples recur when you use it. Measure how much overlap there is between training, validation and test samples.
Optional questions:
- What about near duplicates between datasets? (images that are almost identical)
- Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.
Question
# the training/validation data comes from notMNIST_large, and the testing data from notMNIST_small - why would there be
# overlap between these two sets? As for the overlap between training and validation, the dataset is randomized, and
# then for each class validation is [0:1000] and train is [1000:21000] - again, why would there be overlap?
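Splitting by position guarantees that no array index is shared between the sets, but it says nothing about two different files holding identical pixels. To put a number on that, one possible approach (my assumption, not the assignment's reference solution) is to hash the raw bytes of every image and look each hash up in the other set. This only catches exact duplicates, and the small random arrays below are made up to stand in for train_dataset and test_dataset.

```python
import hashlib
import numpy as np

def image_hash(image):
    # SHA-1 digest of the image's raw bytes; identical pixels give identical digests
    return hashlib.sha1(np.ascontiguousarray(image).tobytes()).hexdigest()

def count_overlap(dataset_a, dataset_b):
    # number of images in dataset_b whose exact pixels also appear in dataset_a
    hashes_a = {image_hash(image) for image in dataset_a}
    return sum(image_hash(image) in hashes_a for image in dataset_b)

rng = np.random.RandomState(0)
train = rng.rand(100, 28, 28).astype(np.float32)
# hypothetical test set: 5 images copied from train plus 15 new random ones
test = np.concatenate([train[:5], rng.rand(15, 28, 28).astype(np.float32)])

print("exact duplicates of training images in test:", count_overlap(train, test))
```

Near duplicates (the optional question) would need a distance measure between images rather than an exact hash.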
***************************************************
Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.
Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.
Optional question: train an off-the-shelf model on all the data!
Problem 6:
My first attempt was this:
lm = linear_model.LinearRegression()
lm.fit(train_dataset[:50], train_labels[:50])
But I got an error: it was expecting a 2D array and I gave it a 3D one (each image is itself a 2D array, so the stack of images has 3 dimensions; the labels are 1D).
Off to search the Internet... returned with victory!
From an example working with images:
"To use this dataset with the scikit, we transform each 8x8 image into a vector of length 64"
I must therefore transform my 28x28 image into a vector of length 784.
QUESTION - what is this doing exactly?
data = train_dataset.reshape((train_dataset.shape[0], -1))
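On the QUESTION above: `reshape((n, -1))` keeps the first axis (one row per image) and lets NumPy work out the second, so every 28x28 image is flattened into a row of 784 values. A small sketch with a made-up array in place of train_dataset:

```python
import numpy as np

# hypothetical stand-in for train_dataset: 3 images of 28x28 pixels
images = np.arange(3 * 28 * 28, dtype=np.float32).reshape(3, 28, 28)

# -1 tells NumPy to infer that dimension from the total size: 28 * 28 = 784
flat = images.reshape((images.shape[0], -1))

print(images.shape)  # (3, 28, 28)
print(flat.shape)    # (3, 784)

# each image is laid out row after row: pixel (r, c) lands at index r * 28 + c
print(flat[1, 5 * 28 + 7] == images[1, 5, 7])  # True
```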
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
data = train_dataset.reshape((train_dataset.shape[0], -1))
lm = linear_model.LinearRegression()
lm.fit(data[:-50], train_labels[:-50])
# QUESTION - what does the -50 notation mean?
# After I run this, then I get:
# QUESTION - what is this? I get the coef but not the intercept
print(lm.intercept_)
print(len(lm.coef_))
# 5.49250064066 784
lm.predict(data)[0:5]
print(train_labels[0:5])
# [4 9 6 2 7]
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
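On the -50 QUESTION: a negative index counts from the end, so `data[:-50]` means everything except the last 50 rows, whereas `data[:50]` means the first 50 rows. A tiny sketch with a made-up array:

```python
import numpy as np

data = np.arange(200)  # hypothetical stand-in for 200 samples

first_50 = data[:50]          # the first 50 samples
all_but_last_50 = data[:-50]  # everything except the last 50 samples
last_50 = data[-50:]          # only the last 50 samples

print(len(first_50), len(all_but_last_50), len(last_50))  # 50 150 50
```

So `lm.fit(data[:-50], train_labels[:-50])` trains on nearly the whole training set; training on 50 samples would be `data[:50]` with `train_labels[:50]`.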
I checked the model prediction vs the actual labels - some are close
Increasing the samples to 5000:
array([ 3.5022907 , 9.20476316, 4.30405357, 2.79587956, 5.07279521])
Not much better...
Also tried on valid dataset and test datasets, similar results
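The assignment's hint points at `LogisticRegression`, which predicts a class label directly, whereas `LinearRegression` returns a real number that only loosely tracks the label. A minimal sketch on made-up data (three synthetic "letters", each a bright square in a different place), assuming scikit-learn is installed; the helper `make_fake_letters` and the variable names are illustrative and not from the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

def make_fake_letters(samples_per_class):
    # three hypothetical classes: noise plus a bright 8x8 square at a class-specific spot
    images, labels = [], []
    for label, (row, col) in enumerate([(2, 2), (10, 18), (18, 6)]):
        batch = rng.normal(0.0, 0.1, size=(samples_per_class, 28, 28))
        batch[:, row:row + 8, col:col + 8] += 0.5
        images.append(batch)
        labels.append(np.full(samples_per_class, label))
    return np.concatenate(images), np.concatenate(labels)

train_images, train_y = make_fake_letters(100)
test_images, test_y = make_fake_letters(50)

# flatten each 28x28 image into a vector of 784 features
train_X = train_images.reshape((train_images.shape[0], -1))
test_X = test_images.reshape((test_images.shape[0], -1))

model = LogisticRegression(max_iter=1000)
model.fit(train_X, train_y)

print(model.predict(test_X[:5]))                 # integer class labels, not floats
print("accuracy:", model.score(test_X, test_y))  # fraction of correct predictions
```

With the real data the same pattern would apply to `train_dataset[:50]`, `[:100]`, `[:1000]` and `[:5000]` after flattening, scored against the flattened test set.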
References:
http://www.scipy-lectures.org/advanced/scikit-learn/
http://bigdataexaminer.com/uncategorized/how-to-run-linear-regression-in-python-scikit-learn/