\title{\textbf{Achieving High Mean Accuracy with Semi-Supervised Learning using Small Number of Labeled Observations}\\[0.3in]}
\author{
        Submitted by: Sayed Waleed Qayyumi \\
               School of Computing, Engineering and Mathematics\\
        Western Sydney University\\
        Parramatta, NSW 2150, \underline{Australia}\\}
\date{\today}
\documentclass[12pt]{article}
\usepackage[toc,page]{appendix} 
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{lineno,hyperref}
\usepackage{multirow}
\usepackage[outdir=./]{epstopdf}
\usepackage[export]{adjustbox}
\usepackage{subfig}
\usepackage[margin=1in]{geometry}
\usepackage{siunitx}
\usepackage{array}
\usepackage{subfig}
\usepackage{color, colortbl}
\usepackage[section]{placeins}
\usepackage{newunicodechar}
\usepackage[utf8]{inputenc}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{algorithmic}
\usepackage{pdfpages}
\newcolumntype{L}[1]{>{\raggedright\arraybackslash}p{#1}}
\DeclareRobustCommand{\rchi}{{\mathpalette\irchi\relax}}
\newcommand{\irchi}[2]{\raisebox{\depth}{$#1\chi$}} % inner command, used by \rchi
\DeclareMathOperator*{\argmaxA}{arg\,max} % Jan Hlavacek
\begin{document}
\maketitle
\begin{center}
\textbf{PhD Candidature\\ CONFIRMATION REPORT}\\[0.5in]
\end{center}
\textbf{Supervisors:}\\
Dr Laurence Park\\
Associate Professor Oliver Obst\\[0.2in]
School of Computing, Engineering and Mathematics\\
        Western Sydney University\\
        Parramatta, NSW 2150, \underline{Australia}\\[1.6in]
\begin{center}
\textbf{Confirmation Report in Fulfilment of Confirmation of PhD Candidature}\\
\end{center}
\tableofcontents
\clearpage
\section{EXECUTIVE SUMMARY}
Classification is a field of machine learning that has to do with grouping data based on a specific criteria. This is also referred to as supervised learning. Technology has made data collection easy and with the abundance of data that is now available for usage, classification methods are becoming more important for decision making. While, we use complex algorithms to classify items, it is something that we humans do naturally. In machine learning, classification algorithms have very wide usage and can be applied to many daily life problems. Image classification, Music identification, Event classification and Speech tagging are some of the important application of classification algorithms. Many simple and complex classification algorithms such as Linear Regression, Perceptron, Naive Bayes Classifier, Decision Trees, Neural Networks and  Support Vector Machines are already used extensively in various fields of our daily life. While, these techniques have enriched our ways of work, there are many challenges that are still unsolved. High mean accuracy, availability of labeled data and dimensional distribution of data are some of the main challenges in this field. 
From a user point of view, achieving a high mean accuracy is always the first preference. The cost of using classification algorithms can only be justified if we achieve a high mean accuracy. For example, predicting a payment system failure for a bank, with high accuracy could save a bank big amount of money and human resource time but, the challenge in this type of problems is the availability of labeled data. There are many such scenarios where accurate classification or prediction is required but it is not achievable for various reasons. Similarly, it is not always possible to have many labeled data point for training classification model thus we require classification algorithms that can deal with small number of labeled data and returns a high mean accuracy. 
Another challenge with classification algorithms is the manifold assumption. Classification algorithms assume that data lies in the low manifold and hence fails to achieve high mean accuracy when used with a manifold distributed data. 
The objective of our research is to optimise or extend existing methods so that they can achieve high mean of accuracy using small number of labeled observations. We are also intending to introduce new methods that can classify a manifold distributed data with small number of labeled observations and still achieve high mean of accuracy. 
Our research will focus mainly on semi-supervised learning. To achieve the overall research objectives, three different categories of classification will be discussed in detail in this report. This will be followed by our research and contribution to the mentioned fields.
\begin{enumerate}
\item Supervised learning, which is classification of new observations using full labeled observations.
\item Unsupervised learning or clustering, is classification of observations that has no label information.
\item Semi-supervised learning, which is classification of new observation using limited number of labeled observations.
\end{enumerate}
The intention is to build over the existing state of art and where applicable come up with new ideas and algorithms.
\section{INTRODUCTION}
Classification is allocating objects in to classes and groups~\cite{dahlberg2015wissensorganisation}. Classification or supervised learning supposes a relationship $\mathrel{R}$ between elements of a set $E$. Classification remained the work of natural scientist for long and was used by them to classify different organism in to different classes. Libraries used various classification methods to classify books in to various groups. While classification was used by natural scientists in libraries, it got more attraction in 1960s when a Belgian and Polish mathematician~\cite{apostel1963probleme} published a paper on the subject of classification. This was followed by a book from Birkhoff~\cite{birkhoff1967lattice}. The Introduction of Numerical Taxonomy by P. H. Sneath and R. R. Sokal~\cite{sneath1973numerical} followed suit. Picking up momentum in 1970s, classification emerged as one of the most researched and valuable field of mathematics and statistics. Later, machine learning gave classification a new boost and introduced practical usage of classification outside mathematics and statistics. Now, classification has a very wide usage in our day to day life. Simple tasks as classification of product to product groups or complex tasks like comparing different traits of organism in taxonomy, use classification algorithms. Machine learning also introduced many new usages for these algorithms. For example, assigning an email to spam or not a spam class or diagnosing a patient's disease based on observed data from other patients. 
Classification is a form of pattern recognition and is also referred to as supervised learning. An algorithm that can classify is called classifier. While we have come a long way with respect to improving the classification algorithm's performance, we still have many challenges in this field. Mean accuracy of classification, unavailability or partial availability of labeled data and the manifold distribution of data are some of the challenges that we still need to resolve. Normally, a classifier's performance is partially dependent on the accuracy and availability of labeled data. The bigger the size of labeled training dataset, the better the classifier. Unfortunately, we do not have labeled data available in most of the scenarios or it is too expensive to label. Hence, there is a big demand for classifier that can classify using small number of labeled observation with a high mean accuracy. 
In machine learning, classification or supervised learning is about learning a mapping function between input and output variables. Imagine an input variable $X$ and an output variable $Y$, we map a function between input and output as per below.
\begin{equation}
Y = \hat{f} (X)
\end{equation}
The objective in here is approximation of the mapping function $f$. If the mapping function $f$ is well approximated; an output variable $Y$ can be predicted with high accuracy based on new input $X$. The process is called supervised learning as the mapping function is supervised to learn using training data. The learning only stops when the performance reaches a specific acceptable threshold. Supervised learning is divided in to two types; Classification and Regression. The difference between these two methods is in the output they predict. Classification is prediction of categorical variables while Regression is prediction of numerical variables. 
Contrary to supervised learning, in unsupervised learning there is no output variable $Y$. Input variable $X$ is used to understand the underlying data structure followed by clustering the same data in to groups.
In semi-supervised learning, we have input variable $X$ and very limited instances of output variable $Y$. Semi-supervised learning sits between supervised learning and unsupervised learning. The input variable $X$ and available label $Y$ is used alongside the unlabeled observations to determine a classifier function. A good example of semi-supervised learning is labeling photos in an archive. In such archives, some photos are labeled and majority are unlabeled. It is quite expensive and time-consuming to label unlabeled observations and hence most of machine learning algorithms fail in this type of classification. This is why semi-supervised learning is considered more challenging than supervised and unsupervised learning. 
Taking this as a challenge, our research and contribution will be mainly in the area of semi-supervised learning. Our focus will be to optimise existing techniques or introduce new semi-supervised techniques.
\section {LITERATURE REVIEW}
The literature review will begin with the review of current state of supervised, unsupervised and semi-supervised learning. This will be followed by a detail review of each mentioned method. As our research focus is semi-supervised learning, the semi-supervised section will include our research and contributions as well. We are specifically looking at the graph based approaches to semi-supervised learning. Research on graph based semi-supervised learning has shown promising results. These methods use a concept called random walk over graph. Graph based methods are good with classification of a manifold distributed data as well as classification using very small number of labeled observation due to the way they calculate the distance. The aim of studying graph based methods is to discover the reasons for graph based methods high mean accuracy using small number of labeled observations. Follow on from this study, focus will be mainly on image classification. Image classification is a challenging field of research. While image classification / identification is a very active area of research and convolutional networks have performed very well, there has been no major break throughs in this area as yet. 
   
\subsection{Supervised Learning}
\label {supervised}
In supervised learning, model learns from labeled datasets and then apply that learning to label a new data point. Supervised learning~\cite{bodo2008supervised} is divided in to two types i.e. Classification and Regression. The difference between classification and regression is the type of output. Classification predicts a class label of an unlabeled observation while regression predicts a numerical label. The objective is to come up with a mapping from $x$ to $y$ given a training set that is made up of $(x_{i},y_{i})$. $y_{i}$ is the label of $x_{i}$. The function that is created using training set can be evaluated through prediction of the same function on a test set. Supervised learning algorithms are generally divided in to two families i.e. generative and discriminative algorithms. The generative models~\cite{fujino2008semisupervised, kingma2014semi} learn the joint probability distribution $p(x,y)$ while discriminative models learn the conditional probability distribution $p(x|y)$ ~\cite{chapelle2009semi}.
In discriminative models, to predict label $y$ from a new observation $x$, we have to evaluate the class using below equation. The equation choses the most likely class for $y$ given $x$.
\begin{equation}
f(x) = \argmaxA_yp(y|x)
\end{equation}
We know from Bayes rule that $p(y|x) = \frac{p(x|y)p(y)}{p(x)}$. Replacing the $p(x|y)$ in the discriminative model equation with Bayes rule will result in to below equation. As the equation is only interested in $\argmaxA$, denominator is constant, and so can be disregarded when identifying the argmax.
\begin{equation}
f(x) = \argmaxA_yp(y|x)p(x)
\end{equation}
This is the equation used in generative model. In discriminative models, conditional probability distribution $p(x|y)$ is used to model the class boundary while the joint probability distribution models the actual distribution of each class. Given a label $y$, generative models can generate its respective $x$ and that is why they are called generative models.
\subsection{Unsupervised Learning}
Unsupervised learning helps in identification of unknown structures in data and does not require data to have labels. There are many different unsupervised methods available but the most common one is clustering. Clustering is to group or segment dataset based on common attributes or behaviours~\cite{larose2014discovering,tu2014novel}. Unsupervised learning is mainly used to estimate density in a dataset~\cite{dayan1999unsupervised}. As mentioned earlier, supervised methods use the information of conditional probability distribution $p(x|y)$ where $y$ is the label and $x$ is the input data; unsupervised learning works based on apriori probability distribution $p(x)$. 
\subsection{Semi-Supervised Learning}
\label{ssl}
Semi-supervised learning~\cite{cohen2004semisupervised} also called (SSL) is a type of learning that lies between supervised and unsupervised learning. Semi-supervised learning uses both labeled and unlabeled data with the intent to form higher accuracy models when compared to those using only labeled data~\cite{bodo2008supervised,cohen2004semisupervised}. Mathematically, it can be presented as, $X = (x_{i})_{i\in[n]}$ which is divided in to labeled data $X_{l} =(x_{1},....,x_{l})$ that has labels $Y_{l} = (y_{1},....,y_{l})$ and unlabeled data $X_{u} =(x_{l}+1,....,x_{l}+_{u})$ where no label information is provided. One of the questions that is normally asked regarding semi-supervised learning is, does using the labeled and unlabeled data provide additional accuracy when compared to using the labeled data alone? In short, the answer is yes but it is very strongly correlated to the distribution of labeled dataset and its relevance to the classification problem it is trying to solve~\cite{chapelle2009semi}. In other words, the $p(x)$ from unlabeled dataset should be useful in inference of $p(y|x)$ from labeled dataset for a semi-supervised method to perform better than the supervised method. If above is not the case, semi-supervised learning may degrade the mean accuracy of classification. Semi-supervised learning algorithms share some common assumptions~\cite{peikari2018cluster}. The methods can provide high accuracy only when these assumptions hold true. Some of the common assumptions used by semi-supervised learning is listed down below.
\begin{enumerate}
\item Smoothness Assumption: If two observations $x_{1}$ and $x_{2}$ are close to each other then so should be their respective labels $y_{1}$ and $y_{2}$.
\item Cluster Assumption: If two points $x_{1}$ and $x_{2}$ share the same cluster then most probably they have same class.
\item Manifold Assumption: High dimensional data lies on a low dimensional manifold.
\end {enumerate}
As mentioned earlier, most of the semi-supervised methods work only if all of the above mentioned assumptions hold true. Graph based methods, explained in section~\ref{graph} have the flexibility and capacity to avoid one or more of these assumptions and still achieve a high mean accuracy of classification. This is also our active area of research. Our existing work on this subject which is ready for publication is presented at the end of section~\ref{graph}.
\subsubsection{Self-Learning}
The idea of using unlabeled observations along with labeled observations to improve prediction is not new~\cite{mcclosky2006effective, tanha2017semi}. There was always a need to use the available unlabeled observations to enrich training or in other words improve mean accuracy of classification. This desire has led researchers to use the unlabeled observations in many different ways. Self-learning~\cite{thoits1985self,grimaudo2014select} is also referred to as self-training, self labeling in different texts~\cite{chapelle2009semi}, is a very simple approach to use unlabeled observations. Self-training is a wrapper algorithm that can use different supervised learning methods. Self-training is initiated using a labeled dataset and model is trained using the labeled dataset. The model is then applied to some of the unlabeled observations and label is predicted. The prediction is retained which is then used to train the model again and the process is repeated until all the unlabeled observations are labeled. The accuracy of self-learning is dependent on the supervised algorithm used and criteria adopted. The criteria could be either risk minimization~\cite{grandvalet2005semi} or margin maximization~\cite{gronlund2019optimal}. In margin maximization, the decision boundary is extended due to unlabeled observations. Self-learning does not guarantee an improvement in accuracy of the model and hence contradicts the whole objective of using unlabeled observations. We performed an experiment on Iris dataset to validate the fact that self-learning with margin maximization does not guarantee an improvement in classification mean accuracy. We choose $k$NN as the supervised method in our experiment. $k$NN algorithm forms a majority vote between the $k$ most similar observations and an unseen observation. Similarity is defined by a distance metric such as Euclidean distance. \\
kNN (dataset, sample)\\
\{
 \begin{enumerate}
 \item Go through each item in my dataset and calculate the distance from that data item to my specific sample.
 \item Classify the sample as the majority class between K samples in the dataset having minimum distance to the sample.
\end{enumerate}
\}\\\\
If we consider X as the matrix of features for an observation and Y is the class label, kNN looks at k (a positive integer) and estimates the probability of it belonging to class j for a test observation $x_0$
\begin{equation}
Pr(Y=j|X=x_0)=\frac{1}{k}\sum_{i\in \mathcal{N}_0}I(y_i=j)
\end{equation}
Self-learning $k$NN~\cite{cover1967nearest} involves predicting class of unlabeled observations. A model is trained using $k$NN and a training set TR. Unlabeled set UD is then labeled using the predicted class. The unlabeled points that have high confidence of prediction are then added to training set TR along with their predicted class. The extended training set ETR = TR+UD is then used to classify other unlabelled points. See Table \ref{tab:results} for the impact of adding unlabeled points on mean accuracy of prediction using $k$NN. Mean accuracy of prediction is calculated before and after the addition of unlabeled points. Mean accuracy is calculated by taking the mean of accuracy of random observations of 30 different iterations. It is very obvious from the results that self-learning may or may not improve mean accuracy~\cite{longstaff2010improving}.
\begin{table}
\caption{Impact of unlabeled data points on mean accuracy of classification - Iris dataset}
\label{tab:results}
\centering
\begin{tabular}{l l l l l }
\toprule
$\#_{TR}$ &$\phi_{TR} acc$ &$\#_{UD}$ &$\#_{ETR}$ &$\phi_{ETR}acc$\\
\midrule
40 &0.963& 20 &60 &0.944\\
20 &0.944& 20 &40 &0.973\\
25 &0.95& 25 &50 &0.95\\
15 &0.958& 15 &30 &0.925\\
30 &0.944& 30 &60 &0.944\\
40 &0.95& 30 &70 &0.975\\
\bottomrule
\end{tabular}
\end{table}
\subsubsection{Graph Based Methods}
\label {graph}
Research in graph based methods is highly active area in semi-supervised learning. A graph is constructed where each node in the graph is associated to a data point and edges between nodes exist if the associated pair of points are neighbours (with respect to some distance threshold). The edges of the graph are weighted to indicate the inverse distance (or similarity) between the associated two data points. The distance between any two points in the data is calculated by minimizing the path distance over all possible paths connecting the two points. Graph distance is used as an approximation of geodesic distance between two points on a manifold. Let's assume a graph $G(V,E)$ where $V$ is the set of graph vertices and $E$ is the set of graph edges. Weight is represented by $w(e_{ij})$ for each $e_{ij} \in E$. The edge weight $w(e_{ij})$ represents the local similarity between points $i$ and $j$. The weight matrix of this graph $G$ is defined by 
\begin{equation}
\text{$w(e_{ij})$ if $e_{ij} \in E, 0$ otherwise}
\end{equation}
As described in section~\ref{ssl}, semi-supervised learning is mostly useful when we have a high proportion of unlabeled observations. At times the cost of labeling observation is too high or requires a lot of effort.  Thus it makes sense to use unlabel observations. We need a high proportion of unlabeled observations, to be able to use them to increase accuracy of classification. This is due to the fact that unlabeled observations contain less information as compared to labeled ones. Processing of unlabeled observations is computationally expensive due to high volumes.  This in turn requires faster semi-supervised algorithms to process the large number of unlabeled data. Graph based algorithms use graphs to calculate the similarity between two data points. The graph creation process is computationally expensive~\cite{boykov2001interactive}. It is not feasible to create a graph for each labeled observation to identify the neighbourhood. Thus most of the graph based algorithms available today struggle with the cost of processing large volume of data~\cite{liu2006graph}. This means graph based methods are hard to implement and not efficient in real life problems. As part of solving this problem we have introduced a new method of using graph that is very light on compute cost while still takes advantage of graph based similarity calculation. We have called our method $cgk$NN which is explained in detail in section~\ref{cgknn}.
$cgk$NN works with small number of labeled observations and produces high mean accuracy with lower computational cost. Our method can be classified as a graph based semi-supervised learning method. Graph based algorithms build graphs whose nodes are labeled and unlabeled data points. Labeled data points are used to spread information to unlabeled data points. In these methods, graphs $g=(V,E)$ where $V$ represent a node and $E$ an edges is used to represent the geometry of the data. $E$ represents similarity between two edges. Similarities are shown using a weight matrix $W$.
\subsection{Image Classification}
Image classification refers to processes that are used to identify objects within an image or classify image in to a group. Huge amount of data is getting generated daily using image sharing apps and platforms. Companies are interested in getting value out of their huge repositories of digital data to deliver better and smarter services to their customers.
Image classification and recognition~\cite{cirecsan2012multi} is part of computer vision which is a very active area of research and is driving many fields. In automotive industry, Image recognition is now used for developing driverless cars. In gaming industry, image classification and recognition is used to develop the next generation of games. Facebook's famous face recognition algorithm can recognise face with 98\% accuracy. In short, image recognition and classification can drive automation in many industries. 
\subsubsection{Deep Learning}
Deep learning is one of the widely used method for Image classification~\cite{goodfellow2016deep}. While neural networks are covered in section~\ref{nn}, we are covering the part that is relevant to image classification. Similar to a brain structure, neural networks are a group of neurons that are connected to each other. Nodes within neural networks are referred to as neurons. Each neuron takes an input, process it and then passes its output to next neuron. The output of previous neuron is an input to the next layer of neurons till the final output is generated. Neural network algorithms are trained using training datasets which is a set of labeled images and then new images are feed to the same algorithm for classification. Computers store images as metrics and hence can compare them easily. Another approach will be to convert images to numeric data for comparison. There are two common methods to convert an image to numerical dataset. 
\begin{enumerate}
\item Greyscale: Image is converted to greyscale. Greyscale conversion means that the picture will be converted to shades of color from white to black. Computer then assigns a number to each pixel based on how dark the pixel is. Please refer to Figure~\ref{fig:example} for the conversion of image data to equivalent numerical data.
\item RGB: Colors can also be presented as RGB colors. Computers assigns a value to each pixel depending on the RGB value of each pixel. 
\end{enumerate}
Deep learning is often compared to human mind. There is an understanding that the field will continue to advance and enter many other domains. This has caused the fear that this may result in unemployment and even to slavery. There is no doubt that deep learning are efficient in performing tasks but they are not mean to solve all problems. There are many limitations and challenges with deep learning that prevents it from competing with other technologies. Gary Markus in~\cite{marcus2018deep} lists down the challenges with deep learning. He says ''In a world with infinite data, and infinite computational resources, there might be little need for any other technique''. We know well that we do not live in such world. We shall never end up with a labeled sample for every problem space in deep learning. Thus, we will have to generalise. Deep learning cannot learn from abstractions and only works best when there are labeled samples available. Deep learning has a heavy reliance on availability of correct and large number of labeled data. Gary also compares deep learning to a black box that learn correlation or patterns by shifting different data points and combinations. It is quite complex to decode the work of a neural network. This limits its usage in the fields where humans might want to know how a system makes a specific decision.  
\begin{figure}%
    \centering
    \subfloat[An image of number eight]{{\includegraphics[width=5cm]{8.PNG} }}%
    \qquad
    \subfloat[Greyscale equivalent]{{\includegraphics[width=5cm]{8N.PNG} }}%
    \caption{How computers translate images to greyscale}%
    \label{fig:example}%
\end{figure}
\subsection{Metric Learning}
Distance metric learning is a process whereby a learned metric is formed before it is feed to a classifier / predictor algorithm. A metric obeys four basic assumptions.
\begin{enumerate}
\item Non-negativity: $d(x,x^{\prime}) \geq 0$.
\item Identity: $d(x,x^{\prime}) = 0$.
\item Symmetry: $d(x,x^{\prime}) = d(x^{\prime},x) $.
\item Triangle Inequality: $d(x,x^{\prime\prime} ) \geq 0$.
\end{enumerate}
Large margin nearest neighbour LMNN~\cite{weinberger2006distance} can be stated as a convex optimisation problem. This capability enables LMNN to find a global solution efficiently. The objective of the algorithm is to learn a decision rule that can group data in to pre-defined classes. In LMNN, same class neighbours are put together while imposters or data points with different classes are pushed away from each other. Figure~\ref{LMNN} shows the metric learning process with LMNN.\\
$S=\{(x_{i},x_{j}) : Y_{i}=y_{j}$ and $x_{j}$ belongs to the k-neighbourhood of $x_{i}\},$
$R = \{(x_{i},x_{j},x_{k}) : (x_{i},x_{j}) \in S, y_{i} \neq y_{k}\}.$\\
\begin{figure}[htbp]
\centering
\includegraphics[,]{LMNN.PNG}
\caption{LMNN - Target neighbours and Imposters}
\label{LMNN}
\end{figure}
\subsection{Neural Networks}
\label{nn}
Neural networks are group of algorithms that are designed based on the structure and functionality of neurons in human brain. Neural networks are mostly used for pattern recognition. Neural networks only accept numerical data only. Images, sound and text data needs to be translated to numerical data before it can be used by a neural network algorithm. Please refer to Figure~\ref{fig:example} for the conversion of image data to equivalent numerical data. The main usage of Neural network is in the field of supervised and unsupervised learning but they can also be used to extract features. Neural networks are made up of multiple layers of nodes. A node in different layer will turn on or off depending on the input it receives. Figure~\ref{NN} shows the structure of a node within a Neural network. A node is made up of an inputs, weights, input function, activation function and an output.
\begin{figure}[htbp]
\centering
\includegraphics[width=10cm]{NN.PNG}
\caption{Structure of a node in a Neural network}
\label{NN}
\end{figure}
As mentioned earlier, a layer in neural network is a row of nodes. Layer are classified in to three types; an input layer, a hidden layer and an output layer. Each layer has an input and output. Each layer's output is an input for the next layer until a final output is produced. Figure~\ref{NN1} shows layers of a neural network and how they are connected. Neural networks have wide usage in different industries. Speech recognition~\cite{choon2004lightweight}, computer vision~\cite{cho2018deep}, Pattern recognition~\cite{narendra1990identification} and financial crime detection~\cite{omolara2018state} are some of the areas where neural network helped solved many problems.
 
\begin{figure}[htbp]
\centering
\includegraphics[]{NN1.PNG}
\caption{Layers of a Neural network}
\label{NN1}
\end{figure}
Figure~\ref{NN3} shows artificial neural network classified in to different types~\cite{cho1995combining}. There is two main classes of artificial neural networks (Feed Forward NN, Feed backward NN). In FFNN, information transmits in one direction. Information is feed from input node and then passed to hidden node and finally passed on to output node~\cite{hagan1994training}. In feedback NNs, backpropagation between nodes produced a coordinated graph in sequence. 
\begin{figure}[htbp]
\centering
\includegraphics[width=12cm]{nntypes.PNG}
\caption{Classification of Artificial Neural Networks}
\label{NN3}
\end{figure}
\subsection{Reducing Dimensions}
Dimension reduction refers to the techniques of reducing the dimensionallity of data while minimising any changes to the data. With the advancement of technology, we are now able to capture more data than ever. This has helped decision makers to make effective data driven decisions. While, availability of data has helped, it has come with a curse. Unwanted data or variables that do not add much value while analysts are analysing data for a specific problem, can result in unnecessary complication. In other words, analysing many variables is not easy task and hence it is a value add exercise to analyse only a subset of relevant variables with respect to problem in hand. With the increase in data capture activities and the availability of different types of data such as images, videos, apps within smart phones, social media data or sentiments, it is very important we reduce data before analysing it. This will help in understanding data and factors influencing a specific objective. There are many dimension reduction techniques available and this area is also an active research field. Figure~\ref{DR} shows a three dimensional dataset getting converted to a two dimensional dataset with minimal loss of information. The main objectives of each technique is to reduce dimensions with minimal information loss. All techniques available for dimension reduction are group in to two categories i.e. feature selections or feature reductions techniques~\cite{carreira1997review}.
Some of the common methods are listed below.
\begin{enumerate}
\item Low Variance: Variance based methods, remove the dimensions that have low variance with respect to other methods.
\item High correlation: Variables that have high correlations also does not add much value. Instead one can use just one of such highly correlated variables. Correlation matrix of all variables can identify all variables with high correlations.
\item Backward Feature Elimination: In this technique sum of square error is calculated after eliminating each variable. The objective is to find the variable with smallest increase in sum of square error.
\item Forward Feature Selection: In this technique, we compare the performance of model by adding one variable at a time. Variables with highest improvements in performance is selected.
\item Factoring: Correlated variables are grouped together to create a single variable from two or more correlated variables. Methods such as Exploratory or Confirmatory factor analysis can be used.
\item Principal Component Analysis: Existing variables are transformed to new variables which are linear combination of original variables. The components are created in such a way that the first component accounts for most of the possible variations of original data. The second component is orthogonal to the first component. This is to capture the variation not captured by the first component.
\end{enumerate}
\begin{figure}[htbp]
\centering
\includegraphics[width=10cm]{DR.PNG}
\caption{A 3D to 2D conversion - Locally Linear Embedding}
\label{DR}
\end{figure}
The field of semi-supervised learning encapsulates many concepts and methods. To review each of these concepts in details, one requires more time. The intention is to extend this literature review in next two years as we solve our research problem. 
\section{WORK IN PROGRESS}
\subsection  {Constrained Graph kNN}
\label{cgknn}
$k$ Nearest Neighbours $k$NN has been widely used for classification owing to its simplicity and accuracy. In spite of the wide usage, $k$NN has performed poorly with a manifold distributed data. Few extensions of $k$NN are available that deal with the manifold distributed data but the extensions are either costly to compute or improvement in term of mean accuracy of classification is not significant~\cite{tu2016graph}. Classification of a manifold distributed data requires prior knowledge of the shape of data distribution. The shape information is used to assess a distance or similarity function. To help resolve some of these problems, we are introducing a new semi-supervised algorithm which we are referring to as constrained graph $k$NN ($cgk$NN). Our method can be used for traditional Gaussian distributed data classification as well as for a non-linear manifold distributed data classification. Inspired by a method called manifold $k$NN ($mk$NN) which is one of the best method for classification of a manifold distributed data as well as classification using small number of labeled observations, our method works in similar way but always outperforms $mk$NN with respect to compute time. Our method also performs as good as, and at times better than $mk$NN with respect to mean accuracy of classification and classification using small number of labeled observations.
 
Graph based classification methods are more accurate than traditional methods at classification of a manifold distributed data and have shown higher mean accuracy of classification~\cite{tu2016graph, tu2014novel}. $k$NN~\cite{cover1967nearest,wu2008top}, k-means~\cite{tu2014novel} and many other algorithms have been extended using graphs to deal with a manifold distributed data. To understand why graph based methods perform better with a manifold distributed data, we need to answer two important questions.
 \begin{enumerate}
\item What is a manifold distributed data?
\item What is random walk?
\end{enumerate}
Definition: A data set is considered to be a manifold distributed if its intrinsic dimension is less than its data space dimension \cite{tu2016graph}. A manifold is very different from dimensionality. Dimensionality refers to the number of variables or coordinates used to represent the data. The shape in Figure~\ref{swissroll} shows a two dimensional manifold represented in three dimensional space. Any point on the manifold can be represented with two coordinates, but without knowledge of the manifold, they must be represented using three coordinates. In short, we define a manifold as a continuous geometrical structure that has fixed number of dimensions. The number of intrinsic dimension of a manifold distributed data is less than its data dimension. 
\begin{figure}[htbp]
\includegraphics[width=120mm,height = 80mm,scale=0.5]{manifolddata.pdf}
\caption{A 2D manifold with 3D data points }
\label{swissroll}
\end{figure}
As mentioned earlier, traditional classification algorithms struggle to classify a manifold distributed data~\cite{zhong2016overview}. $k$NN, which is an effective and simple classification algorithm to classify normal distributed data, also fails to classify a manifold distributed data with high mean accuracy. The main reason for this limitation is the distance function used. Distance functions cannot calculate the true distance between various points of a manifold distributed dataset and thus results in high error rate. Distance metrics for the three dimensional data will calculate distances within the three dimensional space, regardless of the manifold space; ideally distances should be calculated along the manifold. Graph based methods can calculate the distance between two points on a manifold distributed dataset quite accurately~\cite{zinovyev2013data,han2011data} and they can also use other available information such as class labels of the labeled data points. To demonstrated the limitation of $k$NN and strength of graph based classification, we have provided a synthetic dataset of a cone shape with class boundary of $x = 0$. The cone shaped synthetic dataset can be generated as per below equations. 
%\begin{equation} X = Y = seq (-1,1, len =10) \end{equation}
\begin{equation}Z = \sqrt{X^2 + Y^2}\end{equation}
The dataset contains three variable $X,Y,Z$. $Z$ is a function of $X,Y$. Figure~\ref{figurpic:1} shows the graphical representation of the dataset. We have divided the dataset in to two classes (black and red). An unlabeled point is shown in green in Figure~\ref{figurpic:2}. This unlabeled data point is part of the black class but sits very close to red class. Figure~\ref{figurpic:3} shows classification by $k$NN of this unlabeled data point with $k=14$. A $k=14$ is chosen for clarity  only. All blue coloured data points are the nearest neighbours calculated by Euclidean distance used by $k$NN. It is obvious from the Figure~\ref{figurpic:3} that $k$NN choses the nearest data point with respect to Euclidean distance and hence chose any point within the data space. $k$NN classification in this scenario is inaccurate. Figure~\ref{figurpic:4} shows the same dataset classified with $cgk$NN. It is obvious from the nearest neighbours (blue points), that $cgk$NN picks up the relevant data space by considering the class information and by using graph for random walk as it identifies the local neighbourhood. The walk on this graph is what we refer to as random walk.  
\begin{figure}[htpb]
  \centering
\caption{How cgkNN works on a manifold distributed data}
 \label{figurpic}
  \subfloat[3D view of a cone]{\label{figurpic:1}\includegraphics[width=70mm,height=70mm]{Pic4.pdf}}
  \subfloat[3D view of a cone data with two classes and a unlabeled data point]{\label{figurpic:2}\includegraphics[width=70mm,height=70mm]{pic1.pdf}}
  \\
  \subfloat[14 nearest point to unlabeled point -kNN]{\label{figurpic:3}\includegraphics[width=70mm,height=70mm]{pic2.pdf}}
  \subfloat[14 nearest point to unlabeled point -cgkNN]{\label{figurpic:4}\includegraphics[width=70mm,height=70mm]{pic3.pdf}}
\end{figure}
\begin{table}
\caption{k closest neighbours by weight}
\label{tab:kclosest}
\centering
\begin{tabular}{l l l l l l}
\toprule
Closest k &1 & 2 &  3 & 4 & 5\\
\midrule
Obsrv No &12 &10&2&19&11\\
weight & 7.006982e-4 &7.002936e-4&6.998980e-4&6.990813e-4&6.988739e-4\\
\bottomrule
\end{tabular}
\begin{tabular}{l l l l}
\toprule
Class &0 & 1\\
\midrule
Sum of weight &2.1e-3 &1.4e-3\\
\bottomrule
\end{tabular}
\end{table}
Constrained Graph $k$NN ($cgk$NN) is an extension of $k$NN whereby we use tired random walk to identify nearest neighbours. The method can be applied to labeled as well as unlabeled datasets. While our method outperforms many known methods for labeled datasets it is among the best methods for classification of datasets with many unlabeled and few labeled observations. The experimental results shown compares our method with the best available. We use an approach called tired random walk also called constrained random walk to measure the distance \cite{tu2016graph}. We have used the class labels of the labeled data for distribution as well as constrained information. The constrained information is used to modify the weight of graph edges between labeled samples. While our algorithm is explained in detail later with an example, a step by step procedure is outlined below.
\subsubsection{The algorithm}
\begin{enumerate}
\item Input $\rchi = \rchi_{T} \cup \rchi_{U} , y $ and $k$, where $ \rchi_{T}$  is labeled data, $ \rchi_{U}$ is unlabeled dataset, $y$ is a class label, $k$ is number of nearest neighbours, $\sigma$ is the Gaussian smoothing, $\alpha$ is the strength reduction rate.
\item Create a graph adjacency matrix $W$ of dataset $\rchi$ as per below.
\begin{enumerate}
\item $W_{ij} = 1$ if $x_{i}, x_{j}  \in  \rchi_{T}$ and have same class label.
\item $W_{ij} = 0$ if $x_{i}, x_{j} \in \rchi_{T}$ and have different class labels.
\item $W_{ij} =exp(-\|{x_{i} - x_{j}}\|^2/2\sigma^2)$ if at least one of $x_{i}, x_{j}$ is unlabeled.
\end{enumerate}
\item Calculate transition matrix $P$ as per below. $D$ is a diagonal matrix.
\begin{equation}
P = D^{-1}W
\end{equation}
\item Compute the $P_{TRW}$ (tired random walk transition matrix) using equation 
\begin{equation}
 P_{TRW} = \sum_{t= 0}^\infty (\alpha P)^t = (I -\alpha P)^{-1} 
\end{equation}
\item Evaluate sample's similarity using equation
\begin{equation}
\bar{w}_{ij} = w(x_{i},x_{j}) = \frac{(P_{TRW})_{ij} + (P_{TRW})_{ji}} {2}
\end{equation}
\item Find $k$ nearest neighbours of an unlabeled sample using equation
\begin{equation}
x_{i} =\underset{x_{j} \in \rchi_{T}}    {\mathrm{arg\,max }}\,  w(x,x{j})
\end{equation} 
\item Determine the class label using equation. $C$ is the class label and $x$ is a sample observation.
\begin{equation}
y = \underset{c=1,2,....,C}{\mathrm{arg\,max }} \sum^k_{i=1} w(x,x_{i})I(y_{i} = c)
\end{equation}
\end{enumerate}
In tired random walk, transition probability of a walker reduces with a fixed proportion $\alpha$. In our algorithm we have fixed this ratio to 0.01 and thus the transition probability becomes smaller after a fixed number of steps. Please see~\cite{tu2016graph} for more details on why tired random walk performs better than traditional random walk. $\alpha$ forces tired random walk to identify local neighbourhood. Another common approach for a manifold distributed data classification is the use of strengthening trees. Strengthening trees are used to reinforce strong relationship and reduce the weaker ones. We have avoided using any strengthening mechanism.  We have conducted many experiments to evaluate the cost and benefit of using such strengthening mechanism and have found that such mechanisms do not add much value but have unnecessary overhead in term of additional computational cost. This is one of the reason why our method works much faster than the one mentioned in~\cite{tu2016graph}. Please refer to Figure~\ref{timebydataset} for difference in compute cost between $mk$NN and $cgk$NN. The main cost of using strengthen trees is the compute time. Algorithms that use strengthen trees as a strengthening mechanism build these trees for each labeled observation in training dataset. Thus, their processing cost increases as the number of labeled observations increases. This makes the approach presented in~\cite{tu2016graph} called $mk$NN only applicable to datasets with few labeled and many unlabeled observations. This also makes this approach undesirable for a dataset that has many labeled observations. In other words, the method will not take advantage of many labeled observations and restrict itself to $k$ observations per class.  As strengthen trees are required for each labeled observation, $mk$NN restrict itself to very small number of labeled observation. The algorithm struggles to compute classification of an average size dataset if decent number of observations were selected per class and not $k$ labeled observations.
Most of the methods that can classify a manifold distributed data use random walks over a graph to identify nearest neighbours. Random walk and creating trees to identify nearest or furthest neighbours is a costly activity. It is not worth using these methods unless the gain outweighs the additional cost. Algorithms that use trees ultimately reduces processing cost by compromising other cost intensive processing activities. Like in~\cite{tu2014novel,tu2016graph} the cost is minimized by selecting only $k$ labeled points per class for training. This is the same $k$ in $K$ nearest neighbour classification. Obviously, the number is limited to $k$ to keep the cost of creating trees to minimal. Similar approaches are adopted by different algorithms whereby creation of trees is kept to minimal to avoid costly processing. 
In short, traditional $k$NN struggles with classification of a manifold distributed data. Graph based approaches can classify a manifold distributed data but have very high compute cost. Our method can be described as best of both i.e. high mean accuracy and less compute cost for classification using graph.
\subsubsection {Example}
The dataset used in this example is a subset of banknotes dataset. The dataset contains twenty labeled observations, ten from each class and one unlabeled observation.  We have intentionally selected a subset that will result in incorrect classification by $k$NN and correct classification by $cgk$NN. The objective is to show how traditional $k$NN can misclassify a manifold distributed data based on Euclidean distance and how the same observation is classified correctly using $cgk$NN. Banknotes dataset has two classes i.e. 1 or 0 and hence our example also has two classes. Figure~\ref{figurtest} shows the dataset from two different angles. It also shows the nearest neighbours as classified by Euclidian distance of $k$NN and also by $cgk$NN. The yellow point is the unlabeled data point while the aqua colour points represent the nearest neighbours. Blue colour represents class $0$ while pink color represents class $1$. The nearest neighbours are either class $0,1$. You can see that observations ${2,9:12}$ are calculated as the closest five observations to the unlabeled observation using Euclidian distance. The same dataset returns different nearest neighbours with $cgk$NN. The nearest neighbours as per $cgk$NN weightage are observations ${2,10:12,19}$. As the data points are in three dimensional space, we have shown two different views of the same for each method. The correct calculation of one nearest neighbour results in correct classification of the unlabeled point. $k$NN classify this unlabeled point as class 1 while $cgk$NN classify this unlabeled point as class 0 which is correct classification.
\begin{figure}[htpb]
  \centering
\caption{Location of unlabeled point and five nearest neighbours - kNN and cgkNN}
 \label{figurtest}
  \subfloat[kNN]{\label{figurtest:1}\includegraphics[width=50mm]{plot1knn.pdf}}
  \subfloat[kNN]{\label{figurtest:2}\includegraphics[width=50mm]{plot2knn.pdf}}
  \\
  \subfloat[cgkNN]{\label{figurtest:3}\includegraphics[width=50mm]{plot1cgknn.pdf}}
  \subfloat[cgkNN]{\label{figurtest:4}\includegraphics[width=50mm]{plot2cgknn.pdf}}
\end{figure}
Figure~\ref{figurcgknn} shows different views of how $cgk$NN classify this unlabeled points. It not only calculates the nearest points accurately; it uses the class information to drive a better outcome. The Final weight matrix based on tired random walk of the mentioned dataset is shown in Table~\ref{tab:kclosest}. Using this matrix $cgk$NN calculates the sum of weights and thus assigns the unlabeled data point to the class with highest sum of weights.
\begin{figure}[htp]
  \centering
  \caption{Different views of how cgkNN cluster various classes and then classify}
  \label{figurcgknn}
  \subfloat[]{\label{figurcgknn:1}\includegraphics[width=50mm]{cgknn1.pdf}}
  \subfloat[]{\label{figurcgknn:2}\includegraphics[width=50mm]{cgknn2.pdf}}
  \\
  \subfloat[]{\label{figurcgknn:3}\includegraphics[width=50mm]{cgknn3.pdf}}
  \subfloat[]{\label{figurcgknn:4}\includegraphics[width=50mm]{cgknn4.pdf}}
\end{figure}
As mentioned in the algorithm, the nearest points are calculated using similarity matrix $ \bar{w}_{ij} = w(x_{i},x_{j}) = \frac{(P_{TRW})_{ij} + (P_{TRW})_{ji}} {2}$. Once a similarity weight matrix is calculated, nearest k labeled points are identified along with their classes. This is followed by sum of weight operations by class. The class with highest sum of weights wins and unlabeled point is labeled with the same class.
\begin{figure}[htp]
  \centering
  \caption{k closest observations to unlabeled observation}
\label{figurcgfplot}
  \subfloat[]{\label{figurcgfplot:1}\includegraphics[width=50mm]{FPlot.pdf}}
  \subfloat[]{\label{figurcgfplot:2}\includegraphics[width=50mm]{FPlot2.pdf}}
\end{figure}
Table~\ref{tab:kclosest} also shows sum of weights by class for this specific example. The sum of weight for class $0$ is higher than the sum of weight for class $1$. Thus the algorithm classifies the unlabeled observation to class $0$.
\begin{figure}[htbp]
\includegraphics[width=14cm,height=10cm]{ComparecgkNN.pdf}
\caption{Mean error rate comparision of cgkNN and mkNN}
\label{compare}
\end{figure}
\begin{figure}[htbp]
\includegraphics[width=14cm,height=19cm]{Comparen.pdf}
\caption{Mean error rate of five real-world data sets with different number of training sets. Mean error rate on the ordinate and k on the abscissa.}
\label{knnbydataset}
\end{figure}
\begin{figure}[htbp]
\includegraphics[width=16cm,height=13cm]{Pic6.pdf}
\caption{Compute time of five real-world data sets with different number of training sets. Compute time on the ordinate and Sample size on the abscissa.}
\label{timebydataset}
\end{figure}
\begin{table}[ht]
\centering
\begin{tabular}{@{\extracolsep{4pt}}llrrrrrrr}
\toprule   
{} & {} & \multicolumn{2}{c}{$\phi$ Error}  & \multicolumn{2}{c}{Processing time}& \multicolumn{2}{c}{Difference}\\
 \cmidrule{3-4} 
 \cmidrule{5-6} 
 \cmidrule{7-8} 
  Dataset & \#TR & $cgkNN$ & $mkNN$ & $cgkNN$ & $mkNN$ & $\phi$ Error & Time \\ 
\midrule
Banknotes&20&24.4&23.8&1,538&1,873&0.6&-335\\
Banknotes&200&6.4&6.1&2,532&2,944&0.3&-412\\
Banknotes&250&5.6&5.6&4,493&5,307&0&-814\\
Banknotes&300&5.1&5.2&4,044&4,782&-0.1&-738\\
Banknotes&400&4.3&4.2&3,930&5,220&0.1&-1,290\\
Multifeature&100&38.5&38.5&7,055&9,829&0&-2,774\\
Multifeature&200&29.1&29.3&15,975&18,466&-0.2&-2,491\\
Multifeature&300&24.8&24.6&16,773&19,782&0.2&-3,009\\
Pendigits&500&5.6&6.1&17,084&24,234&-0.5&-7,150\\
Pendigits&600&5.2&5.3&44,068&49,194&-0.1&-5,126\\
Pendigits&900&4.4&3.8&53,000&67,857&0.6&-14,857\\
Segmentation&70&27.9&27.9&13,908&15,570&0&-1,662\\
Segmentation&420&16.4&16&21,742&26,178&0.4&-4,436\\
Segmentation&875&12.4&12.3&28,541&32,940&0.1&-4,399\\
Statlog&180&22.6&17.9&4,109&6,445&4.7&-2,336\\
Statlog&240&17.4&16.8&5,456&7,053&0.6&-1,597\\
Statlog&360&16.6&14.9&6,252&9,166&1.7&-2,914\\
\bottomrule
\end{tabular}
\caption{Mean accuracy of prediction and processing time mkNN vs cgkNN} 
\label{tab:ResultsAvgAccuracy}
\end{table}
\subsubsection {Experimental results}
\label{results}
We have performed experiments on six publically available datasets. These are the same datasets used by~\cite{tu2016graph}. The objective was to compare computational cost and mean accuracy of prediction for $cgk$NN and $mk$NN. Mean accuracy is calculated using result from ten different iterations with random samples in each iteration. During experiments we noticed that the mean accuracy of predictions using traditional $k$NN declines as the number of k increases. This is quite an obvious pattern in all of the mentioned datasets.
Please refer to Figure~\ref{compare} to see the comparison of mean accuracy by $mk$NN and $cgk$NN for various datasets. Banknotes dataset is compared twice but with different number of labeled observations. It is very clear from Figure~\ref{compare} that  mean accuracy of $cgk$NN is always in line with $mk$NN irrespective of number of labeled observations available. Table~\ref{tab:ResultsAvgAccuracy} shows mean accuracy for various datasets with different size of training. It should be noted that while $cgk$NN performs as good as or better than $mk$NN with respect to mean accuracy of prediction, it is is much faster than $mkk$NN. Table~\ref{tab:ResultsAvgAccuracy} shows the processing time for various datasets. It is obvious from the results that $cgk$NN can produce same results as $mk$NN but with almost half the time required for computation. It should be noted that these calculations are based on same datasets and all environmental variables such as network, processing power were keep the same to get a comparable result.
You would notice that the mean accuracy of classification of all these datasets increases if we consider a good size training dataset. This may not be achievable in scenarios where we have limited labeled observations.
Figure~\ref{knnbydataset} shows the mean error rate of various datasets for k of 1 to 10 using $mk$NN and $cgk$NN. Figure~\ref{knnbydataset} also compares the result for different number of training observations from the same dataset. It is obvious from Figure~\ref{knnbydataset} that the performance of both methods are in  line with each other irrespective of number of training observations selected. On the other hand, Table~\ref{tab:ResultsAvgAccuracy} shows a comparison of our method to $mk$NN with respect to mean accuracy and processing time. The results are based on selection of various number of training observations per class. You can see from results that while mean accuracy remains the same for both the methods, our method is much efficient on processing time. Processing time of $mk$NN gets worse as the number training observations increases. The main reason for this is the usage of strengthen trees in $mk$NN which are required for all the labeled training observations. Thus the processing time increases as we increase number of observations. Figure~\ref{timebydataset} shows a comparison of compute time for various samples of the five datasets for $mk$NN and $cgk$NN. It is visible that $cgk$NN performs much better with respect to compute time when compared to $mk$NN.
\section{RESEARCH AIM AND OBJECTIVES}
\subsection{Overall Research Aim}
This research will examine different methods of semi-supervised learning with specific emphasis on a manifold distributed data. The research will also examine best approaches to semi-supervised learning with limited number of labeled observations. New methods to semi-supervised learning and optimisation / extension of existing methods are also in scope.
\subsection{Research Objectives}
The objectives of this research are as follows:
\begin{enumerate}
\item Part 1: Below areas will be covered.
\begin{enumerate} 
\item Examine and explore graph based methods for classification and clustering. Explore and explains why random walk works in these methods.
\item Examine and explore the application of multi-dimensional scaling in hierarchical and a manifold clustering.
\item Examine and explore semi-supervised learning for image classification, especially work done by Facebook.
\end{enumerate} 
\item Part 2: Below areas will be covered.
\begin{enumerate} 
\item Examine and explore distance metric learning and classification. This will be a thorough review of work done by Kilian Weinberger including Large Margin nearest neighbour LMNN.
\item Examine and explore self-supervised learning in context of the work done by google research. 
\end{enumerate} 
\item Part 3: Contributions
\begin{enumerate}
\item Development of a graph based classification algorithm for a manifold distributed data and unlabeled data. Submission of the same to a relevant journal.
\item An article on why random walk or tired random walk works. Submission of the same to a relevant conference.
\item Development of new algorithm using graph based methods for clustering. Submission to a relevant journal. 
\item Extension of existing graph based algorithms and potential paper submissions to a relevant conference.
\item Development / extension / optimisation of an image classifier. Submission of the same to a journal.
\end{enumerate}
\end{enumerate}
\subsection{Research Questions}
The following research questions will be applied to this area of study:
\begin{enumerate}
\item What is the most effective way of classifying a manifold distributed data? 
\item What is the most effective way of labeling data with small number of labeled and large number of unlabeled observations?
\end{enumerate}
Most of the next two years of research will revolve around answering these two questions. 
\section{PROPOSED RESEARCH SCHEDULE}
A Gantt timeline chart was designed to establish rough timeline of all activities for next three years. This is to ensure all activities are completed within the agreed time frame. Please refer to Appendix~\ref{PRC} for Gantt Chart and Schedule. To date, timeframes allocated for various tasks have been met, and we are progressing without any difficulty. 
\section {CONTRIBUTION OF PROPOSED RESEARCH}
\subsection{Benefits of Research}
This research will examine how different types of semi-supervised learning behave with small number of labeled observations. The focus will be to optimise algorithms to work with small number of labeled observations and produce high mean accuracy of classification. The research will also be focusing on classification of a manifold distributed data and data with high number of dimensions while, concurrently, investigating the semi-supervised algorithms for breast cancer image classification. This is in order to address the gaps in classification accuracy of medical images.
This research will also examine work done by Facebook and google in the field of semi-supervised learning and image classification with a view to improve and build on existing work. 
\subsection{Communication of Results}
Results will be communicated mainly via publications in journals. Plans for five papers for submission to relevant journals are outlined below in Table~\ref{tab:papers}.
\begin{table}[]
\centering
\begin{tabular}{@{}lll@{}}
\toprule
Paper & Title/Content & Respective Journal \\ \midrule
 1     &   Constrained Graph Nearest Neighbour Classification            &        Elsevier            \\
 2   &     Probabilistic Nearest Neighbours Classification for Unlabeled Data          &  AI2019                  \\ 
 3   &     Clustering using constrained Graph Nearest Neighbours          &  	     Elsevier             \\ 
 4   &     Semi-supervised learning for breast cancer image classification          &  TBA                  \\ 
 5   &     Semi-supervised learning with metric learning          &  TBA                  \\ 
 
 
\bottomrule
\end{tabular}
\caption{Outline of Proposed Papers for Publication} 
\label{tab:papers}
\end{table}
\section{CONCLUSION }
Achieving higher mean accuracy of classification can add tremendous value to tasks that use various classification algorithms. The cost of labeling unlabeled observations is increasing with the increase in cost of human resources. Therefore, it is important that effective semi-supervised learning algorithms are developed that can label unlabeled observations and can achieve high mean accuracy with minimal cost of human resource. 
This report has highlighted the potential for further research in this area. To develop new or optimise the existing algorithms for classification that can achieve high mean accuracy of classification irrespective of the type of data used, is the basic objective of this project. 
\nocite{*}
\bibliographystyle{abbrv}
\bibliography{bib}
\appendix
\section{Appendix - Copy of Journal Article Ready for Submission}
\includepdf[pages=-]{paper2}
\section{Appendix - Copy of Journal Article Sent for Peer Review}
% the \\ insures the section title is centered below the phrase: Appendix B
\includepdf[pages=-]{paper1}
\section{Appendix - Proposed Research Schedule}
\label{PRC}
\begin{figure}[htbp]
\centering
\includegraphics[]{Schedule.pdf}
\caption{Proposed Research Schedule}
\label{PRS}
\end{figure}
\listoftables 
\listoffigures
\end{document}
This is never printed