
# Selection algorithm

Summary

In computer science, a selection algorithm is an algorithm for finding the kth smallest value in a collection of ordered values, such as numbers. The value that it finds is called the kth order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect and the median of medians algorithm. When applied to a collection of n values, these algorithms take linear time, O(n), as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time O(1).
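A minimal sketch of the already-sorted extreme case (Python, with made-up sample values): once the collection is sorted, the kth smallest value is a single index lookup.

```python
# kth order statistic (1-indexed) of unstructured data:
# sorting first costs O(n log n), but every subsequent
# selection in the sorted list is an O(1) index lookup.
values = [7, 4, 9, 1, 5]
k = 3
ordered = sorted(values)  # [1, 4, 5, 7, 9]
kth = ordered[k - 1]      # 3rd smallest
print(kth)                # 5
```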
Problem statement
An algorithm for the selection problem takes as input a collection of values, and a number k. It outputs the kth smallest of these values, or, in some versions of the problem, a collection of the k smallest values. For this to be well-defined, it should be possible to sort the values into an order from smallest to largest.
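The problem statement can be sketched with quickselect, one of the linear-expected-time algorithms named in the summary. This is an illustrative list-based version (the function name and inputs are made up here), not an in-place implementation:

```python
import random

def quickselect(values, k):
    """Return the kth smallest element (1-indexed) of values."""
    assert 1 <= k <= len(values)
    pivot = random.choice(values)
    # Three-way partition around the pivot.
    smaller = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    larger = [v for v in values if v > pivot]
    if k <= len(smaller):
        return quickselect(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot
    return quickselect(larger, k - len(smaller) - len(equal))

print(quickselect([3, 1, 4, 1, 5, 9, 2, 6], 4))  # 3
```

Random pivots give O(n) expected time; replacing the pivot rule with median of medians makes the worst case linear as well.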




Related concepts (11)

Sorting algorithm

In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending.

Quicksort

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting.
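As a rough sketch of the idea (a simple list-based version, not Hoare's original in-place partition scheme, with made-up sample input):

```python
def quicksort(values):
    # Recursively partition around a pivot, then concatenate.
    if len(values) <= 1:
        return values
    pivot = values[len(values) // 2]
    smaller = [v for v in values if v < pivot]
    equal = [v for v in values if v == pivot]
    larger = [v for v in values if v > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([3, 6, 1, 8, 2, 9, 4]))  # [1, 2, 3, 4, 6, 8, 9]
```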

Heap (data structure)

In computer science, a heap is a specialized tree-based data structure that satisfies the heap property: In a max heap, for any given node C, if P is a parent node of C, then the key (the value) of P is greater than or equal to the key of C; in a min heap, the key of P is less than or equal to the key of C.
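A small illustration with Python's standard heapq module, which maintains the min-heap property on a plain list (the sample data is made up):

```python
import heapq

# heapq stores a binary min-heap in a flat list: heap[0] is always
# the smallest element, and every parent key is <= its children's keys.
data = [9, 4, 7, 1, 8, 3]
heapq.heapify(data)          # O(n) heap construction
assert data[0] == min(data)  # min-heap property at the root

# Heaps also give a simple O(n log k) selection routine:
k_smallest = heapq.nsmallest(3, [9, 4, 7, 1, 8, 3])
print(k_smallest)  # [1, 3, 4]
```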

Related courses (12)

CS-422: Database systems

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as read and understand recent research publications.

CS-322: Introduction to database systems

This course provides a deep understanding of the concepts behind data management systems. It covers fundamental data management topics such as system architecture, data models, query processing and optimization, database design, storage organization, and transaction management.

CS-233(b): Introduction to machine learning (BA4)

Machine learning and data analysis are becoming increasingly central in many sciences and applications. In this course, fundamental principles and methods of machine learning will be introduced, analyzed and practically implemented.

Related publications (25)

This thesis focuses on designing spectral tools for graph clustering in sublinear time. With the emergence of big data, many traditional polynomial-time, and even linear-time, algorithms have become prohibitively expensive. Processing modern datasets requires a new set of algorithms for computing with extremely constrained resources, i.e., *sublinear algorithms*. Clustering is one of the well-known techniques for solving large-scale optimization problems in a wide variety of domains, including machine learning, data science and graph analysis \cite{aydin2016distributed, rolnick2016geocuts, gargi2011large}. Efficient sublinear solutions for fundamental graph clustering problems require going well beyond classic techniques. In this thesis, we present an *optimal* sublinear-time algorithm for the *testing k-clusterability problem*, i.e., quickly determining whether the graph can be partitioned into at most k expanders, or is far from any such graph. This is a generalization of a well-studied problem of testing graph expansion. The classic results on testing k-clusterability either consider the testing expansion problem (i.e., k = 1 vs k ≥ 2) \cite{KaleS_SIAMJC11, NachmiasS10}, or address the problem for larger values of k under the assumption that the gap between conductances of accepted and rejected graphs is at least logarithmic in the size of the graph \cite{CzumajPS_STOC15}. We overcome these barriers by developing novel spectral techniques based on analyzing the spectrum of the Gram matrix of random walk transition probabilities. We complement our algorithm with a matching lower bound on the query complexity of testing k-clusterability, which improves upon the long-standing previous lower bound for testing graph expansion. Furthermore, we extend our previous result from the *property testing* framework to an efficient clustering algorithm in the *local computation algorithm* (LCA) model.
We focus on a popular variant of graph clustering where the input graph can be partitioned into k expanders with outer conductance bounded by ε. We construct a small-space data structure that allows quickly classifying vertices of G according to the cluster they belong to in sublinear time. Our spectral clustering oracle provides O(ε log k) error per cluster for any ε ≪ 1/log k. Our main contribution is a sublinear-time oracle that provides dot product access to the spectral embedding of the graph. We estimate dot products with high precision using an appropriate linear transformation of the Gram matrix of random walk transition probabilities. Finally, using dot product access to the spectral embedding, we design a spectral clustering oracle. At a high level, our approach amounts to hyperplane partitioning in the spectral embedding of the graph, but crucially operates on a nested sequence of carefully defined subspaces in the spectral embedding to achieve per-cluster recovery guarantees.

Barbara Caputo, Ilja Kuzborskij

We study the binary transfer learning problem, focusing on how to select sources from a large pool and how to combine them to yield a good performance on a target task. In particular, we consider the transfer learning setting where one does not have direct access to the source data, but rather employs the source hypotheses trained from them. Building on the literature on the best subset selection problem, we propose an efficient algorithm that selects relevant source hypotheses and feature dimensions simultaneously. On three computer vision datasets we achieve state-of-the-art results, substantially outperforming transfer learning and popular feature selection baselines in a small-sample setting. Also, we theoretically prove that, under reasonable assumptions on the source hypotheses, our algorithm can learn effectively from few examples.

Barbara Caputo, Ilja Kuzborskij

In this paper we consider the binary transfer learning problem, focusing on how to select and combine sources from a large pool to yield a good performance on a target task. Constraining our scenario to the real world, we do not assume direct access to the source data, but rather we employ the source hypotheses trained from them. We propose an efficient algorithm that selects relevant source hypotheses and feature dimensions simultaneously, building on the literature on the best subset selection problem. Our algorithm achieves state-of-the-art results on three computer vision datasets, substantially outperforming both transfer learning and popular feature selection baselines in a small-sample setting. We also present a randomized variant that achieves the same results with a computational cost independent of the number of source hypotheses and feature dimensions. Also, we theoretically prove that, under reasonable assumptions on the source hypotheses, our algorithm can learn effectively from few examples.

2017