Why is Armadillo so slow compared to a C-style array in a simple row-wise computationnal task












3















I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.



More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.



I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.



Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. Also tried to use ARMA_NO_DEBUG without success.



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";

//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;

Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));

std::vector< std::vector<double> > V(huge_mat.n_rows);

//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];

t1 = high_resolution_clock::now();

for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);

//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));

//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}

t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}


In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows the plain C tends to be (very much) faster. For a simple test using this
setup i got:



ARMADILLO: 111 secs
C ARRAY: 79 secs


in execution time.



EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs now), thus showing some improvments over working row-wise:



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}









share|improve this question




















  • 2





    First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

    – rubenvb
    Nov 23 '18 at 15:11






  • 1





    Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

    – Damien
    Nov 23 '18 at 15:29






  • 1





    Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

    – John Bollinger
    Nov 23 '18 at 15:50








  • 2





    This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

    – mtall
    Nov 26 '18 at 7:53








  • 2





    @Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

    – mtall
    Nov 26 '18 at 14:33
















3















I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.



More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.



I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.



Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. Also tried to use ARMA_NO_DEBUG without success.



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";

//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;

Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));

std::vector< std::vector<double> > V(huge_mat.n_rows);

//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];

t1 = high_resolution_clock::now();

for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);

//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));

//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}

t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}


In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows the plain C tends to be (very much) faster. For a simple test using this
setup i got:



ARMADILLO: 111 secs
C ARRAY: 79 secs


in execution time.



EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs now), thus showing some improvments over working row-wise:



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}









share|improve this question




















  • 2





    First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

    – rubenvb
    Nov 23 '18 at 15:11






  • 1





    Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

    – Damien
    Nov 23 '18 at 15:29






  • 1





    Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

    – John Bollinger
    Nov 23 '18 at 15:50








  • 2





    This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

    – mtall
    Nov 26 '18 at 7:53








  • 2





    @Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

    – mtall
    Nov 26 '18 at 14:33














3












3








3








I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.



More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.



I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.



Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. Also tried to use ARMA_NO_DEBUG without success.



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";

//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;

Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));

std::vector< std::vector<double> > V(huge_mat.n_rows);

//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];

t1 = high_resolution_clock::now();

for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);

//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));

//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}

t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}


In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows the plain C tends to be (very much) faster. For a simple test using this
setup i got:



ARMADILLO: 111 secs
C ARRAY: 79 secs


in execution time.



EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs now), thus showing some improvments over working row-wise:



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}









share|improve this question
















I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.



More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.



I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.



Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. Also tried to use ARMA_NO_DEBUG without success.



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";

//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;

Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));

std::vector< std::vector<double> > V(huge_mat.n_rows);

//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];

t1 = high_resolution_clock::now();

for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);

//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));

//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}

t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}


In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows the plain C tends to be (very much) faster. For a simple test using this
setup i got:



ARMADILLO: 111 secs
C ARRAY: 79 secs


in execution time.



EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs now), thus showing some improvments over working row-wise:



#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols

const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

high_resolution_clock::time_point t1 = high_resolution_clock::now();

//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line

//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {

//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}

high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}






c++ arrays matrix armadillo






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 27 '18 at 10:24







Grasshoper

















asked Nov 23 '18 at 15:04









GrasshoperGrasshoper

527




527








  • 2





    First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

    – rubenvb
    Nov 23 '18 at 15:11






  • 1





    Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

    – Damien
    Nov 23 '18 at 15:29






  • 1





    Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

    – John Bollinger
    Nov 23 '18 at 15:50








  • 2





    This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

    – mtall
    Nov 26 '18 at 7:53








  • 2





    @Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

    – mtall
    Nov 26 '18 at 14:33














  • 2





    First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

    – rubenvb
    Nov 23 '18 at 15:11






  • 1





    Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

    – Damien
    Nov 23 '18 at 15:29






  • 1





    Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

    – John Bollinger
    Nov 23 '18 at 15:50








  • 2





    This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

    – mtall
    Nov 26 '18 at 7:53








  • 2





    @Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

    – mtall
    Nov 26 '18 at 14:33








2




2





First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11





First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.

– rubenvb
Nov 23 '18 at 15:11




1




1





Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29





Not related: A little optimisation for both codes. You can avoid the test k != j, then subtract the term corresponding to j.

– Damien
Nov 23 '18 at 15:29




1




1





Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50







Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.

– John Bollinger
Nov 23 '18 at 15:50






2




2





This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53







This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to Mat_full via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr(). This means you can directly use C like access when needed.

– mtall
Nov 26 '18 at 7:53






2




2





@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33





@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.

– mtall
Nov 26 '18 at 14:33












1 Answer
1






active

oldest

votes


















1














The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.



To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.






share|improve this answer
























  • Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

    – Grasshoper
    Nov 27 '18 at 13:22











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53448997%2fwhy-is-armadillo-so-slow-compared-to-a-c-style-array-in-a-simple-row-wise-comput%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.



To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.






share|improve this answer
























  • Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

    – Grasshoper
    Nov 27 '18 at 13:22
















1














The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.



To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.






share|improve this answer
























  • Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

    – Grasshoper
    Nov 27 '18 at 13:22














1












1








1







The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.



To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.






share|improve this answer













The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.



To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 27 '18 at 10:56









The Quantum PhysicistThe Quantum Physicist

11.6k64596




11.6k64596













  • Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

    – Grasshoper
    Nov 27 '18 at 13:22



















  • Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

    – Grasshoper
    Nov 27 '18 at 13:22

















Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22





Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.

– Grasshoper
Nov 27 '18 at 13:22


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53448997%2fwhy-is-armadillo-so-slow-compared-to-a-c-style-array-in-a-simple-row-wise-comput%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Berounka

Different font size/position of beamer's navigation symbols template's content depending on regular/plain...

Sphinx de Gizeh