Why is Armadillo so slow compared to a C-style array in a simple row-wise computationnal task
I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.
More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.
I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.
Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm
. Also tried to use ARMA_NO_DEBUG
without success.
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;
Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));
std::vector< std::vector<double> > V(huge_mat.n_rows);
//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];
t1 = high_resolution_clock::now();
for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);
//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));
//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}
t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}
In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows
the plain C tends to be (very much) faster. For a simple test using this
setup i got:
ARMADILLO: 111 secs
C ARRAY: 79 secs
in execution time.
EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs
now), thus showing some improvments over working row-wise:
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}
c++ arrays matrix armadillo
|
show 5 more comments
I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.
More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.
I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.
Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm
. Also tried to use ARMA_NO_DEBUG
without success.
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;
Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));
std::vector< std::vector<double> > V(huge_mat.n_rows);
//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];
t1 = high_resolution_clock::now();
for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);
//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));
//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}
t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}
In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows
the plain C tends to be (very much) faster. For a simple test using this
setup i got:
ARMADILLO: 111 secs
C ARRAY: 79 secs
in execution time.
EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs
now), thus showing some improvments over working row-wise:
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}
c++ arrays matrix armadillo
2
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
1
Not related: A little optimisation for both codes. You can avoid the testk != j
, then subtract the term corresponding toj
.
– Damien
Nov 23 '18 at 15:29
1
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
2
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data toMat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example:double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.
– mtall
Nov 26 '18 at 7:53
2
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33
|
show 5 more comments
I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.
More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.
I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.
Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm
. Also tried to use ARMA_NO_DEBUG
without success.
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;
Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));
std::vector< std::vector<double> > V(huge_mat.n_rows);
//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];
t1 = high_resolution_clock::now();
for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);
//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));
//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}
t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}
In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows
the plain C tends to be (very much) faster. For a simple test using this
setup i got:
ARMADILLO: 111 secs
C ARRAY: 79 secs
in execution time.
EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs
now), thus showing some improvments over working row-wise:
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}
c++ arrays matrix armadillo
I'm currently computing a small quantity for each value of a big matrix (millions of rows, number of columns < 1000) while considering each row independently.
More precisely, for each value M(i,j) in each row i, column j of this matrix, the quantity is simply [ M(i,j) - mean(i,s) ] / std(i,s) where s is the subset s in M(i,:) - j
in other words, s is the subset of all values of row i without value j.
I compared two implementations, one in C-style array and one in Armadillo, and Armadillo is roughly twice slower in termes of execution time. I would expect a similar or slighty slower execution time, but plain C arrays seem to dramatically improve the performance.
Is there any particular reason or somthing that I missed somewhere? Here is a example compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm
. Also tried to use ARMA_NO_DEBUG
without success.
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 2000000; //number of rows
unsigned ncols = 100; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < nrows; i++) {
const arma::rowvec current_line = huge_mat.row(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < ncols; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, ncols-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
//------------------PLAIN C Array
double *Mat_full;
double *output;
unsigned int i,j,k;
double mean=0, stdd=0;
double sq_diff_sum = 0, sum=0;
double diff = 0;
Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
output = (double *) malloc(nrows * ncols * sizeof(double));
std::vector< std::vector<double> > V(huge_mat.n_rows);
//Some UGLY copy from arma::mat to double* using a vector:
for (size_t i = 0; i < huge_mat.n_rows; ++i)
V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));
//then dump to Mat_full array:
for (i=0; i < V.size(); i++)
for (j=0; j < V[i].size(); j++)
Mat_full[i + huge_mat.n_rows * j] = V[i][j];
t1 = high_resolution_clock::now();
for(i=0; i < nrows; i++)
for(j=0; j < ncols; j++)
{
//compute mean of subset-------------------
sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
sum = sum + Mat_full[i+k*nrows];
}
mean = sum / (ncols-1);
//compute standard deviation of subset-----
sq_diff_sum = 0;
for(k = 0; k < ncols; k++)
if(k!=j)
{
diff = Mat_full[i+k*nrows] - mean;
sq_diff_sum += diff * diff;
}
stdd = sqrt(sq_diff_sum / (ncols-2));
//export to plain C array:
output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
}
t2 = high_resolution_clock::now();
duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "C ARRAY: " << duration << " secsn";
}
In particular the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I did not perform any in-depth analyse of the size-effect over performance, but it seems that for small values of nrows
the plain C tends to be (very much) faster. For a simple test using this
setup i got:
ARMADILLO: 111 secs
C ARRAY: 79 secs
in execution time.
EDIT
Here is modification where we work column-wise instead of row-wise and treat each column independently, as suggested by @rubenvb and @mtall. The resulting execution time slightly is decreased (ARMADILLO: 104 secs
now), thus showing some improvments over working row-wise:
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>
using namespace std::chrono;
/***************************
* main()
***************************/
int main( int argc, char *argv )
{
unsigned nrows = 100; //number of rows
unsigned ncols = 2000000; //number of cols
const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.
high_resolution_clock::time_point t1 = high_resolution_clock::now();
//compute some normalization over each value of line of this huge matrix:
for(unsigned i=0; i < ncols; i++) {
const arma::colvec current_line = huge_mat.col(i); //extract current line
//for each observation in current_line:
for(unsigned j=0; j < nrows; j++) {
//Take care of side effects first:
if( j == 0 )
inds = current_line(arma::span(1, nrows-1));
else
if( j == 1 ) {
inds(0) = current_line(0);
inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
} else
inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );
//Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
}
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
auto duration = duration_cast<seconds>( t2 - t1 ).count();
std::cout << "ARMADILLO: " << duration << " secsn";
}
c++ arrays matrix armadillo
c++ arrays matrix armadillo
edited Nov 27 '18 at 10:24
Grasshoper
asked Nov 23 '18 at 15:04
GrasshoperGrasshoper
527
527
2
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
1
Not related: A little optimisation for both codes. You can avoid the testk != j
, then subtract the term corresponding toj
.
– Damien
Nov 23 '18 at 15:29
1
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
2
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data toMat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example:double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.
– mtall
Nov 26 '18 at 7:53
2
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33
|
show 5 more comments
2
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
1
Not related: A little optimisation for both codes. You can avoid the testk != j
, then subtract the term corresponding toj
.
– Damien
Nov 23 '18 at 15:29
1
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
2
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data toMat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example:double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.
– mtall
Nov 26 '18 at 7:53
2
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33
2
2
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
1
1
Not related: A little optimisation for both codes. You can avoid the test
k != j
, then subtract the term corresponding to j
.– Damien
Nov 23 '18 at 15:29
Not related: A little optimisation for both codes. You can avoid the test
k != j
, then subtract the term corresponding to j
.– Damien
Nov 23 '18 at 15:29
1
1
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
2
2
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to
Mat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.– mtall
Nov 26 '18 at 7:53
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to
Mat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example: double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.– mtall
Nov 26 '18 at 7:53
2
2
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33
|
show 5 more comments
1 Answer
1
active
oldest
votes
The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.
To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53448997%2fwhy-is-armadillo-so-slow-compared-to-a-c-style-array-in-a-simple-row-wise-comput%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.
To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
add a comment |
The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.
To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
add a comment |
The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.
To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.
The reason is that Armadillo uses column-major ordering in mat, while your C array uses row-major ordering. This is kind of a big deal because your processor can use instruction vectorization to process multiple elements at once, where this requires contiguous memory chunks.
To verify whether this is the cause, do the same calculation but for columns instead of rows, and check the difference.
answered Nov 27 '18 at 10:56
The Quantum PhysicistThe Quantum Physicist
11.6k64596
11.6k64596
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
add a comment |
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
Thank you, I already did the recommanded modification in the edit of the original post, and it improves efficientcy.
– Grasshoper
Nov 27 '18 at 13:22
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53448997%2fwhy-is-armadillo-so-slow-compared-to-a-c-style-array-in-a-simple-row-wise-comput%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
First thing I would try is to reformulate your calculation into a matrix operation if possible. Second thing I would try is exchanging rows and columns which depending on the underlying data might leave you continuously cache missing here.
– rubenvb
Nov 23 '18 at 15:11
1
Not related: A little optimisation for both codes. You can avoid the test
k != j
, then subtract the term corresponding toj
.– Damien
Nov 23 '18 at 15:29
1
Be careful, @Damien. Computer arithmetic, especially floating-point arithmetic, is not algebraic. Order of operations can matter, so the optimization you suggest may lead to a different result being computed.
– John Bollinger
Nov 23 '18 at 15:50
2
This appears to be a case of the so-called abstraction penalty, where you "pay" for the nice syntax and easy to use features through somewhat reduced speed. In your code there is no need to shuffle data to
Mat_full
via the std::vector intermediary. You can directly access matrix data in Armadillo matrices via the .memptr() member function. Example:double* data = huge_mat.memptr()
. This means you can directly use C like access when needed.– mtall
Nov 26 '18 at 7:53
2
@Grasshoper Also take into account that Armadillo stores data in column major format. Looks like your code accesses data rows by rows. This is not a good match.
– mtall
Nov 26 '18 at 14:33