Numpy transformation to normal distribution
I have an array of data. I checked if it was normally distributed:
import sys
import scipy
from scipy import stats
from scipy.stats import mstats
from scipy.stats import normaltest
# Read one float per line from the file given on the command line
Data = []
for line in open(sys.argv[1]):
    line = line.strip()
    Data.append(float(line))
print(scipy.stats.normaltest(Data))
The output was: (36.444648754208075, 1.2193968690198398e-08)
Then, I wrote a small script to normalise the data:
import sys
import numpy as np
fileopen = open(sys.argv[1])
UntransformedArray = []
for line in fileopen:
    line = float(line.strip())
    UntransformedArray.append(line)
TransformedArray = (UntransformedArray - np.mean(UntransformedArray)/np.std(UntransformedArray))
NewList = TransformedArray.tolist()
for i in NewList:
    print(i)
And then I checked for normality again using the first script and the output was
(36.444648754209595, 1.2193968690189117e-08).
...the same as the previous score, and still not normally distributed.
Is one of my scripts wrong?
I should also mention that the mean of my data is 0.056 and the values range from 0.014 to 0.171 (85 observations). I'm not sure whether the fact that the numbers are so small matters.
A sample of the untransformed and transformed data:
Untransformed:
0.055
0.074
0.049
0.067
0.038
0.037
0.045
0.041
Transformed data:
-2.13696814254
-2.11796814254
-2.14296814254
-2.12496814254
-2.15396814254
-2.15496814254
-2.14696814254
Edit 1:
When I edit the code slightly to account for the parentheses being in the wrong place:
TransformedMean = (UntransformedArray - np.mean(UntransformedArray))
TransformedArray = (TransformedMean/np.std(UntransformedArray))
NewList = TransformedArray.tolist()
for i in NewList:
    print(i)
The output I get is different:
Example:
-0.0385683544143
0.705333390576
-0.273484694937
0.431264326632
-0.704164652563
-0.743317375984
However, when I check for normality:
(36.444648754241328, 1.2193968689995659e-08)
It is still not normally distributed (and is still the exact same score as the other times)?
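To illustrate the difference between the two expressions, here is a minimal sketch using the eight sample values quoted above. The variable names `shifted` and `zscored` are illustrative and do not appear in the original scripts. The first expression only shifts every value by one constant; the second is a z-score. Neither changes the shape of the distribution.

```python
import numpy as np

# The eight untransformed sample values quoted in the question.
x = np.array([0.055, 0.074, 0.049, 0.067, 0.038, 0.037, 0.045, 0.041])

# Division binds tighter than subtraction, so this subtracts the single
# number mean/std from every value: a pure shift.
shifted = x - np.mean(x) / np.std(x)

# With the parentheses in the intended place this is a z-score:
# shift by the mean, then rescale by the standard deviation.
zscored = (x - np.mean(x)) / np.std(x)

print(shifted - x)                    # the same constant for every element
print(zscored.mean(), zscored.std())  # approximately 0 and 1
```

Both results are affine functions of `x`, so the ordering and relative spacing of the values are unchanged.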
Edit 2:
I then tried a different method of normalising the data:
import sys
import scipy
from scipy import stats
from scipy.stats import boxcox
Data = [(float(line.strip())) for line in open(sys.argv[1])]
scipy.stats.boxcox(Data)
I get the error: TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'float'
Edit 3: Following a comment from a user, the problem turned out to be that I had not understood the difference between normalising values and normalising a distribution.
Edited code:
import sys
import numpy as np
fileopen = open(sys.argv[1])
UntransformedArray = []
for line in fileopen:
    line = float(line.strip())
    UntransformedArray.append(line)
# Natural log of every value; requires strictly positive data
List1 = np.log(UntransformedArray)
for i in List1:
    print(i)
Checking for normality:
(4.0435072214905938, 0.13242304287973003)
(This works in this case; whether it works in general depends on the skewness of the data.)
Edit 4: Or using a Box-Cox transformation:
import sys
import scipy
from scipy import stats
from scipy.stats import boxcox
import numpy as np
Data = []
for line in open(sys.argv[1]):
    line = line.strip()
    Data.append(float(line))
# boxcox returns (transformed values, fitted lambda); data[0] is the transformed array
data = scipy.stats.boxcox(np.array(Data))
for i in data[0]:
    print(i)
Check for normality: (2.9085877478631956, 0.23356523218452238)
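The log transform of Edit 3 and the Box-Cox transform of Edit 4 are closely related. The sketch below uses a synthetic, strictly positive sample as a stand-in for the real data file, which is an assumption since the file itself is not shown. It shows that `scipy.stats.boxcox` returns both the transformed values and the fitted lambda, and that fixing lambda at 0 gives the natural log.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive sample (Box-Cox requires x > 0):
# evenly spaced normal quantiles pushed through exp().
n = 85
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
data = np.exp(-3.0 + 0.5 * z)

# With lmbda left unset, boxcox estimates lambda and returns both the
# transformed values and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(data)
print(fitted_lambda)

# With lmbda fixed at 0 the Box-Cox transform reduces to the natural log,
# and only the transformed array is returned.
as_log = stats.boxcox(data, lmbda=0.0)
print(np.allclose(as_log, np.log(data)))
```

Letting `boxcox` fit lambda is the more general choice; the plain log is the special case that happens to suit right-skewed data like this.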
python numpy normalization
edited Nov 30 '15 at 15:25
asked Nov 30 '15 at 13:23
Tom
Don't you have a parenthesis problem in the TransformedArray calculation? It should be (UntransformedArray - np.mean(UntransformedArray)) / np.std(UntransformedArray)
– joao
Nov 30 '15 at 13:30
This is what I have: TransformedArray = (UntransformedArray - np.mean(UntransformedArray)/np.std(UntransformedArray)), and it seems to run without complaining. I don't get any error about parentheses.
– Tom
Nov 30 '15 at 13:42
Division (/) has higher precedence than subtraction (-). So you are computing mean/std first, and only then is the subtraction applied. I believe your parentheses are misplaced there.
– joao
Nov 30 '15 at 13:51
Thanks. I've changed the script slightly (see edit). Is it possibly something wrong with the checking for normality script? The reason I ask is that now I've given the checking for normality script two different lists, (for example, my original transformed output, where all the numbers start with -2.XXX, and in my edit, where the numbers are e.g. 0.43, -0.7 etc), and I still get the exact same output from checking for normality script?
– Tom
Nov 30 '15 at 14:21
Re. boxcox: Try scipy.stats.boxcox(np.array(Data)) (and add import numpy as np at the top of your script if you don't already have it). By the way, scipy.stats.boxcox(Data) works in newer versions of scipy. What version are you using? Run import scipy; print(scipy.__version__) to find out.
– Warren Weckesser
Nov 30 '15 at 15:16
3 Answers
As expected, subtracting the mean and rescaling to unit variance does not change the shape of the distribution. normaltest correctly returns the same output in both cases, telling you that your data is not normally distributed.
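A quick way to check this is to run `normaltest` on a sample before and after standardizing it. The sketch below uses synthetic right-skewed data as a stand-in for the 85 observations, which is an assumption since the real file is not shown. `normaltest` is built from the sample skewness and kurtosis, and both are unaffected by shifting or positive rescaling, so the two results agree up to rounding error.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for the 85 observations (an assumption;
# the real data file is not shown in the question).
rng = np.random.default_rng(0)
data = rng.lognormal(mean=-3.0, sigma=0.5, size=85)

# Subtract the mean and rescale to unit variance.
standardized = (data - data.mean()) / data.std()

stat_raw, p_raw = stats.normaltest(data)
stat_std, p_std = stats.normaltest(standardized)

print(stat_raw, p_raw)
print(stat_std, p_std)  # agrees with the line above up to rounding error
```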
answered Nov 30 '15 at 14:19
thomas
I agree with Thomas. To be more precise: you are standardizing the distribution of your array, and this does not change the shape of the distribution. You might want to use the numpy.histogram() function to get an impression of the distributions.
I think you have fallen prey to the confusing double usage of 'normalization'. On the one hand, normalization is used to describe the standardization of variables (getting variables onto the same scale, which is what you did). On the other hand, normalization is used to describe attempts to change the shape of a probability distribution (scipy.stats.normaltest() is used to check the shape of such distributions). One easy strategy for making a distribution more normal is a log transformation. numpy.log() might do the trick here, but only if the original distribution is not too skewed.
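As a rough sketch of both suggestions, the example below builds a hypothetical right-skewed sample (not the asker's data), prints histogram bin counts before and after `numpy.log()`, and compares the sample skewness. The sample is constructed so that its log is symmetric, which is an assumption chosen to make the effect easy to see.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: evenly spaced normal quantiles pushed
# through exp(), so the log of the data is symmetric by construction.
n = 85
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
data = np.exp(-3.0 + 0.5 * z)

logged = np.log(data)

# Bin counts give a quick, text-only impression of each shape.
counts_raw, _ = np.histogram(data, bins=10)
counts_log, _ = np.histogram(logged, bins=10)
print(counts_raw)
print(counts_log)

# Skewness is clearly positive before the log and close to zero after it.
print(stats.skew(data), stats.skew(logged))
```

On real data the log will rarely remove skewness this cleanly, so it is worth re-running `normaltest` on the transformed values.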
answered Nov 30 '15 at 15:10
Dominix
This was really useful, thank you, particularly for clarifying my understanding. I have edited the question with the updated code that I used.
– Tom
Nov 30 '15 at 15:16
glad it helped!
– Dominix
Nov 30 '15 at 20:52
I came across the same problem. My data, like yours, was not normal, and I had to transform it to a normal distribution. To do this you can use a normal score transform, for which there are different methods, as described here. You can also use these formulas. I have written some Python code that transforms your list of elements to a normal distribution as follows:
from scipy.stats import rankdata, norm
x = [0.055, 0.074, 0.049, 0.067, 0.038, 0.037, 0.045, 0.041]
# Rank the values, scale the ranks into (0, 1), and map them to normal quantiles
newX = norm.ppf(rankdata(x)/(len(x) + 1))
print(newX)
output:
[ 0.4307273 1.22064035 0.1397103 0.76470967 -0.76470967 -1.22064035
-0.1397103 -0.4307273 ]
After this transformation the new data closely follows a normal distribution, as you can see from a Q-Q plot:
from scipy import stats
import matplotlib.pyplot as plt
ax4 = plt.subplot(111)
res = stats.probplot(newX, plot=plt)
plt.show()
edited Nov 21 at 21:53
answered Nov 21 at 21:31
Sara
Don't you have a parenthesis problem in the TransformedArray calculation? It should be (UntransformedArray - np.mean(UntransformedArray)) / np.std(UntransformedArray)
– joao
Nov 30 '15 at 13:30
This is what I have: TransformedArray = (UntransformedArray - np.mean(UntransformedArray)/np.std(UntransformedArray)), and it seems to run without complaining. I don't get any error about parentheses.
– Tom
Nov 30 '15 at 13:42
Division (/) has higher precedence than subtraction (-). You are therefore computing mean/std first, and only then is the subtraction applied. I believe your parentheses are misplaced there.
– joao
Nov 30 '15 at 13:51
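The precedence point in the comment above can be checked with a short sketch. The eight values are the sample from the question; the variable names shifted_only and z_scores are made up for illustration.

```python
import numpy as np

x = np.array([0.055, 0.074, 0.049, 0.067, 0.038, 0.037, 0.045, 0.041])

# Without the inner parentheses, mean/std is evaluated first and the
# resulting single number is subtracted from every value: a pure shift.
shifted_only = x - np.mean(x) / np.std(x)

# With the parentheses, each value is centred and then scaled (a z-score).
z_scores = (x - np.mean(x)) / np.std(x)

print(np.std(shifted_only), np.std(x))      # same spread: the data only moved
print(np.mean(z_scores), np.std(z_scores))  # approximately 0 and 1
```

Neither version raises an error, which is why the original script ran without complaint.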
Thanks. I've changed the script slightly (see edit). Is something possibly wrong with the normality-checking script? I ask because I have now given it two different lists (my original transformed output, where all the numbers start with -2.XXX, and the output from my edit, where the numbers are e.g. 0.43, -0.7, etc.), and I still get exactly the same result from it.
– Tom
Nov 30 '15 at 14:21
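A likely explanation for the identical results: scipy.stats.normaltest is built from the sample skewness and kurtosis, and both are unchanged when every value is shifted by a constant or divided by a positive constant. Standardizing changes the location and scale of the data, not its shape, so the test statistic stays the same. The sketch below illustrates this on made-up skewed data (a lognormal sample of 85 values, chosen only to resemble the question's setup; it is not the asker's data).

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
# Placeholder data: 85 small positive, right-skewed values (an assumption).
data = rng.lognormal(mean=-3.0, sigma=0.5, size=85)

shifted = data - np.mean(data) / np.std(data)          # misplaced parentheses
standardized = (data - np.mean(data)) / np.std(data)   # proper z-scores

for name, arr in [("raw", data), ("shifted", shifted), ("standardized", standardized)]:
    stat, p = normaltest(arr)
    print(name, stat, p)  # the three statistics agree up to rounding error
```

To change the outcome of the test, the transformation has to change the shape of the distribution, which a shift or rescaling never does.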
Re. boxcox: Try scipy.stats.boxcox(np.array(Data)) (and add import numpy as np at the top of your script if you don't already have it). By the way, scipy.stats.boxcox(Data) works in newer versions of scipy. What version are you using? Run import scipy; print(scipy.__version__) to find out.
– Warren Weckesser
Nov 30 '15 at 15:16
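A minimal sketch of the boxcox suggestion in the comment above. The Data list here is placeholder lognormal data, not the asker's file, and Box-Cox requires every value to be strictly positive. When no lambda is supplied, scipy.stats.boxcox returns both the transformed array and the lambda it estimated.

```python
import numpy as np
import scipy
from scipy import stats

print(scipy.__version__)

rng = np.random.default_rng(0)
# Placeholder data (an assumption): 85 strictly positive, right-skewed values.
Data = rng.lognormal(mean=-3.0, sigma=0.5, size=85).tolist()

# Convert the list to an array first, as the comment suggests.
transformed, lam = stats.boxcox(np.array(Data))

print(lam)                            # estimated Box-Cox lambda
print(stats.normaltest(Data))         # test on the raw data
print(stats.normaltest(transformed))  # test on the transformed data
```

Unlike standardizing, Box-Cox is a nonlinear transform, so it can change the skewness of the data and therefore the result of the normality test.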