How to group a dataframe and summarize over subgroups of consecutive numbers in Python?
I have a dataframe with a column containing IDs and another column containing numbers:

df1 = {'ID':[400, 400, 400, 400, 400, 400, 500, 500, 500, 500],
       'Number':[1, 2, 3, 4, 8, 9, 22, 23, 26, 27]}

You may note that each ID has its corresponding series of consecutive numbers in the column "Number". For example, ID 400 contains a series of length 4 {1, 2, 3, 4} and another of length 2 {8, 9}.

I'd like to obtain, for each ID, the average length of its series. In this example:

df2 = {'ID':[400, 500], 'avg_length':[3, 2]}

Any ideas will be much appreciated!
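For reference, here is the same example built as actual pandas objects (a minimal setup; the construction below is an assumption about how df1 and df2 are created, the values are taken from the question):

import pandas as pd

# Input: each ID with its numbers
df1 = pd.DataFrame({'ID': [400, 400, 400, 400, 400, 400, 500, 500, 500, 500],
                    'Number': [1, 2, 3, 4, 8, 9, 22, 23, 26, 27]})

# Desired output: average length of the consecutive runs per ID
df2 = pd.DataFrame({'ID': [400, 500], 'avg_length': [3, 2]})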
python pandas pandas-groupby group-summaries
asked Nov 21 at 16:29 by Facundo Iannello
2 Answers
Here is one way; it uses groupby twice:

# A new run starts whenever the gap to the previous Number is greater than 1
df1['tmp'] = (df1.Number - df1.Number.shift() > 1).cumsum()

# Count the length of each run, then average the run lengths within each ID
df1.groupby(['ID', 'tmp']).Number.count().groupby(level=0).mean().reset_index(name='avg_length')

2.29 ms ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    ID  avg_length
0  400           3
1  500           2

Option 2: without using groupby twice (this still uses the tmp column created above):

df1.groupby('ID').tmp.apply(lambda x: x.value_counts().mean()).reset_index(name='avg_length')

2.25 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
answered Nov 21 at 16:48, edited Nov 21 at 17:00, by Vaishali (accepted)
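One caveat with the approach above: the run boundary is computed over the whole Number column, so if the last Number of one ID happened to be followed by a consecutive Number of the next ID, the two runs would be merged. A minimal per-ID variant (a sketch; the column name tmp and the data are reused from above):

import pandas as pd

df1 = pd.DataFrame({'ID': [400, 400, 400, 400, 400, 400, 500, 500, 500, 500],
                    'Number': [1, 2, 3, 4, 8, 9, 22, 23, 26, 27]})

# Gap to the previous Number within each ID; a run starts whenever the gap
# is not exactly 1 (the first row of each ID has a NaN gap, which also
# starts a new run), so runs can never span two IDs.
df1['tmp'] = df1.groupby('ID')['Number'].diff().ne(1).cumsum()

# Size of each run, averaged per ID
out = (df1.groupby(['ID', 'tmp']).size()
          .groupby(level=0).mean()
          .reset_index(name='avg_length'))
print(out)
#     ID  avg_length
# 0  400         3.0
# 1  500         2.0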
groupby + cumsum + value_counts

You can use groupby with a custom function:

df = pd.DataFrame({'ID':[400, 400, 400, 400, 400, 400, 500, 500, 500, 500],
                   'Number':[1, 2, 3, 4, 8, 9, 22, 23, 26, 27]})

def mean_count(x):
    # Label each run of consecutive numbers via cumsum, count the size
    # of each run with value_counts, and return the mean run length.
    return (x - x.shift()).ne(1).cumsum().value_counts().mean()

res = df.groupby('ID')['Number'].apply(mean_count).reset_index()

print(res)

    ID  Number
0  400     3.0
1  500     2.0
answered Nov 21 at 16:56 by jpp
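If you want the output column to be named avg_length, as in the desired df2, you can rename it afterwards (a small sketch reusing the res computed above):

# Rename the aggregated column to match the desired output
res = res.rename(columns={'Number': 'avg_length'})
print(res)
#     ID  avg_length
# 0  400         3.0
# 1  500         2.0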