Preventing spaCy splitting paragraph numbers into sentences

I'm using spaCy to do sentence segmentation on texts that use paragraph numbering, for example:

text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'

I'm trying to force spaCy's sentence segmenter not to split the 3. into a sentence of its own.

At the moment, the following code returns three separate sentences:

import spacy

nlp = spacy.load("en_core_web_sm")

text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
for sent in doc.sents:
    print("****", sent.text)


This returns:



**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.


I've been trying to stop this from happening by passing a custom rule into the pipeline before the parser:



if token.text == r'\d.':
    doc[token.i+1].is_sent_start = False


This doesn't seem to have any effect. Has anyone come across this problem before?
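For what it's worth, the comparison token.text == r'\d.' tests for the literal characters \d. rather than applying a regex, so a rule like this needs re.match and has to be registered before the parser. A minimal sketch against the spaCy v2-era pipeline API (the component name, the regex, and the like_num fallback are illustrative assumptions, not spaCy built-ins):

import re

import spacy

nlp = spacy.load("en_core_web_sm")

def set_paragraph_boundaries(doc):
    # Suppress a sentence start after anything that looks like a
    # paragraph number. Two cases, depending on how the tokenizer
    # split the input: "3." as one token, or "3" and "." as two.
    for token in doc[:-1]:
        if re.match(r"^\d+\.$", token.text):
            doc[token.i + 1].is_sent_start = False
        elif token.text == "." and token.i > 0 and doc[token.i - 1].like_num:
            doc[token.i + 1].is_sent_start = False
    return doc

# The parser respects is_sent_start values set by earlier components,
# so the function must run before it (spaCy v2 add_pipe signature).
nlp.add_pipe(set_paragraph_boundaries, before="parser")

doc = nlp("3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.")
for sent in doc.sents:
    print("****", sent.text)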










python nlp spacy sentence

asked Nov 21 at 23:45
DanielH

  • What is the expected output?
    – Chirag
    Nov 22 at 13:56

  • While this does not answer the question, since it is specifically about spaCy, may I suggest my own sentence segmentation and tokenization tool, segtok, and its latest incarnation, "segtok version 2", syntok. Neither splits sentences at enumerations, and syntok even fixes cases like "This is a sentence.And here we forgot a space.", while the token stream retains the original input. It is a performant, production-ready, high-quality sentence segmenter for at least English, Spanish, and German. You might want to take a look.
    – fnl
    Nov 23 at 10:24
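For anyone wanting to try it, a minimal usage sketch based on syntok's README (the segmenter.process generator and the token .spacing/.value attributes follow that documentation; details may differ across versions):

import syntok.segmenter as segmenter

text = "3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."

for paragraph in segmenter.process(text):
    for sentence in paragraph:
        # token.spacing is the whitespace that preceded token.value,
        # so joining both roughly reproduces the original sentence text.
        print("****", "".join(token.spacing + token.value for token in sentence).strip())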

















1 Answer

Something like this?



text = ["""3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity. Are you upto something?""", 
"""4. It's hilarious and I think this can be more of a political moment. Don't you think so? Will Robots replace humans?"""]
for i in text:
doc = nlp(i)
span = doc[0:5]
span.merge()
for sent in doc.sents:
print("****", sent.text)
print("n")


Output:



**** 3. English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
**** Are you upto something?


**** 4. It's hilarious and I think this can be more of a political moment.
**** Don't you think so?
**** Will Robots replace humans?


Reference: span.merge()
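One caveat: Span.merge() was deprecated in later spaCy v2 releases in favour of the Doc.retokenize() context manager, so on a newer install the merge step would look roughly like this:

# Equivalent merge using the retokenizer API that replaced Span.merge():
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:5])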






edited Nov 23 at 5:17
answered Nov 22 at 14:20
Chirag