Preventing spaCy splitting paragraph numbers into sentences























I'm using spaCy to do sentence segmentation on texts that use paragraph numbering, for example:



text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'


I'm trying to force spaCy's sentence segmenter not to split the 3. into a sentence of its own.



At the moment, the following code returns three separate sentences:



import spacy

nlp = spacy.load("en_core_web_sm")

text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
# Print each sentence found by the segmenter
for sent in doc.sents:
    print("****", sent.text)


This returns:



**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.


I've been trying to stop this from happening by passing a custom rule into the pipeline before the parser:



if token.text == r'\d.':
    doc[token.i+1].is_sent_start = False


This doesn't seem to have any effect. Has anyone come across this problem before?
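One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: `==` compares the token text with the literal characters of the string, so a pattern written as a raw string is never treated as a regular expression. The snippet below contrasts the two checks; `is_paragraph_number` and `PARAGRAPH_NUMBER` are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical helper (not part of spaCy): one or more digits followed by a full stop.
PARAGRAPH_NUMBER = re.compile(r"\d+\.")


def is_paragraph_number(token_text):
    """Return True if the whole token text looks like a paragraph number such as '3.'."""
    return PARAGRAPH_NUMBER.fullmatch(token_text) is not None


# Plain equality compares characters one by one; the pattern is not interpreted.
print("3." == r"\d.")
# A regular-expression match on the whole string interprets the pattern.
print(is_paragraph_number("3."))
print(is_paragraph_number("3.5"))
```

Even with a regular expression, the rule can only fire if the tokenizer keeps the number and the full stop together as one token, which is worth verifying by printing the token texts of the doc.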
































  • What is the expected output?
    – Chirag
    Nov 22 at 13:56










  • While this does not answer the question, which is about spaCy, I can suggest my own sentence segmentation and tokenization tool, segtok, and its latest incarnation ("segtok version 2"), syntok. Neither splits sentences at enumerations, and syntok even fixes cases like "This is a sentence.And here we forgot a space." while the token stream retains the original input, and it is a very performant, production-ready, high-quality sentence segmenter for at least English, Spanish, and German. You might want to take a look.
    – fnl
    Nov 23 at 10:24

















python nlp spacy sentence
















asked Nov 21 at 23:45









DanielH



























1 Answer













Something like this?



text = ["""3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity. Are you upto something?""",
        """4. It's hilarious and I think this can be more of a political moment. Don't you think so? Will Robots replace humans?"""]
for i in text:
    doc = nlp(i)
    # Merge the first five tokens so the paragraph number stays attached to its sentence.
    # Note: Span.merge() was deprecated in spaCy v2.1 and removed in v3 (use doc.retokenize()).
    span = doc[0:5]
    span.merge()
    for sent in doc.sents:
        print("****", sent.text)
    print("\n")


Output:



**** 3. English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
**** Are you upto something?


**** 4. It's hilarious and I think this can be more of a political moment.
**** Don't you think so?
**** Will Robots replace humans?


Reference: span.merge()
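The merge above hard-codes the first five tokens. As a rough alternative sketch, the rule from the question can be expressed as a small function that, given the token texts of a paragraph, returns the indices that should not start a new sentence. It handles both the case where the tokenizer emits `3.` as one token and the case where it emits `3` and `.` separately, since which one happens depends on the tokenizer rules. `indices_after_paragraph_number` is a hypothetical helper, not a spaCy API, and it only looks at a number at the very start of the text.

```python
import re

# Hypothetical patterns for illustration: "3" and "3." style markers.
NUMBER = re.compile(r"\d+")
NUMBER_DOT = re.compile(r"\d+\.")


def indices_after_paragraph_number(tokens):
    """Return token indices that should not start a sentence because they
    continue, or directly follow, a leading paragraph number."""
    if not tokens:
        return []
    if NUMBER_DOT.fullmatch(tokens[0]):
        marker_len = 1  # the marker is a single token such as "3."
    elif NUMBER.fullmatch(tokens[0]) and len(tokens) > 1 and tokens[1] == ".":
        marker_len = 2  # the marker is split into "3" and "."
    else:
        return []
    # Block the rest of the marker and the first token after it, if present.
    return [i for i in range(1, marker_len + 1) if i < len(tokens)]


print(indices_after_paragraph_number(["3.", "English", "law"]))
print(indices_after_paragraph_number(["3", ".", "English", "law"]))
print(indices_after_paragraph_number(["English", "law", "takes"]))
```

Inside a custom pipeline component placed before the parser, as the question describes, each returned index `i` could then be applied with `doc[i].is_sent_start = False`, passing `[t.text for t in doc]` as the token list.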














edited Nov 23 at 5:17

answered Nov 22 at 14:20

Chirag





























