Preventing spaCy splitting paragraph numbers into sentences
I'm using spaCy to do sentence segmentation on texts that use paragraph numbering, for example:
text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'
I'm trying to force spaCy's sentence segmenter not to split the 3. into a sentence of its own.
At the moment, the following code returns three separate sentences:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
for sent in doc.sents:
    print("****", sent.text)
This returns:
**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
I've been trying to stop this from happening by adding a custom rule to the pipeline before the parser:
if token.text == r'\d.':
    doc[token.i + 1].is_sent_start = False
This doesn't seem to have any effect. Has anyone come across this problem before?
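For reference, a minimal sketch of a boundary component that does take effect, written against the spaCy 2.x `add_pipe` API that was current at the time (spaCy 3 instead registers components with `@Language.component` and adds them by name). The component name and the number-followed-by-period heuristic are illustrative assumptions, not the asker's actual code; note the heuristic would also suppress splits after any number ending a sentence:

import spacy

nlp = spacy.load("en_core_web_sm")

def set_custom_boundaries(doc):
    # "3." is tokenized as two tokens, "3" and ".", so comparing a single
    # token's text against the regex string r'\d.' can never match. Instead,
    # look for a number token followed by "." and keep the following token
    # inside the same sentence.
    for token in doc[:-1]:
        if token.text == "." and token.i > 0 and doc[token.i - 1].like_num:
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")

doc = nlp("3. English law takes a dim view of stealing stuff from the shops. "
          "Some may argue that this is a pity.")
for sent in doc.sents:
    print("****", sent.text)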
Tags: python, nlp, spacy, sentence
asked Nov 21 at 23:45
DanielH
What is the expected output?
– Chirag
Nov 22 at 13:56
While this does not answer the question, since it is specifically about spaCy, may I suggest my own sentence segmentation and tokenization tool, segtok, and its latest incarnation ("segtok version 2"), syntok. Neither splits sentences at enumerations, and syntok even fixes cases like "This is a sentence.And here we forgot a space." while the token stream retains the original input. It is a performant, production-ready sentence segmenter for at least English, Spanish, and German. You might want to take a look.
– fnl
Nov 23 at 10:24
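For reference, a minimal sketch of what using syntok might look like, based on its documented `segmenter.process` API; the `spacing`/`value` token attributes are as described in the syntok README and should be verified against the current docs:

# pip install syntok
import syntok.segmenter as segmenter

text = ("3. English law takes a dim view of stealing stuff from the shops. "
        "Some may argue that this is a pity.")

# process() yields paragraphs, each a list of sentences, each a list of tokens
for paragraph in segmenter.process(text):
    for sentence in paragraph:
        # each token keeps its original spacing, so joining round-trips the input
        print("****", "".join(t.spacing + t.value for t in sentence).strip())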
1 Answer
Something like this?
import spacy

nlp = spacy.load("en_core_web_sm")

texts = ["""3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity. Are you upto something?""",
         """4. It's hilarious and I think this can be more of a political moment. Don't you think so? Will Robots replace humans?"""]

for text in texts:
    doc = nlp(text)
    # Merge the first five tokens into a single token, so the period after
    # the paragraph number can no longer end a sentence.
    span = doc[0:5]
    span.merge()
    for sent in doc.sents:
        print("****", sent.text)
    print("\n")
Output:
**** 3. English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
**** Are you upto something?
**** 4. It's hilarious and I think this can be more of a political moment.
**** Don't you think so?
**** Will Robots replace humans?
Reference: span.merge()
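Note that `Span.merge` was deprecated in spaCy v2.1 and removed in v3; the retokenizer context manager is its replacement. A rough equivalent of the merge above:

# same merge of the first five tokens, using the retokenizer API (spaCy 2.1+)
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:5])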
edited Nov 23 at 5:17
answered Nov 22 at 14:20
Chirag