Elasticsearch Python: Search analyzer
I have a DataFrame like this:
df = pd.DataFrame({'FK':['QFf','TFzs'],'title':['World Fair 2018','ZUGER MESSE']})
I want Elasticsearch to find all documents whose title is similar to each title in my DataFrame, and to calculate a score for each match. When matching titles, the year is not important: "World Fair 2018" and "World Fair 2017" should be treated as exactly the same. Below is part of my code, but it doesn't give me a good matching score.
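    import pandas as pd
    from elasticsearch.helpers import scan

    # `es` is an existing Elasticsearch client and `row` is one row of df;
    # the code that creates them is not shown.
    raw_results = []
    try:
        for event in scan(es,
                          query={
                              "query": {
                                  "bool": {
                                      "must": [
                                          {"query_string": {"default_field": "data.event.title",
                                                            "query": row['title']}},
                                      ]}}}):  # any further scan() arguments are not shown
            # Keep only the first five hits.
            if len(raw_results) < 5:
                raw_results.append(event)
            else:
                break
    except Exception as exc:
        # The original except clause is not shown; this one only reports the error.
        print(exc)

    # Flatten each hit into one record: the event fields plus its id and score.
    events_list = []
    if len(raw_results) > 0:
        for item in raw_results:
            event_id = item['_id']
            record_score = item['_score']
            event_source = item['_source']['data']['event']
            event_source['event_id'] = event_id
            event_source['score'] = record_score
            events_list.append(event_source)
    dframe2 = pd.DataFrame(events_list)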
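The "first five hits, then flatten" steps above can also be sketched without a running cluster. This is only an illustration, not the code from the question: `top_hits` is a hypothetical helper, and `fake_hits` is made-up stand-in data shaped like the hits that `scan` yields.

```python
from itertools import islice


def top_hits(hits, limit=5):
    """Return at most `limit` flattened event records from an iterable of hits."""
    rows = []
    # islice stops consuming the iterable after `limit` items.
    for hit in islice(hits, limit):
        # Copy the event dict so the original hit is not mutated.
        record = dict(hit["_source"]["data"]["event"])
        record["event_id"] = hit["_id"]
        record["score"] = hit.get("_score")
        rows.append(record)
    return rows


# Stand-in hits with made-up ids and scores (not real query results).
fake_hits = (
    {"_id": str(i), "_score": 10.0 - i,
     "_source": {"data": {"event": {"title": "world fair"}}}}
    for i in range(8)
)

rows = top_hits(fake_hits)
print(len(rows), rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame` in the same way as `events_list`.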
The output I get looks like this:
    FK    title            event_id  title               score
    QFf   World Fair 2018  EfB       world fair          44.2
    QFf   World Fair 2018  n77       world fair          44.1
    QFf   World Fair 2018  5cY       world fair 2017     25.84
    TFzs  ZUGER MESSE      NEQ       Styles Messe        20.12
    TFzs  ZUGER MESSE      mBc       Hannover Messe      20.06
    TFzs  ZUGER MESSE      Y9S       WunderWelten-Messe  20.0
As the output shows, the score of "world fair 2017" dropped because of the year difference. The score also seems to be affected by the length of the title. I need to run this code against about 6 million documents. How can I improve the code above?
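To make the "year is not important" requirement concrete, here is a minimal sketch of what ignoring the year could mean on the query side. It assumes a year is any standalone four-digit number from 1900 to 2099; `strip_year` and `YEAR_RE` are hypothetical names for illustration and are not part of the code above.

```python
import re

# Assumption: a "year" is a standalone four-digit number from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def strip_year(title):
    """Remove year tokens from a title and collapse the leftover whitespace."""
    without_year = YEAR_RE.sub(" ", title)
    return " ".join(without_year.split())


for title in ["World Fair 2018", "world fair 2017", "ZUGER MESSE"]:
    print(repr(strip_year(title)))
```

This only changes the text sent in the query. Titles stored in the index would still contain the year, so an indexed "world fair 2017" may still score differently from "world fair" unless the indexed side is handled in an equivalent way.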
elasticsearch sentence-similarity
edited 8 hours ago
asked yesterday
Narges