Compare two non-matching lists and identify the row with maximum matching elements
Background
I have two lists (of lists), each created by reading data from one of two address tables.
The first element in each row is the unique identifier of the list row and the remaining elements are used for address comparison.
Each list would look somewhat like this:
List 1 (cli add)
['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA']
['543', '234', '654', 'BELMONT', 'AVENUE', 'V8S3T4', 'VICTORIA', 'BR', 'CANADA']
['28', '70', 'RUSHTON', 'RD', 'M6C2X8', 'YK', 'ON', 'CANADA']
List 2 (struct add)
['7H0044', '234', '654', 'BELMONT', 'AVENUE', 'V8S3T4', 'VICTORIA', 'BC', 'CANADA']
['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA']
['7H0001', '700', 'RUSHTON', 'ROAD', 'M6C2X7', 'YORK', 'ON', 'CANADA']
['7H0034', '217', 'BONNYMUIR', 'DRIVE', 'V7S1L4', 'WEST', 'VANCOUVER', 'BC', 'CANADA']
Task Goal
My task is to compare the addresses from the test list (list 1) to the other list (list 2), and flag all the matching and non-matching records.
I am looping through each row in list 1 and comparing it with each row of list 2, element-wise. If all the elements pulled from a list 1 row are found in any row from list 2, I mark that record as 'matching' and retain the row from list 2. I have been able to identify the completely matching records.
Problem Point
The real challenge is the non-matching rows. For the non-matching records from list 1, I want to identify the most closely matching row from list 2. E.g. if a row from list 1 has matching elements in three rows from list 2, I want to pick the list 2 row which has the highest number of matching elements.
Expected Outcome
In the data shared above, the second row of list 1 (id 543) has a complete match, but the other two records do not yield a complete match.
I want to be able to create a list of non-matching records and capture the unmatched elements:
[[comparison record from list 1],[Most matching record from list 2],[non-matching elements from list 1]]
So for above shared data, I need a new list (of lists) which looks something like:
[['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA'],['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],['BR']]
[['28', '70', 'RUSHTON', 'RD', 'M6C2X8', 'YK', 'ON', 'CANADA'],['7H0001', '700', 'RUSHTON', 'ROAD', 'M6C2X7', 'YORK', 'ON', 'CANADA'],['70','RD','M6C2X8', 'YK']]
The code below gets me the results only partially. I have not been able to find a way to filter for the highest matching rows.
Code Snippet
Here is how I found the matching and non-matching records (list 1 is referred to as cli_add_fnl and list 2 as struct_add_fnl). I have also figured out how to list the unmatched elements and the count of matching elements. I just need a way to pull only the rows with the max count for each list 1 row.
### Step 4 - Identifying the matching and non matching addresses ###
validated_addresses_all = []
invalid_addresses_all = []
for cli_add in cli_add_fnl:
    comparison_cli_add = cli_add.copy()
    # removing the id column from comparison
    comparison_cli_add.pop(0)
    for struct_add in struct_add_fnl:
        matching_elements = [address_element for address_element in comparison_cli_add if address_element in struct_add]
        # capture the matching records
        if matching_elements == comparison_cli_add:
            validated_addresses_all.append(cli_add)
        else:
            # elements of the list 1 row missing from this list 2 row
            # (definition not shown in the original post; assumed here)
            nonmatching_elements = [address_element for address_element in comparison_cli_add if address_element not in struct_add]
            invalid_addresses_all.append(cli_add)
            invalid_addresses_all.append(struct_add)
            invalid_addresses_all.append(len(set(comparison_cli_add) & set(struct_add)))
            invalid_addresses_all.append(nonmatching_elements)
# remove the duplicate entries
fnl_validated_addresses = []
for add in validated_addresses_all:
    if add not in fnl_validated_addresses:
        fnl_validated_addresses.append(add)
python python-3.x
You should look into using sets: docs.python.org/3.7/library/…
– Meow
Nov 22 at 18:22
If any of the answers has helped you, please accept them as answers to help close the question. Thanks!
– BernardL
Nov 22 at 21:04
edited Nov 22 at 18:22
asked Nov 22 at 17:15
Sushant Vasishta
1 Answer
This is one way to do it, ignoring position and the first item (the id): compare the values in each row of adds against each row of struct_adds while keeping a counter of the highest number of matches seen so far. Whenever a row beats the current count, the counter is updated and the index of that row is stored; if nothing matches at all, the example below does nothing for that row. The items in add that are missing from the highest match are then collected as the differences.
The results are then appended to a list.
adds = [['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA'],
        ['543', '234', '654', 'BELMONT', 'AVENUE', 'V8S3T4', 'VICTORIA', 'BR', 'CANADA'],
        ['28', '70', 'RUSHTON', 'RD', 'M6C2X8', 'YK', 'ON', 'CANADA']]
struct_adds = [['7H0044', '234', '654', 'BELMONT', 'AVENUE', 'V8S3T4', 'VICTORIA', 'BC', 'CANADA'],
               ['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],
               ['7H0001', '700', 'RUSHTON', 'ROAD', 'M6C2X7', 'YORK', 'ON', 'CANADA'],
               ['7H0034', '217', 'BONNYMUIR', 'DRIVE', 'V7S1L4', 'WEST', 'VANCOUVER', 'BC', 'CANADA']]
results = []
for add in adds:
    match_count = 0
    match_index = 0
    for idx, struct_add in enumerate(struct_adds):
        # one True/False per address element of add, skipping the ids
        matches = [add_item in struct_add[1:] for add_item in add[1:]]
        if matches.count(True) > match_count:
            match_count = matches.count(True)
            match_index = idx
    if match_count == 0:
        pass  # no matches
    else:
        highest_match = struct_adds[match_index]
        # elements of add (without the id) missing from the best row
        differences = [i for i in add[1:] if i not in highest_match[1:]]
        results.append([add, highest_match, differences])
Or, if you want to use set operations, which should be more efficient, as suggested in the comments, you can replace the for block with:
for add in adds:
    match_count = 0
    match_index = 0
    for idx, struct_add in enumerate(struct_adds):
        matches = set(add[1:]) & set(struct_add[1:])
        if len(matches) > match_count:
            match_count = len(matches)
            match_index = idx
    if match_count == 0:
        pass  # no matches
    else:
        highest_match = struct_adds[match_index]
        differences = list(set(add[1:]) - set(highest_match[1:]))
        results.append([add, highest_match, differences])
Both yield the same results (with the set version, the order of the items inside each differences list is not guaranteed):
results
>>
[[['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA'],
['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],
['BR']],
[['543',
'234',
'654',
'BELMONT',
'AVENUE',
'V8S3T4',
'VICTORIA',
'BR',
'CANADA'],
['7H0044',
'234',
'654',
'BELMONT',
'AVENUE',
'V8S3T4',
'VICTORIA',
'BC',
'CANADA'],
['BR']],
[['28', '70', 'RUSHTON', 'RD', 'M6C2X8', 'YK', 'ON', 'CANADA'],
['7H0001', '700', 'RUSHTON', 'ROAD', 'M6C2X7', 'YORK', 'ON', 'CANADA'],
['70', 'RD', 'M6C2X8', 'YK']]]
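Note that the row with id 543 appears in this output with ['BR'] as its difference, because the sample data has 'BR' where list 2 has 'BC'. If rows with no differences should be reported separately as complete matches, the results list can be split afterwards. A minimal sketch, assuming results has the [add, highest_match, differences] shape shown above; split_results is a hypothetical helper name, and the row with id '99' is made up purely for illustration:

```python
def split_results(results):
    """Split [add, highest_match, differences] triples into complete and partial matches."""
    complete, partial = [], []
    for add, highest_match, differences in results:
        # An empty differences list means every compared element was found.
        target = partial if differences else complete
        target.append([add, highest_match, differences])
    return complete, partial


# Small hand-made sample in the same shape as the output above.
sample = [
    [['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA'],
     ['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],
     ['BR']],
    [['99', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],  # hypothetical row, not in the question's data
     ['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],
     []],
]
complete, partial = split_results(sample)
```

Here complete would hold the made-up '99' row and partial the row with id '3'.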
I should also add that, to avoid complicating things further, the loops above take the first highest match when several rows tie. This is handled by the if clause, which only updates when the count of matches is strictly greater than the current count.
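If ties matter, one option is to collect every row that reaches the highest count instead of only the first. A minimal sketch, assuming the same row layout as above (id first, address elements after); best_matches is a hypothetical helper name, and the row with id '7H0099' is made up to force a tie:

```python
def best_matches(add, struct_adds):
    """Return (count, rows): the highest match count and every row that reaches it."""
    wanted = set(add[1:])  # ignore the id in position 0
    counts = [len(wanted & set(row[1:])) for row in struct_adds]
    best = max(counts, default=0)
    if best == 0:
        return 0, []  # nothing matched at all
    return best, [row for row, c in zip(struct_adds, counts) if c == best]


add = ['3', 'V8T5G2', 'VICTORIA', 'BR', 'CANADA']
struct_adds = [
    ['7H0044', '234', '654', 'BELMONT', 'AVENUE', 'V8S3T4', 'VICTORIA', 'BC', 'CANADA'],
    ['7H0033', 'V8T5G2', 'VICTORIA', 'BC', 'CANADA'],
    ['7H0099', 'V8T5G2', 'VICTORIA', 'AB', 'CANADA'],  # hypothetical row added to force a tie
]
count, rows = best_matches(add, struct_adds)
```

Both '7H0033' and '7H0099' share three elements with the row being checked, so both come back and the caller can decide how to break the tie.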
edited Nov 22 at 18:34
answered Nov 22 at 18:22
BernardL