未加星标

Finding Triplets with Neo4j

字体大小 | |
[数据库(综合) 所属分类 数据库(综合) | 发布者 店小二04 | 时间 2017 | 作者 红领巾 ] 0人收藏点击收藏

Finding Triplets with Neo4j

A user had an interesting Neo4j question on Stack Overflow the other day:

I have two types of nodes in my graph. One type is Testplan and the other is Tag. Testplans are tagged to Tags. I want most common pairs of Tags that share the same Testplans with a Tag having a specific name. I have been able to achieve the most common Tags sharing the same Testplan with one Tag, but getting confused when trying to do it for pairs of Tags.

Their cypher query looked like this:

MATCH (kw1:Tag)<-[e:TAGGED]-(tp1:Testplan)-[e2:TAGGED]->(kw2:Tag)
WHERE kw1.name = "result"
RETURN kw1,kw2,count(tp1)
ORDER BY count(tp1) DESC

…and their results were a little off:

Kw1 kw2 count(tp1)
“result” “error” 104
“result” “prerequisites” 89
“result” “alpha” 63

What they wanted was something closer to this:

Kw1 kw2 count(tp1)
“result” “error”,”prerequisites” 70
“result” “error”,”alpha” 63

The answer to their query was this:

MATCH (kw1:Tag)<-[e:TAGGED]-(tp1:Testplan)-[e2:TAGGED]->(kw2:Tag),
(tp1)-[:TAGGED]->(kw3:Tag)
WHERE kw1.name = "result"
AND ID(kw2)<ID(kw3)
RETURN kw2, kw3,count(tp1)
ORDER BY count(tp1) DESC

They were trying to group in pairs of kw2, but what they needed to do is add a second TAGGED relationship to a kw3, so every path would guarantee to have the original tag, kw2 and kw3. I am using the ID(kw2)<ID(kw3) trick to eliminate duplicates from returning as valid paths.

This satisfied the answer, but they ran into another problem:

in place of two if I want 3/4/5 such combinations, what should be the ideal way? I tried to tailor the query without understanding much, but this is making Neo4j crash even on a high performance computer.

Cypher is a pattern matching language and it is meant to deal with paths as the core unit of work. This works great for most things, but sometimes it means doing a lot more work than necessary. It’s not a problem, what we have to do is convert our cypher query to a stored procedure and try to be smarter about the work we are doing.

I’m calling this procedure “combinations” and we will pass in the first tag, the number of tag combinations (2 for doubles, 3 for triples, 4 for quads, etc) and a limit to only see the top sets of tags.

@Procedure(name = "com.maxdemarzi.combinations", mode = Mode.READ)
@Description("CALL com.maxdemarzi.combinations(tag, number, limit) - find combinations")
public Stream<MapResult> getCombinations(@Name("tag") String tag, @Name("size") Number k, @Name("limit") Number limit) throws IOException {

The first thing we’ll want to do is find the starting Tag node:

// We start by finding the tag
Node tagNode = db.findNode(Labels.Tag, "name", tag);

I need to keep a count of all the combinations we find, so we’ll use a Map. We have to use a dynamic object as our key because we’re not really sure how many tags we will asked to keep track of in the set.

// We will keep track of the combinations as a String of node ids
Map<String, Integer> counts = new HashMap<>();

With that out of the way, we want to traverse from our tag to all of the Test Plans connected to our starting Tag.

// Next we find all the test plans tagged by this tag
for (Relationship tagged : tagNode.getRelationships(Direction.INCOMING, RelationshipTypes.TAGGED)) {
Node testPlan = tagged.getStartNode();

For each Test Plan, we want to find all the tags that belong to it, and keep these in a list. Notice I’m using Relationship::getEndNodeId() to just grab the id of the node on the other end of our relationship instead of the node object.

// Then find all the tags for this test planArrayList<Long> tags = new ArrayList<>();for (Relationship taggedToo : testPlan.getRelationships(Direction.OUTGOING, RelationshipTypes.TAGGED)) { tags.add(taggedToo.getEndNodeId());}

Cypher traverses the graph checking for “Relationship Uniqueness”. In other words it does not traverse over the same relationship twice in a single path. We don’t want to include our starting Tag in our combinations because it will always be there and we know that already. So we can remove it from our tags list>

// We need to remove our starting tag
tags.remove(tagNode.getId()); Here is where things get interesting. We are going to generate all the possible combinations for the tags found connected to our test plan. If our tags had ids of [0, 2, 1], we will get a list of [0,1],[0,2],[1,2]. Notice it will be sorted, but I’m assuming the test plans do not get tagged by the same tag twice. We could use a Set if that was the case. // Get all the sorted combinations of tags
List<long[]> list = new ArrayList<>();
combinations(k.intValue() - 1 , tags.stream().sorted().mapToLong(l -> l).toArray(), list ); Now we have to add our list of combinations to our count. We’ll cheat and use Arrays::toString on the long[] to turn the [0,1] into an actual “[0,1]”. // For each combination add it to the counts map, or increment itfor (long[] item : list) { counts.merge(Arrays.toString(item), 1, (c, one) -> c + 1);}

Once we’ve done this for all the test plans, we sort the count:

// Sort the results in descending order
List<Map.Entry<String,Integer>> results = new ArrayList<>(counts.entrySet());
results.sort( Map.Entry.<String, Integer>comparingByValue().reversed() );

…and reduce the results to just the top x given to us by the limit variable.

// Get the top x results
results = results.subList(0,Math.min(results.size(), limit.intValue()));

The last piece of the puzzle is returning the results. To do this we will turn each into a map, using the key of each result entry to find the tag nodes and populate the names of the tags.

return results.stream().map(result -> {Map<String, Object> triple = new HashMap<>();int count = 1;for (String id : result.getKey().substring(1, result.getKey().length() - 1).replace(" ", "").split(",")) { Node found = db.getNodeById(Long.valueOf(id)); String name = (String)found.getProperty("name"); triple.put("tag" + count++, name);}triple.put("count", result.getValue());return new MapResult(triple);
});

So assuming we created some sample data, we could call the stored procedure like this:

CALL com.maxdemarzi.combinations(tag, number, limit)
CALL com.maxdemarzi.combinations('first tag', 3, 10);

…and our results would look like:

{
"tag1": "second tag",
"count": 4,
"tag2": "third tag"
}
{
"tag1": "third tag",
"count": 3,
"tag2": "fourth tag"
}
{
"tag1": "second tag",
"count": 3,
"tag2": "fourth tag"
}

To find quads instead of triples, we just change the middle variable:

CALL com.maxdemarzi.combinations('first tag', 4, 10);

As always the source code is on github. Is there a better way of doing this? Probably. This was just the first thing that came to mind. If you find a smarter way, send me a pull request!

…and yes I know it’s “triples” not “triplets” but aren’t those babies cute?

本文数据库(综合)相关术语:系统安全软件

主题: UT
tags: tag,gt,lt,combinations,our,count,Tag,result
分页:12
转载请注明
本文标题:Finding Triplets with Neo4j
本站链接:http://www.codesec.net/view/561405.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(综合) | 评论(0) | 阅读(45)