The LinkedIn knowledge graph provides input signals to machine learning models and data insight pipelines that fuel various LinkedIn features. I came across this post, which provides an in-depth analysis of how the LinkedIn knowledge graph is built.
What are the entities in the LinkedIn knowledge graph?
The entities are members, jobs, titles, skills, companies, geographical locations, schools, etc. These entities and the relationships among them form the ontology of the professional world and are used to improve recommender systems, search, monetisation, and consumer products.
How is LinkedIn’s knowledge graph constructed?
LinkedIn’s knowledge graph is created from user-generated content and data extracted from the internet, both of which can be noisy and contain duplicates. The knowledge graph also needs to scale as new members join, new jobs are posted, titles change, skills evolve, etc.
How is machine learning being used to build the knowledge graph?
Machine learning is used to construct entity taxonomies, infer entity relationships, learn data representations, extract insights, and interactively acquire data from users. LinkedIn’s knowledge graph is a dynamic graph, meaning that new entities and relationships can be added and existing relationships can also change.
Describe the construction of entity taxonomy.
An entity taxonomy consists of the identity of an entity and its attributes. Entities are created in two ways:
Organic entities. These are generated by users, and their attributes are created and maintained by users
Auto-created entities. These can be members, but also entities such as skills or titles. By mining member profiles and utilising external data sources and human validation, tens of thousands of skills, titles, locations, companies, etc. can be created, to which members can then be mapped
There are around 450 million members, 190 million job listings, 9 million companies, 600+ degrees, and so on.
There are two parts to entity attributes:
Relationships to other entities in a taxonomy
Characteristic features not in any taxonomy
For example, a company entity has attributes that refer to other entities such as members, skills, and companies. It also has attributes such as its logo, revenue, and URL that do not refer to any entity in a taxonomy. The former relationships to other entities represent edges in the LinkedIn knowledge graph. Deriving the latter involves text extraction, data ingestion from the search engine, data integration from external sources, and other crowdsourcing-based methods.
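To make the distinction concrete, here is a minimal sketch of a node with both kinds of attributes. All names (`Entity`, `add_edge`, the relation labels) are illustrative, not LinkedIn's actual schema; relational attributes become edges in the graph, while characteristic features stay as plain values on the node:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str
    taxonomy: str                       # e.g. "company", "skill", "title"
    # relational attributes: (relation, target entity id) pairs -> graph edges
    edges: list = field(default_factory=list)
    # characteristic features not in any taxonomy, e.g. logo URL, revenue
    features: dict = field(default_factory=dict)

def add_edge(graph: dict, source: Entity, relation: str, target_id: str) -> None:
    """Record a relational attribute as a directed edge in the graph."""
    source.edges.append((relation, target_id))
    graph.setdefault(source.entity_id, []).append((relation, target_id))

# A company entity with both kinds of attributes (toy data).
graph = {}
acme = Entity("company:acme", "company", features={"logo": "acme.png"})
add_edge(graph, acme, "in_industry", "industry:software")
add_edge(graph, acme, "employs", "member:123")
```

The point of the split is that only the edges participate in graph traversal; the characteristic features travel with the node.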
Every entity attribute has a confidence score, either computed by a machine learning model or assigned 1 if the attribute is human-verified.
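The scoring rule is simple enough to sketch directly (function name and clamping behaviour are my assumptions, not from the post):

```python
def attribute_confidence(ml_score: float, human_verified: bool) -> float:
    """Human-verified attributes get confidence 1.0; otherwise use the
    model's score, clamped to a probability in [0, 1]."""
    if human_verified:
        return 1.0
    return max(0.0, min(1.0, ml_score))

# Example: a model-scored attribute vs. a human-verified one.
assert attribute_confidence(0.7, human_verified=False) == 0.7
assert attribute_confidence(0.3, human_verified=True) == 1.0
```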
How does LinkedIn clean up user-generated organic entities?
By using rules to identify inaccurate or problematic organic entities. Problematic entities could be entities with meaningless names, incomplete attributes, stale content, and so on.
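A rule-based filter along these lines might look like the following sketch. The three rules mirror the categories above; the specific thresholds and required fields are illustrative assumptions, not LinkedIn's actual rules:

```python
import re

def is_problematic(entity: dict) -> bool:
    """Flag organic entities that match simple quality rules (illustrative)."""
    name = entity.get("name", "")
    # Rule 1: meaningless name (empty, too short, or containing no letters)
    if len(name.strip()) < 2 or not re.search(r"[A-Za-z]", name):
        return True
    # Rule 2: incomplete attributes (hypothetical required fields missing)
    required = {"name", "industry", "location"}
    if not required.issubset(entity):
        return True
    # Rule 3: stale content (assumed cutoff: no update in over two years)
    if entity.get("days_since_update", 0) > 730:
        return True
    return False
```

In practice each rule would likely be tuned per entity type rather than applied globally.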
How does LinkedIn generate auto-created entities?
There are 4 stages:
Generate candidates. Each entity has a canonical name, and entity candidates are common phrases in member profiles and job descriptions
Disambiguate entities. A phrase can have different meanings, so LinkedIn developed a soft clustering algorithm to group phrases. An ambiguous phrase can appear in multiple clusters and represent different entities
De-duplicate entities. Multiple phrases can represent the same entity if they are synonyms of each other. Using word embeddings, words that are similar to each other can be clustered together to de-duplicate entities
Translate entities into other languages. Linguistic experts manually translate the top entities with high member coverage into other languages
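The de-duplication step can be sketched as a greedy clustering over phrase embeddings. This is a toy stand-in, assuming we already have an embedding per phrase and a similarity threshold; LinkedIn's actual algorithm is not described in this detail:

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def deduplicate(phrases: list, embeddings: dict, threshold: float = 0.9) -> dict:
    """Greedy single pass: a phrase joins the first canonical entity whose
    embedding is similar enough, otherwise it starts a new canonical entity.
    Returns a mapping phrase -> canonical phrase."""
    canonical = []   # list of (phrase, embedding) for canonical entities
    mapping = {}
    for p in phrases:
        e = embeddings[p]
        for cp, ce in canonical:
            if cosine(e, ce) >= threshold:
                mapping[p] = cp
                break
        else:
            canonical.append((p, e))
            mapping[p] = p
    return mapping
```

With real word embeddings, "software engineer" and "software developer" would land in the same cluster while "nurse" would not.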
Describe the entity relationship inference.
Some entity relationships are generated by members; these are known as “explicit”. Other entity relationships are predicted; these are known as “inferred”. Not all explicit entity relationships are trustworthy: one common problem is member mistakes, where members map themselves to an incorrect entity.
LinkedIn has a content processing framework to infer entity relationships. As an example, explicit skills entered by members are used to infer other skills they might have. A binary classifier is trained for each kind of entity relationship (belongs or not). Inferred relationships are also recommended to members for feedback; if a member accepts, the inferred relationship becomes explicit.
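As a toy illustration of skill inference, here is a co-occurrence-based sketch: estimate how often skill B appears alongside skill A on profiles, and infer B when that conditional frequency clears a threshold. This is a deliberately simple stand-in for the binary classifiers the post describes; the threshold and feature choice are assumptions:

```python
from collections import Counter
from itertools import combinations

def train_cooccurrence(profiles: list):
    """Count skill occurrences and co-occurrences across member profiles."""
    single, pair = Counter(), Counter()
    for skills in profiles:
        skills = set(skills)
        single.update(skills)
        for a, b in combinations(sorted(skills), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return single, pair

def infer_skills(explicit: set, single: Counter, pair: Counter,
                 threshold: float = 0.5) -> set:
    """Infer skills whose conditional frequency given an explicit skill
    clears the threshold. Inferred skills would then be recommended to the
    member and become explicit only on acceptance."""
    inferred = set()
    for a in explicit:
        for (x, b), count in pair.items():
            if x == a and b not in explicit and count / single[a] >= threshold:
                inferred.add(b)
    return inferred
```

A real system would use a trained classifier with richer features (profile text, titles, industry) rather than raw co-occurrence counts.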
Describe data representation and insight discovery.
We can embed the knowledge graph into a latent space. As a result, the latent vector of an entity captures the semantics of multiple entity taxonomies and multiple entity relationships. For example, skills and titles can be embedded into the same latent space, showing, for each job title, which skills are closely related to it. With this knowledge graph and data representation, new insights can be discovered easily. For example, we can connect members to certain skills that are mapped to many different jobs.
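Once titles and skills live in the same latent space, "which skills are close to this title" reduces to a nearest-neighbour lookup. A minimal sketch with toy 2-d vectors (real embeddings would come from a trained model and have hundreds of dimensions):

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_skills(title_vec: list, skill_vecs: dict, k: int = 2) -> list:
    """Rank skills by cosine similarity to a title embedding in the shared
    latent space and return the top k."""
    ranked = sorted(skill_vecs,
                    key=lambda s: cosine(title_vec, skill_vecs[s]),
                    reverse=True)
    return ranked[:k]

# Toy example: a "data engineer"-like title vector against three skills.
title = [1.0, 0.1]
skills = {"python": [0.9, 0.2], "nursing": [0.1, 1.0], "sql": [0.8, 0.1]}
top = nearest_skills(title, skills, k=2)   # technical skills rank highest
```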