EXTRACT: Interactive Extraction of Metadata

All you need is to drag and drop the bookmarklet to your browser's bookmark bar. The EXTRACT bookmarklet and step-by-step visual instructions are available in the About tab.

⚠ Your bookmark bar may not be shown by default in your web browser. Please see "Where do I find the browser bookmark bar?" on how to enable it.

EXTRACT can be used in two ways:

Entity extraction: To extract entities, select the text of interest and click the bookmarklet. A summary popup like the one shown in "How can I use the EXTRACT popup for curation?" (i.e. next FAQ) will appear. This popup allows you to easily copy the information about the identified entities to the clipboard or save it to a tab-delimited file.

⚠ The selected text is limited to at most 1024 characters. If more text is selected, it will be truncated.
Full page tagging: If you click the bookmarklet without having selected any text, EXTRACT will process the complete page and highlight the entities identified in it. This can help you to more quickly identify the relevant parts of the page.

⚠ While the page look and feel will be kept as close to the original as possible, some functionality of the page, such as search boxes and scripts, may no longer work. To re-enable those, simply refresh the page.

The EXTRACT popup has the functionality explained below:

By hovering the mouse cursor over the text tags or the table rows you can visually inspect which words have been identified as which entities.

To allow you to easily collect annotations in tabular form, e.g. in an Excel spreadsheet, two buttons allow you to either copy the information to the clipboard or save it to a tab-delimited file. When doing so, the selected text and the address of the source webpage will also be included for provenance.
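As a rough illustration of what such a tab-delimited record could look like, the Python sketch below writes entity rows together with the selected text and the source address. The column order and the function name `entities_to_tsv` are assumptions made for this example; they are not the documented layout of the file EXTRACT produces.

```python
import csv
import io


def entities_to_tsv(entities, selected_text, source_url):
    """Serialize entities plus provenance as tab-delimited text.

    NOTE: the column order is a hypothetical layout chosen for this
    sketch, not the documented format of the EXTRACT download file.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for name, entity_type, identifier in entities:
        # Repeat the provenance on every row so that each row stands
        # alone when pasted into a spreadsheet.
        writer.writerow([name, entity_type, identifier, selected_text, source_url])
    return buffer.getvalue()


rows = [("sediments", "-27", "ENVO:00002007")]
tsv = entities_to_tsv(rows, "sediments were richer in Fe", "https://example.org/page")
print(tsv, end="")
```

Keeping the provenance on every row, rather than in a separate header, means rows collected from different pages can be merged into one spreadsheet without losing track of their source.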

EXTRACT is capable of identifying:

Environment descriptive terms from Environment Ontology (such as desert, lagoon and forest)
Organism mentions from NCBI Taxonomy
Tissue terms from BRENDA Tissue Ontology
Disease mentions from Disease Ontology and the Mammalian Phenotype Ontology
Biological process, cellular component, and molecular function mentions from Gene Ontology
Small chemical molecule mentions from PubChem
Protein-coding and non-coding RNA (ncRNA) genes, based on those supported by the STRING and RAIN resources, respectively.

Click the pop-out icon (⇧) in the top-right corner to open a larger version in a separate browser window or tab.

The full-page view of the EXTRACT popup assists navigation when larger pieces of text have been processed or a high number of entities have been identified.

This functionality is particularly useful if you prefer to first identify all the relevant sections within, for example, a full-text article and subsequently extract the annotations from all of them.

EXTRACT can only process HTML pages; however, these do not have to be on the web.

If your text of interest is not in an HTML page, e.g. a Microsoft Word document or an Excel spreadsheet, you can save this as a local HTML file on your computer and open it in your browser. You can now process it with EXTRACT, just like any web page.

Alternatively, you can copy and paste the text of interest into the Demo EXTRACT form to process it. In this mode only the entity extraction is possible (not the full-page tagging).

PDF files viewed within the browser are still PDF and not HTML pages. Thus they cannot be processed directly by EXTRACT, but must first be converted to HTML files.

The annotation of microbial samples with rich metadata is an essential prerequisite to comparative data analysis, and it is crucial that its content, syntax and terminology are standardized.

To address this need, the microbial and molecular ecology communities have initiated the development of standards, checklists and detailed guidelines for reporting sample metadata (see here and here). These include metadata like geographic location, date and time, sampling procedure and sampling environment. The value of quantitative parameters in numerical approaches is clear; however, qualitative descriptors also have much to offer.

Well-developed descriptions of, for example, the environment from which a sample was collected are of particular interest as they can provide researchers with useful insights into a microbial community which often cannot feasibly be captured in quantitative form.

Even when such descriptions are present, they are of limited use if they exist in unstructured, free-text format. Annotation of sample metadata with ontology terms (where applicable) complements free-text descriptions with semantically controlled descriptors, which avoid the confusion caused by synonyms and have explicitly defined relations to other terms.

The Environment Ontology (ENVO), for example, is specifically concerned with representing qualitative knowledge about environment types to capture the source environment of, for example, microbiomes and natural history museum specimens.

For more details on how to annotate a sample's environment with ENVO terms, see the next section as well as the latest Environment Ontology Annotation Guidelines.

Whereas quantitative information about an environment (for example, pH and salinity) is easy to represent in a structured form, qualitative knowledge about the type of environment is less trivial.

The Environment Ontology (ENVO) is specifically designed to address the challenge of capturing the source environment of, for example, microbiomes and natural history museum specimens. ENVO terms fall into three distinct hierarchies, namely biome, environmental feature and environmental material. To fully describe the environment of a sample, the Genomic Standards Consortium recommends that the environment annotation of a sample should feature at least one term from each hierarchy of the ontology.
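The recommendation of at least one term per hierarchy can be checked mechanically once each chosen term has been assigned to its hierarchy. The sketch below is only an illustration: the function name `missing_hierarchies` is hypothetical, and the hierarchy assigned to each identifier is supplied by the curator in this example rather than looked up in ENVO.

```python
REQUIRED_HIERARCHIES = ("biome", "environmental feature", "environmental material")


def missing_hierarchies(annotation):
    """Return the required hierarchies not covered by the annotation.

    `annotation` maps an ENVO term identifier to the hierarchy it
    belongs to. The mapping has to be supplied by the curator (for
    example after inspecting the term's top-level term); this sketch
    does not query ENVO itself.
    """
    covered = set(annotation.values())
    return [h for h in REQUIRED_HIERARCHIES if h not in covered]


# Example annotation; the hierarchy labels are assumed for illustration.
sample_annotation = {
    "ENVO:00000247": "environmental feature",
    "ENVO:00002007": "environmental material",
}
print(missing_hierarchies(sample_annotation))  # ['biome']
```

Here the annotation would still need a biome term before it satisfies the recommendation.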

It is thus important that the user is able to easily follow up on the ENVO terms identified by EXTRACT. We therefore provide a link for each term in the entities table, which allows the user to:

inspect the exact definition of an ENVO term and assess whether it matches the term meaning in the text
inspect the ENVO term hierarchy and find out the top-level term that this term belongs to. This way it is possible to determine whether a term is an environmental material, an environmental feature or a biome, so as to fill in the corresponding metadata fields

For more details, see the latest Environment Ontology Annotation Guidelines.

In addition to suggesting Environment Ontology terms to describe the environmental context of a sample, EXTRACT can suggest relevant terms from other ontologies for the annotation of host-associated and/or disease-related samples.

These terms are able to represent the organisms, tissues and diseases mentioned. Organism terms can, for example, be used to describe the host in the case of animal- or plant-associated samples. Tissue terms can provide further detail on the anatomical part of the host from which a sample was collected. Lastly, disease terms can describe the health status of the individual from which a sample was taken.

Organism names are mapped to the NCBI Taxonomy database, tissue names to the BRENDA Tissue Ontology, and disease names to the Disease Ontology. For more details, see "Which types of entities can EXTRACT identify?"

Yes, the full-page tagging can highlight entities within an entire article, thereby making it easy to spot the text sections that are most likely to contain information to curate.

To tag a full page, click the EXTRACT bookmarklet without having selected any text. The entities identified will then be highlighted within the text, using different colors for different entity types.

As an example, we tagged the page for the ant species: Anochetus grandidieri Forel, 1891 (Plazi taxonomic treatment).

In the screenshot of the tagged page, you can easily spot the environments in which this ant species occurs (in bright green) and key anatomical features of the species and its castes (in teal).

Yes, the full page tagging can be used to this end. See e.g. this screenshot of the annotated listing of metagenomics studies in the Genomes OnLine Database.

By tagging the list, you get a visual overview with:

A rough categorization of outdoor environment, host-associated or disease-related samples in a list
An indication of how informative the listed sample descriptions are
An idea of the benefit of integrating EXTRACT in in-house curation tools and workflows

For more details on how to implement the last point, see section "Can I invoke the EXTRACT tagger programmatically?".

EXTRACT does not perform Natural Language Processing in a manner that would allow you to automatically extract associations between entities. However, the full-page tagging functionality can be useful for identifying the sections that contain certain types of associations (see section "Can EXTRACT suggest sections to study in a full-text article?" for more details).

Also, developers can access EXTRACT through its API to utilize it as the Named Entity Recognition component of a larger NLP system (see section "Can I invoke the EXTRACT tagger programmatically?").

EXTRACT supports the latest versions of the Google Chrome, Mozilla Firefox, Safari, Opera and Internet Explorer browsers. In Internet Explorer, saving information to a tab-delimited file is not available. EXTRACT has not been tested in older browsers, and we thus cannot promise that it will function with these.

Your bookmark bar may not be shown by default in your web browser. Please use the relevant "View" menu option to enable it (e.g. "View, Always Show Bookmarks Bar" in Google Chrome, "View, Toolbars, Bookmarks Toolbar" in Mozilla Firefox, "View, Show Favorites Bar" in Safari).

If you have many bookmarks in your bookmark bar, EXTRACT might simply not be visible, depending on the position at which you dropped it. Please try enlarging your browser window and/or use the "view more" button at the right-most end of the bookmark bar.

EXTRACT has been designed against and tested on a range of web pages relevant to metagenomics sample, biomedical record, and biodiversity data curation. These include literature abstracts, full-text articles, Wikipedia entries, metagenomics project pages, and centralized biodiversity knowledge resources. However, some complex HTML pages, such as Google Drive documents, are presently not supported by EXTRACT.

The direct "Copy to clipboard" functionality is supported by the latest Google Chrome, Mozilla Firefox and Internet Explorer browsers.

For other browsers, you may copy and paste the relevant text from the alert message (via your standard browser functionality) or use the "Save to file" function.

The EXTRACT download file (entities.tsv) uses UNIX-style newlines, which some editors on Windows (including Notepad) do not understand.

If this happens please use a different text editor, for example Notepad++, to view all terms in the retrieved file.

For more information, please see the comparison of text editors on Wikipedia, which covers their newline support.
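If switching editors is not an option, the file can instead be rewritten with Windows-style line endings. The following is a minimal sketch; the function names are hypothetical and the UTF-8 encoding is an assumption about the downloaded file.

```python
def to_windows_newlines(text):
    """Convert UNIX-style (LF) line endings to Windows-style (CRLF)."""
    # Normalize any existing CRLF first so that they are not doubled.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")


def convert_file(source_path, target_path):
    """Write a CRLF copy of source_path to target_path."""
    # newline="" disables Python's own newline translation, so the
    # characters written are exactly what to_windows_newlines() produced.
    # ASSUMPTION: the file is UTF-8 encoded.
    with open(source_path, "r", encoding="utf-8", newline="") as src:
        content = src.read()
    with open(target_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(to_windows_newlines(content))
```

For example, `convert_file("entities.tsv", "entities_windows.tsv")` would leave the original download untouched and produce a copy that editors expecting CRLF line endings display correctly.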

When clicked, EXTRACT tries to connect to its server and load its basic script. If for any reason (e.g. network connectivity or heavily loaded browser) this does not succeed within 6.5 seconds, EXTRACT reports that it is not available. The same happens when EXTRACT performs a full-page tagging and a server response is not received within 30 seconds.

Please try closing some of your browser tabs/windows (if too many are open). Also try again later to see if it is due to intermittent network problems.

If the problem persists, do not hesitate to contact us, mentioning your browser, operating system and the address of the web page you are trying to process.

In case you are connected to the internet from behind a firewall, please see the next point.

EXTRACT offers its functionality via HTTPS encrypted communication and not via plain HTTP. If you are behind a firewall, you may need to contact your local system administrator to have access enabled for EXTRACT (https://extract.jensenlab.org). For further information do not hesitate to contact us.

No, EXTRACT does not store any of the text sent to the server. We do, however, collect access statistics to be able to document the usage of EXTRACT.

The EXTRACT bookmarklet provides you with a simple way to use EXTRACT on any page to highlight terms of interest and extract annotations in a structured form via the popup. Here we describe how, as a web developer or resource provider, you can provide this functionality to all users of your web pages, without users needing to install the bookmarklet.

When embedded in other web pages, EXTRACT can help suggest relevant terms to annotate samples with based on free-text descriptions within the page. For example, it could be embedded in the submission interface for a database, to help researchers annotate their own samples with appropriate terms during the submission process. However, it could equally well be used to support retroactive addition of structured metadata to existing database records, for which only free-text descriptions exist.

You may add the HTML code below in your own web pages to embed the EXTRACT popup. The example shows how the popup for a given piece of text can be shown within an iframe HTML element by using the ExtractPopup method from the API; parameter customization can tailor it to your needs. In a typical use case, this method would be invoked programmatically from JavaScript.

A summary of the arguments to invoke ExtractPopup is given below:

Parameter	Type	Content
document	required	the plain text to be tagged
entity_types	required	e.g. "-2+-25+-26+-27" (concatenate with "+" to use multiple); the type identifiers stand for: >0: genes/proteins from a specific organism (the number is the organism's NCBI Taxonomy identifier); -1: PubChem Compound identifiers; -2: NCBI Taxonomy entries; -21: Gene Ontology biological process terms; -22: Gene Ontology cellular component terms; -23: Gene Ontology molecular function terms; -25: BRENDA Tissue Ontology terms; -26: Disease Ontology terms; -27: Environment Ontology terms
url	optional	the URL of the web page the text was taken from

Note: please also add "auto_detect=0", unless you would like identification of genes/proteins of the automatically detected organism to occur.

Note: HTTPS is also supported (use https://tagger.jensenlab.org/)
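To illustrate how the parameters in the table above fit together, the Python sketch below assembles a URL that could serve as the address of such an iframe. The endpoint path `/ExtractPopup` on the tagger host is an assumption made by analogy with the other API methods, and `build_popup_url` is a hypothetical helper, not part of EXTRACT.

```python
from urllib.parse import urlencode

# ASSUMPTION: ExtractPopup is served from the tagger host under this
# path; the exact address is not stated in this text.
POPUP_ENDPOINT = "https://tagger.jensenlab.org/ExtractPopup"


def build_popup_url(document, entity_types, url=None, auto_detect=False):
    """Build an ExtractPopup URL from the documented parameters."""
    params = {
        "document": document,
        # Joining with a space makes urlencode() emit "+" separators,
        # matching the "-2+-25+-26+-27" form shown in the table.
        "entity_types": " ".join(str(t) for t in entity_types),
    }
    if not auto_detect:
        # Switch off tagging of genes/proteins of the detected organism.
        params["auto_detect"] = "0"
    if url is not None:
        params["url"] = url
    return POPUP_ENDPOINT + "?" + urlencode(params)


popup_url = build_popup_url("sediments near a volcano", [-2, -25, -26, -27])
print(popup_url)
```

Passing the type identifiers as a list and letting `urlencode` handle the escaping avoids hand-building query strings, where a literal "+" could otherwise be escaped as "%2B" by mistake.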

The document text used in the example above is taken from Forget et al., 2010, Geobiology (PubMed).

For further information feel free to contact us.

In addition to the high-level ExtractPopup web method used in the previous section, EXTRACT offers a robust and fine-grained Application Programming Interface (API) to its named entity recognition engine. The core methods of this REST API are presented below:

GetEntities

GetEntities (http://tagger.jensenlab.org/GetEntities) returns the unique list of the entities identified in the document. The entities belong to the specified entity_types and the response follows the specified format.

Request:

http://tagger.jensenlab.org/GetEntities?document=Both+samples+were+dominated+by+Zetaproteobacteria+Fe+oxidizers.+This+group+was+most+abundant+at+Volcano+1,+where+sediments+were+richer+in+Fe+and+contained+more+crystalline+forms+of+Fe+oxides.&entity_types=-2+-25+-26+-27&format=tsv

Response:

Zetaproteobacteria	-2	580370
sediments	-27	ENVO:00002007
Volcano	-27	ENVO:00000247

Parameter	Type	Content
document	required	the plain or html-formatted text to be tagged
entity_types	required	e.g. "-2+-25+-26+-27" (concatenate with "+" to use multiple); the type identifiers stand for: >0: genes/proteins from a specific organism (the number is the organism's NCBI Taxonomy identifier); -1: PubChem Compound identifiers; -2: NCBI Taxonomy entries; -21: Gene Ontology biological process terms; -22: Gene Ontology cellular component terms; -23: Gene Ontology molecular function terms; -25: BRENDA Tissue Ontology terms; -26: Disease Ontology terms; -27: Environment Ontology terms
format	optional	"tsv" or "xml" (default)
auto_detect	optional	"0" or "1" (default)

An example Perl client that demonstrates the GetEntities functionality is available here (its accompanying sample input file can be found here).
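For readers who prefer Python, the sketch below builds a GetEntities request URL and parses a tsv response of the shape shown above. The helper names are hypothetical, and the interpretation of the three columns as name, entity type and identifier is inferred from the example response rather than taken from a formal specification. No network call is made; the response text is copied from the example.

```python
from urllib.parse import urlencode

GET_ENTITIES = "https://tagger.jensenlab.org/GetEntities"


def build_get_entities_url(document, entity_types, fmt="tsv"):
    """Build a GetEntities request URL from the documented parameters."""
    params = {
        "document": document,
        # A space-joined list is encoded as "-2+-25+-26+-27".
        "entity_types": " ".join(str(t) for t in entity_types),
        "format": fmt,
    }
    return GET_ENTITIES + "?" + urlencode(params)


def parse_entities_tsv(body):
    """Parse a tsv response into (name, entity_type, identifier) tuples."""
    entities = []
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, entity_type, identifier = line.split("\t")
        entities.append((name, int(entity_type), identifier))
    return entities


# Response text copied from the example above.
sample_response = (
    "Zetaproteobacteria\t-2\t580370\n"
    "sediments\t-27\tENVO:00002007\n"
    "Volcano\t-27\tENVO:00000247\n"
)
for entity in parse_entities_tsv(sample_response):
    print(entity)
```

The URL returned by `build_get_entities_url` could be fetched with `urllib.request.urlopen` and the decoded body passed to `parse_entities_tsv`.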

GetHTML

GetHTML (http://tagger.jensenlab.org/GetHTML) returns the input HTML document with the identified entities tagged. The tagged entities belong to the specified entity_types.

Request:

http://tagger.jensenlab.org/GetHTML?document=Both+samples+were+dominated+by+Zetaproteobacteria+Fe+oxidizers.+This+group+was+most+abundant+at+Volcano+1,+where+sediments+were+richer+in+Fe+and+contained+more+crystalline+forms+of+Fe+oxides.&entity_types=-2+-25+-26+-27

Response (*):

Both samples were dominated by Zetaproteobacteria Fe oxidizers. This group was most abundant at Volcano 1, where sediments were richer in Fe and contained more crystalline forms of Fe oxides.

(*: inspect the page source for the exact response contents)

Parameter	Type	Content
document	required	the plain or html-formatted text to be tagged
entity_types	required	e.g. "-2+-25+-26+-27" (concatenate with "+" to use multiple); the type identifiers stand for: >0: genes/proteins from a specific organism (the number is the organism's NCBI Taxonomy identifier); -1: PubChem Compound identifiers; -2: NCBI Taxonomy entries; -21: Gene Ontology biological process terms; -22: Gene Ontology cellular component terms; -23: Gene Ontology molecular function terms; -25: BRENDA Tissue Ontology terms; -26: Disease Ontology terms; -27: Environment Ontology terms
auto_detect	optional	"0" or "1" (default)

Note: please also add "auto_detect=0", unless you would like identification of genes/proteins of the automatically detected organism to occur (feature under development).

Note: HTTPS is also supported (use https://tagger.jensenlab.org/)

A queuing system on the EXTRACT server ensures the handling of multiple simultaneous requests. HTTP POST requests to the tagger are recommended instead of HTTP GET requests. The maximum HTTP POST request data size supported is 10MB.
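As a sketch of the recommended POST usage, the Python example below prepares a GetHTML request with the parameters in the request body and checks it against the size limit before sending. It assumes that the server accepts form-encoded POST bodies carrying the same parameter names as the GET examples, and that "10MB" means 10 × 1024 × 1024 bytes; the helper name is hypothetical.

```python
import urllib.request
from urllib.parse import urlencode

GET_HTML = "https://tagger.jensenlab.org/GetHTML"
MAX_POST_BYTES = 10 * 1024 * 1024  # ASSUMPTION: "10MB" read as 10 MiB


def build_post_request(endpoint, document, entity_types, auto_detect=False):
    """Prepare (but do not send) an HTTP POST request to the tagger."""
    params = {
        "document": document,
        "entity_types": " ".join(str(t) for t in entity_types),
        "auto_detect": "1" if auto_detect else "0",
    }
    # ASSUMPTION: parameters are sent as a form-encoded request body.
    data = urlencode(params).encode("ascii")
    if len(data) > MAX_POST_BYTES:
        raise ValueError("request body exceeds the 10MB limit")
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return urllib.request.Request(endpoint, data=data)


request = build_post_request(GET_HTML, "<p>sediments at Volcano 1</p>", [-2, -27])
print(request.get_method(), len(request.data), "bytes")
# To send it: urllib.request.urlopen(request).read()
```

Checking the encoded size on the client side gives an immediate, descriptive error for oversized documents instead of a failed request.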

A more detailed description of the tagger API is available here.

The document text used in the examples above is taken from Forget et al., 2010, Geobiology (PubMed).