Sunday, March 02, 2014

Cleaning UMLS data and Loading into Graph


The little UMLS ontology I am building needs to support two basic features in its user interface - findability and navigability. I now have a reasonable solution for the findability part, and I am planning to use Neo4j (a graph database) for the navigability part.

As before, the nodes are extracted from the MRCONSO table. The relationships between nodes are extracted from the MRREL table. Both SQL queries are shown below:

mysql> select CUI, STR from MRCONSO
...    where LAT = 'ENG'
...    into outfile '/tmp/cuistr.csv'
...    fields terminated by '\t'
...    lines terminated by '\n';
Query OK, 7871075 rows affected (27.03 sec)

mysql> select CUI1, CUI2, RELA from MRREL 
...    into outfile '/tmp/cuirel.csv' 
...    fields terminated by '\t'
...    lines terminated by '\n';
Query OK, 58024739 rows affected (1 min 17.78 sec)

The Neo4j community seems to have standardized on Michael Hunger's batch-import tool for loading data into Neo4j. It takes tab-separated files for the nodes and relationships as input, and writes the graph out to an embedded database. Each node file should specify a node ID and one or more properties, separated by tabs. Each relationship file should specify the start node, end node, relationship name, and zero or more relationship properties, also separated by tabs.

Since my node file (cuistr.csv) was normalized (one row per synonym), I needed to transform it into a (cui, list(str)) format. I used MRJob (a Python-based Map-Reduce framework from Yelp that can run jobs on Hadoop and Amazon EMR, although I just ran mine locally) to write a little Map-Reduce job to do this.

# syns_aggregator_job.py
from mrjob.job import MRJob

class SynsAggregatorJob(MRJob):
  """
  Groups unique synonyms by CUI. 
  Input format: (CUI,DESCR)
  Output format: (CUI,[DESCR,...])
  """

  def mapper(self, key, value):
    (cui, descr) = value.split("\t")
    yield cui, descr

  def reducer(self, key, values):
    uniqSyns = set()
    for value in values:
      uniqSyns.add(value)
    # print tab-separated output directly (rather than yielding) so the
    # redirected stdout contains plain (CUI, [syn, ...]) lines
    print "%s\t%s" % (key, list(uniqSyns))

if __name__ == "__main__":
  SynsAggregatorJob.run()

I ran this locally with the following command, which aggregated the 7,871,075 input records into a file cuistr_agg.csv with 2,880,385 records.

sujit@tsunami:umls$ python syns_aggregator_job.py \
    /path/to/cuistr.csv > /path/to/cuistr_agg.csv 

I also built another job to remove edges that refer to non-existent nodes. Notice that in the SQL I only retrieved English names (LAT = 'ENG') for the nodes, but there is no corresponding filter on the MRREL query, so the relationship file can refer to CUIs that do not exist in my node file. This step is actually unnecessary, because the batch-import tool checks for and skips such rows, but I include it here anyway because it seems to me a nice way to remove rows that refer to non-existent nodes without having to look anything up in a dictionary. If you are trying to replicate this, you can skip this step.

# rels_filter_job.py
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.compat import get_jobconf_value

class RelsFilterJob(MRJob):
  """
  Removes records from CUIREL where either node in a relation
  does not exist in CUISYN. Needs to be run twice - first run
  removes one non-existent CUI, second run removes second non
  existent CUI.
  Input format: (CUI, SYN_LIST) - from CuiSynsJob OR
                (CUI1, REL, CUI2)   - from cuirels.csv
  Output format: (CUI1, CUI2, REL)
  """

  def mapper_init(self):
    self.cui_idx = int(get_jobconf_value("cui_idx"))

  def mapper(self, key, value):
    cols = value.split("\t")
    if len(cols) == 2:
      # from the output of SynsAggregatorJob
      (cui, payload) = cols
      yield (cui, "XXX")
    else:
      # from cuirel.csv
      yield (cols[self.cui_idx], value)

  def reducer(self, key, values):
    # if one of the records in the reduced set has value XXX 
    # then all the values (except the XXX one) are good
    include = False
    vallist = []
    for value in values:
      if value == 'XXX':
        include = True
        continue
      vallist.append(value)
    if include:
      for value in vallist:
        print value
    
if __name__ == "__main__":
  RelsFilterJob.run()

I ran the above job twice, first to remove relationship rows that had non-existent source CUIs, and then to remove ones with non-existent target CUIs. The cui_idx jobconf parameter specifies which CUI column to check. Here are the commands:

sujit@tsunami:umls$ python rels_filter_job.py \
    --jobconf cui_idx=0 \
    /path/to/cuistr_agg.csv /path/to/cuirel.csv > \
    /path/to/cuirel_filt_left.csv
sujit@tsunami:umls$ python rels_filter_job.py \
    --jobconf cui_idx=1 \
    /path/to/cuistr_agg.csv /path/to/cuirel_filt_left.csv > \
    /path/to/cuirel_filt_right.csv

This resulted in a much less dramatic reduction, from 58,024,739 records in the source cuirel.csv file to 58,021,093 records in the cuirel_filt_right.csv target. However, as mentioned above, this step is unnecessary (and time-consuming): we can provide the cuirel.csv file to batch-import directly and it will do the right thing.

Batch-import did not run for me out of the box. In order to make it run, I had to go through the instructions in the README file multiple times, as well as do several searches on Neo4j's mailing list on Google Groups. I describe below what I had to do to make it work with my data.

My input files are cuistr_agg.csv (for the nodes) and cuirel_filt_right.csv (for the relationships). I needed to put headers on both of them to tell batch-import what the property names are and which fields should be looked up through an index. This is because Neo4j internally uses longs to refer to node IDs - since my unique key for a node is the CUI (a string field), batch-import creates a Lucene index to map the CUI to the internal node ID. Here are the first 10 rows from both files, showing the headers - the fields are separated by tabs. The header creates a Lucene index called "concepts" that maps the string field "CUI" to the internal Neo4j node ID.

# cuistr_agg.csv
cui:string:concepts  syns
C0000005             ['(131)I-MAA', '(131)I-Macroaggregated Albumin']
C0000039             ['Dipalmitoylglycerophosphocholine', ...]
C0000052             ['1,4-alpha-D-Glucan:1,4-alpha-D-glucan ...]
C0000074             ['1 Alkyl 2 Acylphosphatidates', ...]
C0000084             ['1 Carboxyglutamic Acid', ...]
C0000096             ['Isobutyltheophylline', ...]
C0000097             ['Methylphenyltetrahydropyridine (substance)', ...]
C0000098             ['1 Methyl 4 phenylpyridine', ...]
C0000102             ['a- Naphthylamine', '1 Naphthylamine', ...]
...

# cuirel_filt_right.csv
cui:string:concepts  cui:string:concepts  rela
C0000039             C0000039             entry_version_of
C0000039             C0000039             has_entry_version
C0000039             C0000039             has_permuted_term
C0000039             C0000039             has_permuted_term
C0000039             C0000039             has_permuted_term
C0000039             C0000039             has_permuted_term
C0000039             C0000039             has_sort_version
C0000039             C0000039             has_sort_version
C0000039             C0000039             has_translation
...
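
For reference, here is a minimal sketch of one way to prepend these header lines to the two files. The script name and the *_hdr.csv output paths are placeholders rather than part of my actual workflow; adjust them to your setup.

# prepend_headers.py - writes headered copies of the node and
# relationship files for batch-import (file paths are placeholders)
import shutil

def prepend_header(header, infile, outfile):
  """Write the header line, then stream the original file after it."""
  with open(outfile, "w") as fout:
    fout.write(header + "\n")
    with open(infile, "r") as fin:
      shutil.copyfileobj(fin, fout)

if __name__ == "__main__":
  prepend_header("cui:string:concepts\tsyns",
                 "/path/to/cuistr_agg.csv",
                 "/path/to/cuistr_agg_hdr.csv")
  prepend_header("cui:string:concepts\tcui:string:concepts\trela",
                 "/path/to/cuirel_filt_right.csv",
                 "/path/to/cuirel_filt_right_hdr.csv")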

To download and compile batch-import, run the following commands:

sujit@tsunami:Downloads$ git clone https://github.com/jexp/batch-import.git
sujit@tsunami:Downloads$ cd batch-import
sujit@tsunami:batch-import$ mvn clean compile assembly:single

My first attempt to run the importer just hung. I needed to add the following two properties to the batch.properties file supplied with batch-import.

# create lucene index "concepts" for exact lookup
batch_import.node_index.concepts=exact

# input CSVs don't have quoted fields. Apparently this speeds
# things up considerably since it allows use of a simpler CSV
# parser.
batch_import.csv.quotes=false

Finally, batch-import sets an upper limit on the length of a property value (possibly for performance) in Chunker.BUFSIZE. Since I was using JSON-ified lists for synonyms, this field can be very long, and the import kept failing until I increased Chunker.BUFSIZE from 32*1024 to 128*1024. I had to rebuild the JAR (mvn assembly:single) after this change. The following command created my Neo4j database in target/db and loaded my two files into it.

sujit@tsunami:batch-import$ java -server -Dfile.encoding=UTF-8 -Xmx4G \
    -jar target/batch-import-jar-with-dependencies.jar \
    target/db /path/to/cuistr_agg.csv /path/to/cuirel_filt_right.csv

Using Existing Configuration File
............................
Importing 2880384 Nodes took 117 seconds 
..................................................... 58903 ms for 10000000
..................................................... 194346 ms for 10000000
..................................................... 419848 ms for 10000000
..................................................... 274616 ms for 10000000
..................................................... 507095 ms for 10000000
..................................................... 
Importing 58021093 Relationships took 1764 seconds 

Total import time: 1901 seconds 

In order to verify that the database was built correctly, I copied the contents of target/db over to /var/lib/neo4j/data/graph.db, the data directory of a Neo4j installation I had set up using apt-get. Unfortunately, there was a version mismatch, so the database was unreadable. To get the correct version of Neo4j, I looked at the POM file of batch-import (neo4j.version was set to 1.9) and found a tarball download of the same version here. Installation consisted of exploding the tarball and starting the server with bin/neo4j start.

sujit@tsunami:opt$ sudo tar xvzf neo4j-community-1.9.6-unix.tar.gz
sujit@tsunami:opt$ cd neo4j-community-1.9.6
sujit@tsunami:neo4j-community-1.9.6$ bin/neo4j start

The server exposes a Web Admin client (similar to the Solr Admin client) at port 7474 (http://localhost:7474/webadmin/). The dashboard shows 2,880,385 nodes, 3,375,083 properties, 58,021,093 relationships and 653 relationship types, which matches what we put in.
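
As an extra sanity check, here is a minimal sketch that pulls the same counts through Neo4j 1.9's Cypher REST endpoint (/db/data/cypher). It assumes the requests library is installed; the script name is a placeholder, and the relationship count can take a while over 58 million relationships.

# check_counts.py - sanity check node and relationship counts via the
# legacy Cypher REST endpoint (Neo4j 1.9); script name is a placeholder
import json
import requests

CYPHER_URL = "http://localhost:7474/db/data/cypher"

def run_query(query):
  """POST a Cypher query to the REST endpoint and return the result rows."""
  resp = requests.post(CYPHER_URL,
                       data=json.dumps({"query": query, "params": {}}),
                       headers={"Content-Type": "application/json"})
  resp.raise_for_status()
  return resp.json()["data"]

if __name__ == "__main__":
  # should match the node count reported by the dashboard
  print "nodes: %s" % run_query("START n=node(*) RETURN count(n)")[0][0]
  # should match the relationship count (slow over 58M relationships)
  print "rels: %s" % run_query("START r=relationship(*) RETURN count(r)")[0][0]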

That's all I have for today. Next week I hope to learn more about Neo4j, specifically its Cypher Query Language, and see if I can model some common use cases with it.

4 comments (moderated to prevent spam):

Chaitanya said...

What version of UMLS did you use ?

Sujit Pal said...

Hi Chaitanya, I downloaded the latest release around the same timeframe I wrote this post, so I would guess release 2014A.

Ritwik Banerjee said...

I was using MongoDB for some UMLS-based information extraction experiments, and things were slower than I wanted. So I started looking into Neo4j. And your page came up soon enough on Google's search results! I was wondering, however, whether you also made use of the MRSTY.RRF file when creating the nodes.

Sujit Pal said...

I didn't make use of MRSTY for this experiment, but it makes sense to associate STY with CONSO using CUI, and have the sty code as another node attribute. That way you can roll up CUIs into STYs and get an idea of aggregate relationships. I had plans to do that but never got around to it.