Friday, May 6, 2011

Hadoop: map/reduce from HDFS

I may be wrong, but all(?) the examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep).

Is there a way to load and save the data on the Hadoop Distributed File System (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it?

Thanks.

From stackoverflow
  • JobConf conf = new JobConf(getConf(), ...);
    ...
    FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
    ...
    JobClient.runJob(conf);
    ...
    

    setInputPaths will do it.

    Pierre: Thanks, but it throws an exception saying that "file:/home/me/workspace/HADOOP/stored.xls" (this is a local path) doesn't exist. The file on HDFS is at '/user/me/stored.xls'. I also tried new Path("/user/me/stored.xls") and it doesn't work either.
    yogman: First off, it's strange that Hadoop complained about "file:" rather than "hdfs:". It might be that your hadoop-site.xml is misconfigured. Second, if it still doesn't work, mkdir input and put stored.xls in the "input" dir (all with the bin/hadoop fs command), and use new Path("input") instead of new Path("stored.xls").
    yogman: It also wouldn't hurt to reveal the command line you use to run the job.
  • Pierre, the default configuration for Hadoop is to run in local (standalone) mode rather than in distributed mode. You likely just need to modify some configuration in your hadoop-site.xml: it looks like your default filesystem is still the local file system, when it should be hdfs://youraddress:yourport. Check your setting for fs.default.name, and see the setup guide on Michael Noll's blog for more details. A driver sketch that pulls these pieces together follows this list.

  • FileInputFormat.setInputPaths(conf, new Path("hdfs://hostname:port/user/me/stored.xls"));

    This will do it.
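
Putting the answers together, here is a minimal, self-contained driver sketch against the old org.apache.hadoop.mapred API that Hadoop 0.19 uses. The namenode address hdfs://namenode:9000 and the class name HdfsJobDriver are placeholders of mine, and setting fs.default.name in code is only a stand-in for fixing hadoop-site.xml, which is the cleaner solution. With the defaults (identity mapper and reducer over TextInputFormat), the job simply copies the records of stored.xls into the output directory.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class HdfsJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(HdfsJobDriver.class);
            conf.setJobName("read-from-hdfs");

            // Stand-in for a properly configured hadoop-site.xml: if
            // fs.default.name still points at the local file system,
            // relative paths resolve to file:/..., which is exactly the
            // error Pierre saw. "namenode:9000" is a placeholder.
            conf.set("fs.default.name", "hdfs://namenode:9000");

            // With HDFS as the default file system, a relative path
            // resolves to hdfs://namenode:9000/user/<you>/stored.xls.
            // A fully qualified hdfs:// URI would work either way.
            FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
            FileOutputFormat.setOutputPath(conf, new Path("output"));

            // TextInputFormat (the default) emits LongWritable offsets
            // and Text lines; the default identity mapper and reducer
            // pass them straight through to the output.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            JobClient.runJob(conf);
        }
    }

Run it with something like bin/hadoop jar myjob.jar HdfsJobDriver after putting stored.xls on HDFS; note that the output directory must not already exist, or runJob will fail.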
