Friday, May 6, 2011

Hadoop: map/reduce from HDFS

I may be wrong, but all(?) the examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep).

Is there a way to load and save the data on the Hadoop Distributed File System (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it?

Thanks.

From stackoverflow
  • JobConf conf = new JobConf(getConf(), ...);
    ...
    FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
    ...
    JobClient.runJob(conf);
    ...
    

    setInputPaths will do it.

    Pierre: Thanks, but it throws an exception saying that "file:/home/me/workspace/HADOOP/stored.xls" (this is a local path) doesn't exist. The file on HDFS is at '/user/me/stored.xls'. I also tried new Path("/user/me/stored.xls") and it doesn't work either.
    yogman: First off, it's strange that Hadoop complained about "file:" rather than "hdfs:". It might be that your hadoop-site.xml is misconfigured. Second, if it still doesn't work, mkdir input and put stored.xls in the "input" dir (all with the bin/hadoop fs command), and use new Path("input") instead of new Path("stored.xls").
    yogman: It also wouldn't hurt to reveal the command line you use to run the job.
  • Pierre, the default configuration for Hadoop is to run in local (standalone) mode rather than in distributed mode. You likely just need to modify some configuration in your hadoop-site.xml: it looks like your default filesystem is still the local file system, when it should be hdfs://youraddress:yourport. Check your setting for fs.default.name, and see the setup guide on Michael Noll's blog for more details. A driver sketch that pulls these pieces together follows this list.

  • FileInputFormat.setInputPaths(conf, new Path("hdfs://hostname:port/user/me/stored.xls"));

    This will do it.
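
Putting the answers together, here is a minimal, self-contained driver sketch against the old org.apache.hadoop.mapred API that Hadoop 0.19 uses. The namenode address hdfs://namenode:9000 and the class name HdfsJobDriver are placeholders of mine, and setting fs.default.name in code is only a stand-in for fixing hadoop-site.xml, which is the cleaner solution. With the defaults (identity mapper and reducer over TextInputFormat), the job simply copies the records of stored.xls into the output directory.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class HdfsJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(HdfsJobDriver.class);
            conf.setJobName("read-from-hdfs");

            // Stand-in for a properly configured hadoop-site.xml: if
            // fs.default.name still points at the local file system,
            // relative paths resolve to file:/..., which is exactly the
            // error Pierre saw. "namenode:9000" is a placeholder.
            conf.set("fs.default.name", "hdfs://namenode:9000");

            // With HDFS as the default file system, a relative path
            // resolves to hdfs://namenode:9000/user/<you>/stored.xls.
            // A fully qualified hdfs:// URI would work either way.
            FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
            FileOutputFormat.setOutputPath(conf, new Path("output"));

            // TextInputFormat (the default) emits LongWritable offsets
            // and Text lines; the default identity mapper and reducer
            // pass them straight through to the output.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            JobClient.runJob(conf);
        }
    }

Run it with something like bin/hadoop jar myjob.jar HdfsJobDriver after putting stored.xls on HDFS; note that the output directory must not already exist, or runJob will fail.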
