        •

          Hi Michael,
          Great tutorial ...

          One question: you mentioned that Hadoop does the sorting and splitting of the map output files. If such a split falls in the middle of a run of the same word, will the final reduce output have two entries for that word? Does Hadoop take care of this detail when it splits the map output?

          Example:

          w 1
          w 1
          ------- (if the file is split here, then the final reduced output file will have "w 2" followed by "w 3")
          w 1
          w 1
          w 1
          x 1
          x 1
          ....
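
          My understanding so far is that the partitioner decides which reducer each key goes to, with something like the sketch below, so all the "w" records should end up in the same reducer no matter where the file is split. Is that correct?

          # Sketch of my understanding of the default partitioning rule:
          # records with the same key always go to the same reduce partition,
          # regardless of where the map output file happens to be split.
          def partition(word, num_reducers):
              return hash(word) % num_reducers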

            •

              I've got a problem with the MapReduce Python code... the error is shown below:


              hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.0.jar -file /home/hduser/hadoop/mapper.py -mapper /home/hduser/hadoop/mapper.py -file /home/hduser/hadoop/reducer.py -reducer /home/hduser/hadoop/reducer.py -input /home/hduser/gutenberg/* -output /home/hduser/output3
              Warning: $HADOOP_HOME is deprecated.

              packageJobJar: [/home/hduser/hadoop/mapper.py, /home/hduser/hadoop/reducer.py, /app/hadoop/tmp/hadoop-unjar2090300167280691382/] [] /tmp/streamjob2369339998637272450.jar tmpDir=null
              12/04/09 13:58:30 INFO mapred.FileInputFormat: Total input paths to process : 2
              12/04/09 13:58:30 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
              [...snipp...]
              12/04/09 13:59:09 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201204091339_0004_m_000000
              12/04/09 13:59:09 INFO streaming.StreamJob: killJob...
              Streaming Job Failed!

                •

                  Michael,
                  I can't thank you enough for your single-cluster tutorial and this one. I am a complete newcomer to Hadoop, and I was able to get up and running in a few hours thanks to you!
                  Also, a minor nitpick: just wanted to point out that the MapReduce programs could be shorter if you used the collections.Counter object provided by the Python standard library. Here's a working solution that I used:
                  mapper.py

                  import sys

                  def run_map(f):
                      for line in f:
                          data = line.rstrip().split()
                          for word in data:
                              print(word)

                  if __name__ == '__main__':
                      run_map(sys.stdin)

                  reducer.py

                  import sys
                  from collections import Counter

                  def run_reduce(f):
                      cnt = Counter()
                      for line in f:
                          data = line.rstrip()
                          cnt[data] += 1
                      for word, count in cnt.items():
                          print(word, ':', count)

                  if __name__ == '__main__':
                      run_reduce(sys.stdin)
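
                  You can check it locally with a pipe, e.g. echo "foo foo quux labs foo bar quux" | python mapper.py | python reducer.py -- with the Counter version the sort step in between isn't even needed, since all the counts are kept in memory.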

                  Thanks again,
                  Chandrakant

                    •

                      Hi,

                      I am getting the following error. Any suggestions would be highly helpful.

                      Edited by Michael G. Noll: I have moved your long logging output to https://gist.github.com/158799....


                      lrmraxm:hadoop-0.20.2-cdh3u2 anuj.maurice$ bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar -file /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/mapper.py -mapper mapper.py -file /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/reducer.py -reducer reducer.py -input /oos.txt -output /oos_new
                      packageJobJar: [/Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/mapper.py, /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/python/reducer.py, /tmp/hadoop-anuj.maurice/hadoop-unjar2426556812178658809/] [] /var/folders/Yu/YuXibLtIHOuWcHsjWu8zM-Ccvdo/-Tmp-/streamjob4679204253733026415.jar tmpDir=null

                      [...snip...]

                      12/01/04 12:03:04 INFO streaming.StreamJob: map 100% reduce 100%
                      12/01/04 12:03:04 INFO streaming.StreamJob: To kill this job, run:
                      12/01/04 12:03:04 INFO streaming.StreamJob: /Users/anuj.maurice/Downloads/hadoop-0.20.2-cdh3u2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201201041122_0004
                      12/01/04 12:03:04 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201201041122_0004
                      12/01/04 12:03:04 ERROR streaming.StreamJob: Job not successful. Error: NA
                      12/01/04 12:03:04 INFO streaming.StreamJob: killJob...
                      Streaming Command Failed!

                        • Hi Michael,

                          I was trying to execute this streaming job example, and I am getting the following error when I run it.

                          hduser@ip-xxx-xxx-xxx-xxx:/usr/local/hadoop/conf$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/custom_scripts/mapper.py -reducer /home/hduser/custom_scripts/reducer.py -input /user/hduser/input2/* -output /user/hduser/output_17_4_4
                          [...]
                          13/04/17 06:48:16 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201304170627_0003_m_000000
                          13/04/17 06:48:16 INFO streaming.StreamJob: killJob...
                          Streaming Command Failed!
                          

                          Can you please suggest any solution?

                            •

                              There is a comment in the reducer program saying that the output of the mapper is sorted by key. Is this really relevant? Isn't the reducer supposed to get all the key-value pairs with the same key, or am I missing something here?
                              Why can't we simply sum up the values of all the key-value pairs that come into a single reducer?
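
                              For reference, the pattern I mean from the tutorial looks roughly like this (my paraphrase, not the exact code):

                              #!/usr/bin/env python
                              import sys

                              # this reducer tracks only the *current* key, so it depends on the
                              # input stream being sorted by key; identical keys scattered through
                              # the stream would produce several partial counts for the same word
                              current_word, current_count = None, 0
                              for line in sys.stdin:
                                  word, count = line.strip().split('\t', 1)
                                  if word == current_word:
                                      current_count += int(count)
                                  else:
                                      if current_word is not None:
                                          print('%s\t%d' % (current_word, current_count))
                                      current_word, current_count = word, int(count)
                              if current_word is not None:
                                  print('%s\t%d' % (current_word, current_count))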

                                • Great job, Michael. I am a Java developer and have never worked with Python before. New to Hadoop as well. I have gained a lot through your tutorials. I did whatever you suggested and it worked like a charm. I then thought of applying the code to a tab-delimited file, so I changed your mapper to the following.


                                  #!/usr/bin/env python
                                  import sys

                                  # input comes from STDIN (standard input)
                                  for line in sys.stdin:
                                      # remove leading and trailing whitespace
                                      line = line.strip()
                                      # split the line into words
                                      words = line.split('\t')
                                      # increase counters
                                      for word in words:
                                          # write the results to STDOUT (standard output);
                                          # what we output here will be the input for the
                                          # Reduce step, i.e. the input for reducer.py
                                          #
                                          # tab-delimited; the trivial word count is 1
                                          print >>sys.stdout, '%s\t%s' % (word, 1)

                                  I checked the code with


                                  echo -e "foo\tfoo\t quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py

                                  and got the following result:


                                  foo 2
                                  quux labs foo bar quux 1

                                  Then I ran the MapReduce job, as suggested, on a small file containing tab-delimited data.


                                  13/03/14 15:40:03 INFO streaming.StreamJob: map 0% reduce 0%
                                  13/03/14 15:40:14 INFO streaming.StreamJob: map 100% reduce 0%
                                  13/03/14 15:40:24 INFO streaming.StreamJob: map 100% reduce 33%
                                  13/03/14 15:40:28 INFO streaming.StreamJob: map 100% reduce 0%
                                  13/03/14 15:40:37 INFO streaming.StreamJob: map 100% reduce 33%
                                  13/03/14 15:40:42 INFO streaming.StreamJob: map 100% reduce 0%
                                  13/03/14 15:40:51 INFO streaming.StreamJob: map 100% reduce 33%
                                  13/03/14 15:40:55 INFO streaming.StreamJob: map 100% reduce 0%
                                  13/03/14 15:41:04 INFO streaming.StreamJob: map 100% reduce 33%
                                  13/03/14 15:41:09 INFO streaming.StreamJob: map 100% reduce 0%
                                  13/03/14 15:41:12 INFO streaming.StreamJob: map 100% reduce 100%
                                  13/03/14 15:41:12 INFO streaming.StreamJob: To kill this job, run:
                                  13/03/14 15:41:12 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201303121325_0044
                                  13/03/14 15:41:12 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303121325_0044
                                  13/03/14 15:41:12 ERROR streaming.StreamJob: Job not successful. Error: # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201303121325_0044_r_000000
                                  13/03/14 15:41:12 INFO streaming.StreamJob: killJob...
                                  Streaming Command Failed!

                                  The command runs fine if I replace the line words = line.split('\t') with words = line.split().

                                  The error output in the job report shows:


                                  "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1" alongwith other errors.

                                  Thanks in advance for your help, or anybody else's on this site.

                                    •

                                      Hey, I have a question.


                                      hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py

                                      or

                                      bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

                                      Can you tell me how I can get the input file name "20417-8.txt" in my mapper.py program? I am trying to write an inverted index program.
                                      I searched the Internet and people have suggested using os.environ["map_input_file"], but it doesn't seem to work.

                                      I am using hadoop-0.20.2 and Python 2.6.6.

                                      Please help.

                                        •

                                          Great tutorial.... good for newbies who want to get their hands on Hadoop!

                                            • Thank you so much, very interesting blog.
                                              I have one question: can we write a Hadoop MapReduce program with OpenCV?
                                              Or rather: I have an OpenCV program and I would like to use it in Hadoop. Is that possible or not?

                                                •

                                                  Thank you for such a thorough tutorial.

                                                  I set up a small cluster using multiple virtual machines on my computer. When I run the MapReduce command, the map task completes, but the reduce task gets stuck. I have checked and rechecked the Python code and there does not seem to be any problem. Any suggestion why this might be happening?

                                                    •

                                                      Hi Noll,

                                                      How can I parse and categorize system application log files on a Hadoop single-node cluster? Is there any MapReduce code for this?

                                                        •

                                                          >>The job will read all the files in the HDFS directory /user/hduser/gutenberg, process it, and store the results in a single result file in the HDFS directory /user/hduser/gutenberg-output.

                                                          Shouldn't it be one file per reducer in the output?

                                                          •

                                                            I like this tutorial very much. Good job, Michael.
                                                            Thank you.

                                                              •

                                                                Nithesh, you can sort it afterwards:

                                                                bash# sort -k 2 -n -r part-00000 | less
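
                                                                (-k 2 sorts on the second column, i.e. the count; -n compares numerically; -r reverses the order so the highest counts come first.)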

                                                                •

                                                                  This is hands down the best content I have come across yet for Map Reduce in the Hadoop environment. Very detailed and well written. Thanks so much for this!

                                                                    •

                                                                      Awesome post, dude. This tutorial on Hadoop is very helpful to me. Thanks a lot!

                                                                        •

                                                                          Thanks for the great set of tutorials on Hadoop, Michael.

                                                                          I had a question. This is in the context of a distributed setup involving many nodes, and several large files stored on them with some replication (say three duplicate blocks per block). Now, when I run a standard hadoop streaming task like this one, and I don't specify values for the number of map and reduce tasks through mapred.*.tasks, what is the default behaviour like? Does it create some parallelism on its own or does it end up spawning a single task to get the job done?

                                                                          Thanks again for the great articles.

                                                                            •

                                                                              It was a "wow" moment when I checked my part-00000 file!! Thanks for the nice tutorial

                                                                                •

                                                                                  Dear Michael,

                                                                                  - I am writing my code in Python. Can you please suggest how we can introduce cascading using Hadoop Streaming, without actually using the "Cascading" package?
                                                                                  - Do I need to save intermediate files in this case?

                                                                                  I tried searching for this on the internet but could not come up with a definite answer.

                                                                                  I have the following scenario: Map1->Red1->Map2->Red2. In other words, something like the rough sketch below, where the first job's HDFS output directory becomes the second job's input (the paths and the jar name are just placeholders):
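
                                                                                  import subprocess

                                                                                  STREAMING_JAR = "contrib/streaming/hadoop-streaming-1.0.4.jar"

                                                                                  def run_streaming_job(mapper, reducer, input_path, output_path):
                                                                                      # one streaming invocation per Map->Reduce stage; the intermediate
                                                                                      # result lives as files in HDFS between the two jobs
                                                                                      subprocess.check_call([
                                                                                          "bin/hadoop", "jar", STREAMING_JAR,
                                                                                          "-file", mapper, "-mapper", mapper,
                                                                                          "-file", reducer, "-reducer", reducer,
                                                                                          "-input", input_path,
                                                                                          "-output", output_path,
                                                                                      ])

                                                                                  run_streaming_job("map1.py", "red1.py", "/user/hduser/input", "/user/hduser/stage1")
                                                                                  run_streaming_job("map2.py", "red2.py", "/user/hduser/stage1", "/user/hduser/final")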

                                                                                    • Very useful tutorial.

                                                                                        • Hello,

                                                                                          Please fix the link http://hadoop.apache.org/core/... to http://hadoop.apache.org/docs/..., because the first link is broken.

                                                                                          •

                                                                                            Python has worker pools with a Pool object (http://docs.python.org/2/libra...) that has a 'map' function, which can work for both the map and the reduce functionality.
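
                                                                                            For instance, a small local (non-Hadoop) word count along those lines might look roughly like this:

                                                                                            from multiprocessing import Pool
                                                                                            from collections import Counter

                                                                                            def count_words(line):
                                                                                                # "map" step: per-line word counts
                                                                                                return Counter(line.split())

                                                                                            if __name__ == '__main__':
                                                                                                lines = ["foo foo quux", "labs foo bar quux"]
                                                                                                pool = Pool(4)                          # four worker processes
                                                                                                partial = pool.map(count_words, lines)  # parallel "map" phase
                                                                                                total = sum(partial, Counter())         # "reduce": merge the counts
                                                                                                print(total)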

                                                                                              •

                                                                                                Hello, Mike,

                                                                                                I was trying your code for the first easy mapper.py and reducer.py above.

                                                                                                For some reason, when I do
                                                                                                echo "foo foo quux labs foo bar quux" | python /home/hduser/mapper.py | sort -k1, 1 | python reducer.py

                                                                                                no results come out. No error messages, either. I do not know what's wrong.

                                                                                                I am using hadoop-1.0.4.tar.gz on Ubuntu 10.04.

                                                                                                Would you please advise me of how to fix this problem?

                                                                                                Thank you

                                                                                                •

                                                                                                  I am using Hadoop to process an XML file, so I have written the mapper and reducer files in Python.

                                                                                                  Suppose the input that needs to be processed is **test.xml**.

                                                                                                  **mapper.py** file


                                                                                                  import sys
                                                                                                  import cStringIO
                                                                                                  import xml.etree.ElementTree as xml

                                                                                                  if __name__ == '__main__':
                                                                                                      buff = None
                                                                                                      intext = False
                                                                                                      for line in sys.stdin:
                                                                                                          line = line.strip()
                                                                                                          if line.find("<row") != -1:
                                                                                                              .............
                                                                                                              .............
                                                                                                              .............
                                                                                                              print '%s\t%s' % (campaignID, adGroupID)

                                                                                                  **reducer.py** file


                                                                                                  import sys

                                                                                                  if __name__ == '__main__':
                                                                                                      for line in sys.stdin:
                                                                                                          print line.strip()

                                                                                                  I ran Hadoop with the following command:


                                                                                                  bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar
                                                                                                  -file /path/to/mapper.py -mapper /path/to/mapper.py
                                                                                                  -file /path/to/reducer.py -reducer /path/to/reducer.py
                                                                                                  -input /path/to/input_file/test.xml
                                                                                                  -output /path/to/output_folder/to/store/file

                                                                                                  When I run the above command, Hadoop correctly creates an output file at the output path, in the format specified in reducer.py, with the required data.

                                                                                                  Now, here is what I am trying to do: I don't want to store the output data in the text file that Hadoop creates by default when I run the above command; instead, I want to save the data into a MySQL database.

                                                                                                  So I wrote some Python code in reducer.py that writes the data directly to the MySQL database (a rough sketch of it follows), and tried to run the above command with the output path removed.
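
                                                                                                  The sketch of that reducer.py; the table name and connection credentials are just placeholders, and I am assuming the MySQLdb package:

                                                                                                  import sys
                                                                                                  import MySQLdb

                                                                                                  # connect once per reducer task (placeholder credentials)
                                                                                                  conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
                                                                                                  cur = conn.cursor()
                                                                                                  for line in sys.stdin:
                                                                                                      campaignID, adGroupID = line.strip().split('\t')
                                                                                                      cur.execute("INSERT INTO results (campaign_id, ad_group_id) VALUES (%s, %s)",
                                                                                                                  (campaignID, adGroupID))
                                                                                                  conn.commit()
                                                                                                  conn.close()

                                                                                                  The command with the output path removed: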


                                                                                                  bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar
                                                                                                  -file /path/to/mapper.py -mapper /path/to/mapper.py
                                                                                                  -file /path/to/reducer.py -reducer /path/to/reducer.py
                                                                                                  -input /path/to/input_file/test.xml

                                                                                                  And I get an error like the one below:


                                                                                                  12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
                                                                                                  Usage: $HADOOP_HOME/bin/hadoop jar \
                                                                                                  $HADOOP_HOME/hadoop-streaming.jar [options]
                                                                                                  Options:
                                                                                                  -input DFS input file(s) for the Map step
                                                                                                  -output DFS output directory for the Reduce step
                                                                                                  -mapper The streaming command to run
                                                                                                  -combiner The streaming command to run
                                                                                                  -reducer The streaming command to run
                                                                                                  -file File/dir to be shipped in the Job jar file
                                                                                                  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
                                                                                                  -outputformat TextOutputFormat(default)|JavaClassName Optional.
                                                                                                  .........................
                                                                                                  .........................

                                                                                                  1. So, after all this, my question is: how do I save the data into the database after processing the files?
                                                                                                  2. In which file (mapper.py or reducer.py?) should I write the code that stores the data in the database?
                                                                                                  3. Which command is used to run Hadoop so that it saves the data into the database? When I removed the output folder path from the Hadoop command, it showed an error.

                                                                                                  Can anyone please help me solve the above problem?

                                                                                                    •

                                                                                                      Regarding my previous comment: it was the basic mapper.py and reducer.py, not the ones making use of iterators/generators...

                                                                                                      FM.

                                                                                                        •

                                                                                                          In re: timing...I ran the wordcount example on the 3 Gutenberg texts:

                                                                                                          - using straight Hadoop (Java mapper & reducer, no streaming): 38s
                                                                                                          - using mapper.py and reducer.py from above with Hadoop streaming: 44s

                                                                                                          Not a very big timing hit at all.

                                                                                                          FM.

                                                                                                            •

                                                                                                              Very good tutorial.

                                                                                                                •

                                                                                                                  Thank you very much for your time on this tutorial, Michael. It's super neat and very well explained; it went smooth as silk on Ubuntu 12.04.1 ;)

                                                                                                                  Cheers
                                                                                                                  Fabio

                                                                                                                    •

                                                                                                                      Excellent tutorial. Thank you very much, Michael.

                                                                                                                        •

                                                                                                                          Fantastic tutorial, thanks so much! What is your sense of the performance impact of using Hadoop Streaming versus a custom jar, over and above the impact of using an interpreted language like Python?

                                                                                                                          •

                                                                                                                            Thanks for the tutorial, mate. I used this to help me make some awesome applications.

                                                                                                                              •

                                                                                                                                I'm a newbie to Hadoop and was trying out this example. Somehow I'm getting the following error:

                                                                                                                                [root@localhost src]# hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-*streaming*.jar -file /data/hduser/src/mapper.py -mapper /data/hduser/src/mapper.py -file /data/hduser/src/reducer.py -reducer /data/hduser/src/reducer.py -input /data/hduser/gutenberg/* -output /data/hduser/gutenberg-output/
                                                                                                                                File: /data/hduser/src/mapper.py does not exist, or is not readable.
                                                                                                                                Streaming Command Failed!
                                                                                                                                [root@localhost src]#
                                                                                                                                [root@localhost src]# hadoop dfs -ls /data/hduser/src
                                                                                                                                Found 2 items
                                                                                                                                -rw-r--r-- 1 cloudera supergroup 591 2012-07-07 15:27 /data/hduser/src/mapper.py
                                                                                                                                -rw-r--r-- 1 cloudera supergroup 1129 2012-07-07 15:27 /data/hduser/src/reducer.py

                                                                                                                                But the path is correct. Can anybody please help?

                                                                                                                                •

                                                                                                                                  Thanks a lot for such a brilliant tutorial.

                                                                                                                                    •

                                                                                                                                      A brilliant tutorial - perhaps one of the best I have come across.

                                                                                                                                        •

                                                                                                                                          Awesome tutorial, very well explained.

                                                                                                                                          I am left with a single question: how would I sort this output file in descending order of count (word with the highest count appears first)?

                                                                                                                                          Any help is much appreciated.

                                                                                                                                            •

                                                                                                                                              The best and simplest Hadoop tutorial. I had a very successful run on the first try, on Ubuntu 11.04 server VMs running in OpenNebula. I automated most of the configuration through contextualization, which made it much faster to set up and run.
                                                                                                                                              Thanks a lot once again for the tutorial.

                                                                                                                                                •

                                                                                                                                                  Hi, can anyone help me?

                                                                                                                                                  I am working on a project where I search for a word in a collection of files using a simple Python program. I need to convert it into a parallel process using MapReduce (Hadoop).

                                                                                                                                                    •

                                                                                                                                                      Disregard my earlier posts -- it works fine under the correct user.

                                                                                                                                                      Thanks very much for this article!

                                                                                                                                                        •

                                                                                                                                                          Regarding my earlier question, could it simply be that such an early failure is due to Python not being installed on all of the nodes of the Hadoop cluster?

                                                                                                                                                            •

                                                                                                                                                              I am getting, as another poster (Anuj) was, the "Streaming Command Failed!" message. However, mine apparently happens much sooner: I have not gotten beyond the following:


                                                                                                                                                              packageJobJar: [/home/jmiller/wordcount/mapper.py, /home/jmiller/wordcount/reducer.py, /tmp/hadoop-jmiller/hadoop-unjar8760597989207755800/] [] /tmp/streamjob1624565313346212981.jar tmpDir=null
                                                                                                                                                              Streaming Command Failed!

                                                                                                                                                              It seems to me that the absence of the input directory or insufficient permissions might cause this failure, but the directory does exist in HDFS, and the permissions are rwx for everyone on the directory and its contents. The same is true of the output directory.

                                                                                                                                                              Could the input file be in the wrong format? Is there a place where more error info would be displayed?

                                                                                                                                                              Thanks,

                                                                                                                                                              Jeff

                                                                                                                                                                •

                                                                                                                                                                  I want to write a Hadoop streaming job for MapReduce; however, I don't know how to get the name of the file that the current input line comes from. How can I do that?

                                                                                                                                                                    •

                                                                                                                                                                      @Sachin: Hadoop sets job configuration parameters as environment variables when streaming is used. For instance, <tt>os.environ["mapred_job_id"]</tt> gives you the <tt>mapred.job.id</tt> configuration property. Off the top of my head, the name of the input file will be in <tt>os.environ["map_input_file"]</tt>. Note that Hadoop replaces non-alphanumeric characters such as dots "." with underscores.

                                                                                                                                                                      See Tom White's book Hadoop: The Definitive Guide (2nd ed.), page 187 for more information.
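
                                                                                                                                                                      For example, a mapper could pick the file name up from the environment like this (a minimal sketch along those lines):

                                                                                                                                                                      #!/usr/bin/env python
                                                                                                                                                                      import os
                                                                                                                                                                      import sys

                                                                                                                                                                      # map.input.file is exported as the environment variable map_input_file
                                                                                                                                                                      input_file = os.environ.get("map_input_file", "unknown")
                                                                                                                                                                      for line in sys.stdin:
                                                                                                                                                                          for word in line.split():
                                                                                                                                                                              # emit "word <tab> filename" pairs, e.g. for an inverted index
                                                                                                                                                                              print('%s\t%s' % (word, input_file))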