Exploring Files
Log into newton. Make a directory for this class.
$ mkdir lesson2
$ cd lesson2
Copy in a file to use for the lesson
$ cp /lustre/projects/staton/teaching/epp622/2/shell-novice-data.zip .
$ cp /lustre/projects/staton/teaching/epp622/2/green_ash.tar.gz .
$ cp /lustre/projects/staton/teaching/epp622/2/mangled_reads/GA3_S3.R2.trimmed.paired.fastq.gz .
To see the sizes of all files in the directory (as well as a lot of other info), use ls -l
$ ls -l
To see it in human-readable format, use ls -lh
$ ls -lh
This will only show you the size of the folder record itself, not the contents of the folder. To see the sizes of the folders, including their contents:
$ du -skh *
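As a quick illustration of the difference (the directory and file names here are invented for the demo), compare what ls reports for a directory with what du reports:

```shell
# ls -ldh shows the directory entry itself; du -skh sums everything inside it
mkdir -p du_demo
printf 'some data\n' > du_demo/file.txt
ls -ldh du_demo        # size of the directory record (often 4.0K)
du -skh du_demo        # total size of the directory plus its contents
```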
You will commonly run into files that are compressed. The ending of the filename will tell you the type of compression and, based on that information, you can run a command line utility to decompress. The most common compression types: file.gz, file.tar.gz, file.zip
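If the extension is missing or you're unsure, the file utility can often identify the compression type from the file's contents. A minimal sketch (sample.txt is a made-up name for the demo):

```shell
# Compress a small file, then ask `file` what kind of data it holds
printf 'hello\n' > sample.txt
gzip sample.txt              # replaces sample.txt with sample.txt.gz
file sample.txt.gz           # reports something like "gzip compressed data"
```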
This file has been compressed by zip (as indicated by the .zip extension). We need to un-zip it in order to read or use it:
$ unzip shell-novice-data.zip
$ ls
Note that the zip file does not disappear. It's still available.
Another common type of compressed file uses gzip. Gzip is a better version of zip (and you'll see it much more often). Extension is .gz.
$ gunzip GA3_S3.R2.trimmed.paired.fastq.gz
$ ls
Note that the gzip file DOES disappear.
Another common type of compressed file uses both tar and gzip. Tar is a way to collect files and directories into one "archive". When used together, the result file is called something like archive.tar.gz
$ tar xvzf green_ash.tar.gz
The "xvzf" part of the above command is a set of flags (even though they don't start with a -):
x = extract
v = verbose
z = use gzip
f = use the file I'm about to give you (instead of another type of input).
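A related flag worth knowing is t (list), which shows what's inside an archive without extracting it. A small sketch (demo_dir is a made-up name for the demo):

```shell
# Build a tiny archive, then list its contents without unpacking it
mkdir -p demo_dir
printf 'data\n' > demo_dir/a.txt
tar czf demo_dir.tar.gz demo_dir/
tar tzf demo_dir.tar.gz        # t = list: prints demo_dir/ and demo_dir/a.txt
```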
The tar command can also be used to create a compressed archive.
$ tar cvzf data.tar.gz data/
It's polite (but not enforced) to:
- put everything into a single directory prior to using tar/gzip (so files don't explode all over the place into your current directory)
- name your tar file the same thing as the directory (i.e., green_ash.tar.gz creates a directory called green_ash)
Now that we know a few basic commands, we can finally look at the shell's most powerful feature: the ease with which it lets us combine existing programs in new ways. We'll start with a directory called molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
Move into the data directory
$ cd data
$ cd users/nelle/molecules
$ ls
Look at what files are in the molecules folder:
cubane.pdb ethane.pdb methane.pdb
octane.pdb pentane.pdb propane.pdb
Let's go into that directory with cd and run the command wc *.pdb. wc is the "word count" command: it counts the number of lines, words, and characters in files. The * in *.pdb matches zero or more characters, so the shell turns *.pdb into a complete list of .pdb files:
$ cd molecules
$ wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
The first column is the number of lines, the second is the number of words and the third is the number of characters.
* is a wildcard. It matches zero or more characters, so *.pdb matches ethane.pdb, propane.pdb, and every file that ends with '.pdb'. On the other hand, p*.pdb only matches pentane.pdb and propane.pdb, because the 'p' at the front only matches filenames that begin with the letter 'p'.
We can use any number of wildcards at a time: for example, p*.p* matches anything that starts with a 'p' and contains a '.p'. Thus, p*.p* would match preferred.practice, and even p.pi (since the first '*' can match no characters at all), but not quality.practice (doesn't start with 'p').
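You can see exactly what the shell expands a wildcard into by handing it to echo (the files below are created just for this demo):

```shell
# The shell expands the glob before echo ever runs, so echo prints the matches
mkdir -p glob_demo
touch glob_demo/ethane.pdb glob_demo/pentane.pdb glob_demo/propane.pdb glob_demo/notes.txt
echo glob_demo/*.pdb     # all three .pdb files, but not notes.txt
echo glob_demo/p*.pdb    # only pentane.pdb and propane.pdb
```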
If we run wc -l instead of just wc, the output shows only the number of lines per file:
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
We can also use -w
to get only the number of words,
or -c
to get only the number of characters.
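A quick way to see all three flags at work is on a file whose contents you know exactly (demo.txt is invented for this sketch):

```shell
# Two lines, five words, 24 characters (including the two newlines)
printf 'one two three\nfour five\n' > demo.txt
wc -l demo.txt    # lines
wc -w demo.txt    # words
wc -c demo.txt    # characters (bytes)
```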
Which of these files is shortest? It's an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >, tells the shell to redirect the command's output to a file instead of printing it to the screen. The shell will create the file if it doesn't exist, or overwrite the contents of that file if it does. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) ls confirms that the file exists:
$ ls
We can now send the content of lengths.txt to the screen using cat lengths.txt. cat stands for "concatenate": it prints the contents of files one after another. There's only one file in this case, so cat just shows us what it contains:
$ cat lengths.txt
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
Now let's use the sort
command to sort its contents.
We will also use the -n flag to specify that the sort is
numerical instead of alphabetical.
This does not change the file;
instead, it sends the sorted result to the screen:
$ sort -n lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we've done that, we can run another command called head to get the first few lines in sorted-lengths.txt:
$ sort -n lengths.txt > sorted-lengths.txt
$ head -1 sorted-lengths.txt
9 methane.pdb
Using the parameter -1 with head tells it that we only want the first line of the file; -20 would get the first 20, and so on. The head command with no parameters prints the first 10 lines.
$ head sorted-lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
If there are fewer than 10 lines in the file, it will print all the lines and stop.
Tail is just like head, but returns the last lines of a file instead of the first. The tail command with no parameters prints the last 10 lines.
$ tail -1 sorted-lengths.txt
107 total
The command uniq
removes adjacent duplicated lines from its input.
For example, if a file salmon.txt
contains:
coho
coho
steelhead
coho
steelhead
steelhead
then uniq salmon.txt
produces:
coho
steelhead
coho
steelhead
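Because uniq only collapses adjacent duplicates, it is commonly paired with sort when you want each line to appear exactly once. This sketch recreates salmon.txt just for the demo:

```shell
# Sorting groups identical lines together, so uniq then removes all duplicates
printf 'coho\ncoho\nsteelhead\ncoho\nsteelhead\nsteelhead\n' > salmon.txt
sort salmon.txt | uniq     # coho and steelhead, each printed once
```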
Even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what's going on. We can make it easier to understand by running sort and head together:
$ sort -n lengths.txt | head -1
9 methane.pdb
The vertical bar between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right. The computer might create a temporary file if it needs to, or copy data from one program to the other in memory, or something else entirely; we don't have to know or care.
We can use another pipe to send the output of wc directly to sort, which then sends its output to head:
$ wc -l *.pdb | sort -n | head -1
9 methane.pdb
This is exactly like a mathematician nesting functions like log(3x) and saying "the log of three times x". In our case, the calculation is "head of sort of line count of *.pdb".
Here's what actually happens behind the scenes when we create a pipe. When a computer runs a program (any program) it creates a process in memory to hold the program's software and its current state. Every process has an input channel called standard input. (By this point, you may be surprised that the name is so memorable, but don't worry: most Unix programmers call it "stdin".) Every process also has a default output channel called standard output (or "stdout").
The shell is actually just another program. Under normal circumstances, whatever we type on the keyboard is sent to the shell on its standard input, and whatever it produces on standard output is displayed on our screen. When we tell the shell to run a program, it creates a new process and temporarily sends whatever we type on our keyboard to that process's standard input, and whatever the process sends to standard output to the screen.
Here's what happens when we run wc -l *.pdb > lengths.txt. The shell starts by telling the computer to create a new process to run the wc program. Since we've provided some filenames as parameters, wc reads from them instead of from standard input. And since we've used > to redirect output to a file, the shell connects the process's standard output to that file.
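You can watch this difference between filename parameters and standard input directly (two_lines.txt is a made-up file for the sketch):

```shell
# With a filename, wc names the file in its output; reading stdin, it cannot
printf 'a\nb\n' > two_lines.txt
wc -l two_lines.txt       # prints the count and the filename
wc -l < two_lines.txt     # prints only the count: wc read standard input
```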
If we run wc -l *.pdb | sort -n instead, the shell creates two processes (one for each command in the pipe) so that wc and sort run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there's no redirection with >, sort's output goes to the screen.
And if we run wc -l *.pdb | sort -n | head -1, we get three processes with data flowing from the files, through wc to sort, and from sort through head to the screen.
This simple idea is why Unix has been so successful.
Instead of creating enormous programs that try to do many different things,
Unix programmers focus on creating lots of simple tools that each do one job well,
and that work well with each other.
This programming model is called "pipes and filters".
We've already seen pipes;
a filter is a program like wc
or sort
that transforms a stream of input into a stream of output.
Almost all of the standard Unix tools can work this way:
unless told to do otherwise,
they read from standard input,
do something with what they've read,
and write to standard output.
The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
Using > to redirect a program's output will overwrite any existing file with the same name. To avoid this, you can use the >> operator, which appends the output to the end of the file if it already exists.
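A small sketch of the difference (log.txt is an invented name for the demo):

```shell
# > truncates the file each time; >> appends to whatever is already there
echo "first"  >  log.txt
echo "second" >> log.txt   # log.txt now holds first and second
echo "third"  >  log.txt   # > overwrites: log.txt now holds only third
```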
There are two common tools that allow you to download files over HTTP, HTTPS, and FTP (i.e., from the internet): one is wget, the other is curl. We'll mostly use wget in this course.
wget http://www.hardwoodgenomics.org/sites/default/files/gSSRs/white_oak_hqSSR_seqs.fasta
If you need more information about a command (what it does, what parameters it takes, or what additional functionality it offers), use the man command. man stands for "manual".
man wget
To exit, type the letter q. To scroll a single line, use enter or the arrow keys. Also:
space bar = forward one page
f = forward one page
b = back one page
h = display help page
Search engines such as Google are another resource for looking up commands. ("Google Fu").
Another way to view a file other than cat is the command less. This works a lot like the man pages. Try out
less white_oak_hqSSR_seqs.fasta
The same commands (q to exit, f for forward one page, b for back one page, etc) will work while using less.
All files and directories have an owner and a group. When you create a file or make a copy of a file to your own workspace, you are (usually) the owner. You can check this with ls -l.
To view permissions, you can also use the information from ls -l
. This gives you the permissions in a 9 letter format, such as 'rwxrwxrwx' or 'r--r--r--'. The first three letters indicate permissions for the file owner (r for read, w for write and x for execute). The next three letters indicate the permissions for the group, and the final three letters for the world.
Let's add write permissions to the pdb files.
chmod a+w *pdb
ls -l
This is dangerous, as now anyone on newton could overwrite your files! Let's change it back to no write permissions.
chmod a-w *pdb
ls -l
If you need to edit a file, you can give only yourself (not everyone) write permissions:
chmod u+w cubane.pdb
ls -l
This material is made available under the Creative Commons Attribution license. Much of the material has been adapted from lessons by Software Carpentry.