
Exploring Files

Meg Staton edited this page Aug 24, 2016 · 8 revisions

Log into newton. Make a directory for this class.

$ mkdir lesson2
$ cd lesson2

Copy in some files to use for the lesson

$ cp /lustre/projects/staton/teaching/epp622/2/shell-novice-data.zip .
$ cp /lustre/projects/staton/teaching/epp622/2/green_ash.tar.gz .
$ cp /lustre/projects/staton/teaching/epp622/2/mangled_reads/GA3_S3.R2.trimmed.paired.fastq.gz .

File Sizes

To see the sizes of all files in the directory (as well as a lot of other info), use ls -l

$ ls -l

To see it in human-readable format, use ls -lh

$ ls -lh

For a directory, this will only show you the size of the directory entry itself, not its contents. To see the sizes of directories including their contents, use du:

$ du -sh *

Compress/Decompress

You will commonly run into files that are compressed. The ending of the filename will tell you the type of compression and, based on that information, you can run a command line utility to decompress. The most common types are file.gz, file.tar.gz, and file.zip.

zip

This file has been compressed by zip (as indicated by the .zip extension). We need to un-zip it in order to read or use it:

$ unzip shell-novice-data.zip
$ ls	

Note that the zip file does not disappear. It's still available.

gzip

Another common type of compressed file uses gzip, indicated by the .gz extension. gzip generally compresses better than zip, and you'll see it much more often on Unix systems.

$ gunzip GA3_S3.R2.trimmed.paired.fastq.gz
$ ls	

Note that the gzip file DOES disappear.

gzip and tar

Another common type of compressed file uses both tar and gzip. tar is a way to collect files and directories into one "archive". When used together, the resulting file is named something like archive.tar.gz.

$ tar xvzf green_ash.tar.gz

The "xvzf" part of the above command is a set of flags (even though they don't start with a -):

x = extract

v = verbose

z = use gzip

f = use the file I'm about to give you (instead of another type of input).

The tar command can also be used to create a compressed archive.

$ tar cvzf data.tar.gz data/

It's polite (but not enforced) to:

  • put everything into a single directory prior to using tar/gzip (so files don't explode all over the place into your current directory)
  • name your tar file after the directory it contains (i.e. green_ash.tar.gz creates a directory called green_ash)
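As a sketch of a cautious workflow with the green_ash.tar.gz file copied earlier: the t flag lists an archive's contents instead of extracting them, and -C extracts into a directory of your choosing.

```shell
# Peek inside the archive without extracting it (t = list instead of x = extract)
tar tzf green_ash.tar.gz

# If the entries are NOT all under one top-level directory, extract into a
# fresh directory with -C so files don't land in your current directory
mkdir green_ash_contents
tar xvzf green_ash.tar.gz -C green_ash_contents
```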

Now that we know a few basic commands, we can finally look at the shell's most powerful feature: the ease with which it lets us combine existing programs in new ways. We'll start with a directory called molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.

Move into the molecules directory inside the unzipped data

$ cd data
$ cd users/nelle/molecules
$ ls

Look at what files are in the molecules folder

cubane.pdb    ethane.pdb    methane.pdb
octane.pdb    pentane.pdb   propane.pdb

wc

Let's run the command wc *.pdb. wc is the "word count" command: it counts the number of lines, words, and characters in files. The * in *.pdb matches zero or more characters, so the shell turns *.pdb into a complete list of .pdb files:

$ wc *.pdb
  20  156 1158 cubane.pdb
  12   84  622 ethane.pdb
   9   57  422 methane.pdb
  30  246 1828 octane.pdb
  21  165 1226 pentane.pdb
  15  111  825 propane.pdb
 107  819 6081 total

The first column is the number of lines, the second is the number of words and the third is the number of characters.

Wildcards

* is a wildcard. It matches zero or more characters, so *.pdb matches ethane.pdb, propane.pdb, and every file that ends with '.pdb'. On the other hand, p*.pdb only matches pentane.pdb and propane.pdb, because the 'p' at the front only matches filenames that begin with the letter 'p'.

We can use any number of wildcards at a time: for example, p*.p* matches anything that starts with a 'p' and contains a '.p'. Thus, p*.p* would match preferred.practice, and even p.pi (since the first '*' can match no characters at all), but not quality.practice (doesn't start with 'p').
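You can experiment with wildcards safely using echo, since the shell expands the pattern into matching filenames before echo ever runs. A sketch, using throwaway files in a scratch directory:

```shell
mkdir wildcard-test && cd wildcard-test
touch preferred.practice p.pi quality.practice

# The shell expands p*.p* before echo runs, so echo just prints the matches:
# p.pi and preferred.practice appear, quality.practice does not
echo p*.p*

cd ..    # go back when done
```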

If we run wc -l instead of just wc, the output shows only the number of lines per file:

$ wc -l *.pdb
  20  cubane.pdb
  12  ethane.pdb
   9  methane.pdb
  30  octane.pdb
  21  pentane.pdb
  15  propane.pdb
 107  total

We can also use -w to get only the number of words, or -c to get only the number of characters.

Which of these files is shortest? It's an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:

$ wc -l *.pdb > lengths.txt

output redirect

The greater than symbol, >, tells the shell to redirect the command's output to a file instead of printing it to the screen. The shell will create the file if it doesn't exist, or overwrite the contents of that file if it does. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) ls confirms that the file exists:

$ ls 
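A quick experiment shows the overwrite behavior; the filename demo.txt here is just a throwaway example.

```shell
echo "first run"  > demo.txt
echo "second run" > demo.txt
cat demo.txt     # only "second run" remains; the first write was overwritten
```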

cat

We can now send the content of lengths.txt to the screen using cat lengths.txt. cat stands for "concatenate": it prints the contents of files one after another. There's only one file in this case, so cat just shows us what it contains:

$ cat lengths.txt
  20  cubane.pdb
  12  ethane.pdb
   9  methane.pdb
  30  octane.pdb
  21  pentane.pdb
  15  propane.pdb
 107  total
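To see the "concatenate" part of cat in action, give it more than one file and it prints them back to back. A sketch with two throwaway files:

```shell
echo "line from a" > a.txt
echo "line from b" > b.txt
cat a.txt b.txt    # prints the contents of a.txt, then b.txt
```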

sort

Now let's use the sort command to sort its contents. We will also use the -n flag to specify that the sort is numerical instead of alphabetical. This does not change the file; instead, it sends the sorted result to the screen:

$ sort -n lengths.txt
  9  methane.pdb
 12  ethane.pdb
 15  propane.pdb
 20  cubane.pdb
 21  pentane.pdb
 30  octane.pdb
107  total
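To see why the -n flag matters, compare the default alphabetical sort with the numerical one on a small throwaway file. Alphabetically, '10' sorts before '9' because characters are compared one at a time, and '1' comes before '9'.

```shell
printf '10\n9\n2\n' > nums.txt
sort nums.txt      # alphabetical order: 10, 2, 9
sort -n nums.txt   # numerical order: 2, 9, 10
```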

We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we've done that, we can run another command called head to get the first few lines in sorted-lengths.txt:

$ sort -n lengths.txt > sorted-lengths.txt
$ head -1 sorted-lengths.txt
  9  methane.pdb

head

Using the parameter -1 with head tells it that we only want the first line of the file; -20 would get the first 20, and so on. The head command with no parameters prints the first 10 lines.

$ head sorted-lengths.txt
   9 methane.pdb
  12 ethane.pdb
  15 propane.pdb
  20 cubane.pdb
  21 pentane.pdb
  30 octane.pdb
 107 total

If there are fewer than 10 lines in the file, head prints all of them and stops.

tail

Tail is just like head, but returns the last lines of a file instead of the first. The tail command with no parameters prints the last 10 lines.

$ tail -1 sorted-lengths.txt
 107 total

uniq

The command uniq removes adjacent duplicated lines from its input. For example, if a file salmon.txt contains:

coho
coho
steelhead
coho
steelhead
steelhead

then uniq salmon.txt produces:

coho
steelhead
coho
steelhead

pipe

Even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what's going on. We can make it easier to understand by running sort and head together:

$ sort -n lengths.txt | head -1
  9  methane.pdb

The vertical bar between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right. The computer might create a temporary file if it needs to, or copy data from one program to the other in memory, or something else entirely; we don't have to know or care.

We can use another pipe to send the output of wc directly to sort, which then sends its output to head:

$ wc -l *.pdb | sort -n | head -1
  9  methane.pdb

This is exactly like a mathematician nesting functions like log(3x) and saying "the log of three times x". In our case, the calculation is "head of sort of line count of *.pdb".

standard input and standard output (stdin and stdout)

Here's what actually happens behind the scenes when we create a pipe. When a computer runs a program --- any program --- it creates a process in memory to hold the program's software and its current state. Every process has an input channel called standard input. (By this point, you may be surprised that the name is so memorable, but don't worry: most Unix programmers call it "stdin".) Every process also has a default output channel called standard output (or "stdout").

The shell is actually just another program. Under normal circumstances, whatever we type on the keyboard is sent to the shell on its standard input, and whatever it produces on standard output is displayed on our screen. When we tell the shell to run a program, it creates a new process and temporarily sends whatever we type on our keyboard to that process's standard input, and whatever the process sends to standard output to the screen.

Here's what happens when we run wc -l *.pdb > lengths.txt. The shell starts by telling the computer to create a new process to run the wc program. Since we've provided some filenames as parameters, wc reads from them instead of from standard input. And since we've used > to redirect output to a file, the shell connects the process's standard output to that file.

If we run wc -l *.pdb | sort -n instead, the shell creates two processes (one for each command in the pipe) so that wc and sort run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there's no redirection with >, sort's output goes to the screen. And if we run wc -l *.pdb | sort -n | head -1, we get three processes with data flowing from the files, through wc to sort, and from sort through head to the screen.

This simple idea is why Unix has been so successful. Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that each do one job well, and that work well with each other. This programming model is called "pipes and filters". We've already seen pipes; a filter is a program like wc or sort that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input, do something with what they've read, and write to standard output.

The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.
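As a sketch of writing your own filter, here is a hypothetical script double.sh: it reads lines from standard input and writes each one twice to standard output, so it can drop straight into a pipeline.

```shell
#!/bin/bash
# double.sh (hypothetical example): read lines from standard input
# and print each line twice to standard output
while read -r line; do
  echo "$line"
  echo "$line"
done
```

Because it reads stdin and writes stdout, something like printf 'a\nb\n' | bash double.sh | wc -l would report 4 lines.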

Overwriting Output vs. Concatenating Output

Using > to redirect a program's output will overwrite any existing file with the same name. To avoid this, you can use the >> operator, which appends the output to the end of the file if it already exists (and creates it if it doesn't).
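A small demonstration, again with a throwaway file:

```shell
echo "first"  >> log.txt
echo "second" >> log.txt
cat log.txt    # both lines are kept, in order: first, then second
```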

Wget

There are two common tools that let you download files over HTTP, HTTPS, and FTP (i.e. from the internet): one is wget, the other is curl. We'll mostly use wget in this course.

$ wget http://www.hardwoodgenomics.org/sites/default/files/gSSRs/white_oak_hqSSR_seqs.fasta

Man pages

If you need more information about a command (what it does, what parameters it takes, what additional functionality it offers), use the man command. man stands for "manual".

$ man wget

To exit, type the letter q. To scroll a single line use enter or the arrow keys. Also:

space bar = forward one page

f = forward one page

b = back one page

h = display help page

Search engines such as Google are another resource for looking up commands. ("Google Fu").

less

Another way to view a file other than cat is the command less. This works a lot like the man pages. Try out

$ less white_oak_hqSSR_seqs.fasta

The same commands (q to exit, f for forward one page, b for back one page, etc) will work while using less.

File Ownership and Permissions

All files and directories have an owner and a group. When you create a file or make a copy of a file to your own workspace, you are (usually) the owner. You can check this with ls -l.

To view permissions, you can also use the information from ls -l. The first character of each line indicates the type (- for a regular file, d for a directory); the next nine characters give the permissions, such as 'rwxrwxrwx' or 'r--r--r--'. The first three indicate permissions for the file owner (r for read, w for write, and x for execute). The next three indicate the permissions for the group, and the final three for the world.

Let's add write permissions to the pdb files.

$ chmod a+w *pdb
$ ls -l

This is dangerous, as now anyone on newton could overwrite your files! Let's remove the write permissions again.

$ chmod a-w *pdb
$ ls -l

If you need to edit a file, you can give only yourself (not everyone) write permissions.

$ chmod u+w cubane.pdb
$ ls -l

Attribution

This material is made available under the Creative Commons Attribution license. Much of the material has been adapted from lessons by Software Carpentry.
