
pandas notes


DataFrames and Series

Basics

Column names

  • Get list of column names

      my_list = df.columns.values.tolist()
    

Data frame size

  • use data_frame.count()

    • count() on a DataFrame returns a Series with one count per column, where the index for each value is the column name. The count is of non-NA rows within each column, so different columns can have different counts; don't use it as a proxy for the number of rows in a DataFrame unless you know a particular column has no missing values. Example:

        # get count() Series.
        total_count = test_df.count()
      
        # get count for column "hooha"
        total_count = total_count[ "hooha" ]
      
        # and convert it to an int.
        total_count = int( total_count )
      
  • use len( data_frame ) - gives the number of rows in the data frame, including rows with NA values.

Descriptive statistics - describe()
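describe() returns summary statistics ( count, mean, std, min, quartiles, max ) for each numeric column. A minimal sketch, assuming a DataFrame df with a numeric "image_count" column:

# entire DataFrame - returns a DataFrame of statistics, one column per numeric column.
stats_df = df.describe()

# single column - returns a Series of statistics.
stats_series = df[ "image_count" ].describe()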

square brackets

whenever you see square brackets used to reference one or more column names, whether directly on a DataFrame or after a "loc" or "iloc" call:

  • If you see one set of square brackets ( "[]" ) around a single column name, the result will be a Series (a vector), not a one-column DataFrame. Example:

      individual_period_info_df[ 'image_count' ]
    
  • If you see two sets of square brackets ( "[[]]" ), the inner brackets are actually a Python list of column names, and the result will be a DataFrame.

      desc_df[[ "image_count", "ocr_count" ]]
      desc_df[[ "image_count" ]]               # one-column DataFrame, not a series.
      
      # to then get series for that one column
      single_column_df = desc_df[[ "image_count" ]]
      series = single_column_df[ "image_count" ]
    
      # OR
      series = desc_df[ "image_count" ]
    
      # NOTE: you can't do this ( it raises a KeyError ):
      series = desc_df[ "image_count", "ocr_count" ]
    

DataFrame creation

Create DataFrame from list of dictionaries

To create a DataFrame from a list of dictionaries, pass the list to the DataFrame constructor:

individual_period_info_df = pandas.DataFrame( individual_period_info_list )

A column will be created for every key that appears in any of the dictionaries. If a particular key is only in some dictionaries, the column is still created, and rows for dictionaries that don't contain it are set to NaN.
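For example (a minimal sketch; the list and column names are hypothetical):

record_list = [
    { "name" : "a", "image_count" : 1 },
    { "name" : "b" },  # no "image_count" key
]
example_df = pandas.DataFrame( record_list )
# example_df[ "image_count" ] is 1.0 for "a" and NaN for "b".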

DataFrame column selection

To select a single column from a DataFrame (example: "image_count" column):

image_count_column = desc_df.image_count              # as Series
image_count_column = desc_df.loc[ :, "image_count" ]  # as Series
image_count_column = desc_df[ "image_count" ]         # as Series
image_count_df = desc_df[[ "image_count" ]]           # as DataFrame (2 square-brackets)

# then you can call functions on them:
image_count_mean = image_count_column.mean()     # returns numeric value.
mean_series = image_count_df.mean()              # returns Series of mean values, with row index telling which column each mean value belongs to.
image_count_mean = mean_series[ "image_count" ]  # get mean for column "image_count" from the mean Series.

To select multiple columns from a DataFrame, stored as a DataFrame, pass a list of column names to the "[]" operator or to ".loc":
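image_counts_df = desc_df[ [ "image_count", "ocr_count" ] ]         # as DataFrame
image_counts_df = desc_df.loc[ :, [ "image_count", "ocr_count" ] ]  # as DataFrame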

More information on working with column values (e.g. finding the max/min of individual columns): https://stackoverflow.com/questions/47006617/finding-max-min-value-of-individual-columns

DataFrame subsetting

  • retrieve rows X through Y of a data frame - use .iloc[] and slice notation (0-indexed, start is included, end is excluded):

    • for example, retrieve rows 1 through 5000 of temp_df into test_df:

        test_df = temp_df.iloc[ 0 : 5000 ]
      

DataFrame filtering

  • To find rows in a DataFrame where a certain column = a certain value:

      temp_df = csv_df[ csv_df[ "column_name" ] == "column_value" ]
    
  • To filter using a regular expression:

      temp_df = csv_df[ csv_df[ "column_name" ].str.contains( r"<regex>" ) == True ]
      # the "== True" treats rows where str.contains() returns NaN
      #     ( from missing values in the column ) as False.
    
    • example - find all rows with a date string patterned on "MMM-YY" ("May-01"):

        temp_df = csv_df[ csv_df[ "ici_start_date" ].str.contains( r"^[a-z,A-Z]+-\d+$" ) == True ]
      
  • use "ampersand" to join conditions, not "and".

  • just get specific columns:

    • Pass the "[]" operator a list of column names.  So, to retrieve just columns "A", "B", and "C" from DataFrame example_df:

        temp_df = example_df[ [ "A", "B", "C", ] ]
      
  • Get a single column:

    • include column name in "[]" operator.  So, to retrieve the values for column "A" from DataFrame example_df:

        column_values = example_df[ "A" ]
      
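A minimal sketch of joining filter conditions (the column names are hypothetical):

# "&" joins conditions with AND; each condition needs its own parentheses.
temp_df = csv_df[ ( csv_df[ "column_a" ] == "some_value" ) & ( csv_df[ "column_b" ] > 0 ) ]

# "|" is OR, "~" is NOT.
temp_df = csv_df[ ( csv_df[ "column_a" ] == "some_value" ) | ~( csv_df[ "column_b" ] > 0 ) ]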

Combine DataFrames
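Two common approaches, concat() to stack frames and merge() for SQL-style joins (a minimal sketch; the DataFrame names and key column are hypothetical):

# stack the rows of df_a and df_b ( same columns ):
combined_df = pandas.concat( [ df_a, df_b ], ignore_index = True )

# SQL-style join of df_a and df_b on a shared key column:
merged_df = df_a.merge( df_b, on = "key_column", how = "left" )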

Add an empty column

To add an empty column, inside square brackets, reference a column name that doesn't already exist and set it to a value:

df[ "<column_name>" ] = ""
df[ "<column_name>" ] = np.nan
# ... etc.

Drop a column in a DataFrame

  • The best way to do this in pandas is to use drop:

      df = df.drop( 'column_name', axis = 1 )
    
  • where axis is 0 for rows and 1 for columns. Recent versions of pandas (2.0+) no longer accept the axis as a positional argument, so name it explicitly.

  • To delete the column without having to reassign df you can do:

      df.drop( 'column_name', axis = 1, inplace = True )
    
  • Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:

      df.drop( df.columns[ [ 0, 1, 3 ] ], axis = 1 )  # df.columns is zero-based pd.Index
    
  • from: http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe#13485766

Rename a column in a DataFrame
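A minimal sketch using DataFrame.rename() (the column names are hypothetical):

df = df.rename( columns = { "old_name" : "new_name" } )

# or, rename in place:
df.rename( columns = { "old_name" : "new_name" }, inplace = True )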

Selecting Rows that contain NaN or None (in Object dtype)

# replace any NaN/None with "-1" ( fillna() returns a new Series, so assign the result back ).
compare_values_1 = compare_values_1.fillna( "-1" )

# or, assign through a boolean mask of the null rows:
compare_values_1[ compare_values_1.isnull() ] = "-1"

Random sample from DataFrame

Basic random sample from DataFrame can use ".sample( n = <sample_size> )":

current_bin_sample_df = current_bin_df.sample( n = sample_this_many )

Can also sample a percentage of the total size of the data frame using "frac" parameter (50% sample):

current_bin_sample_df = current_bin_df.sample( frac = 0.5 )


Using iloc

  • iloc[] lets you select rows from a DataFrame or Series by integer position (0-based), regardless of the labels in the structure's index.  You can ask for a particular row by passing its position.

  • use square brackets, not parentheses, to hold arguments to iloc (example of passing an index).

    • Yes:

        user_id = reliability_names_df_IN[ column_name ].iloc[ 0 ]
      
    • No:

        user_id = reliability_names_df_IN[ column_name ].iloc( 0 )
      

Using loc

  • loc[] lets you select rows in a DataFrame or Series by index label, or filter them by a condition on values within the structure.  It accepts two arguments:

    • specification of which row(s) you want to target.  This could be a list of index labels, a Python slice, or a conditional statement like you'd pass to an if statement.
    • specification of which column(s) within matching rows you want to target.  This could be a list of column names, or a boolean mask over the columns.
  • Use ".notnull()" to find just rows where a given column is not null:

      df.loc[ df[ "column_name" ].notnull() ]
    
  • Just columns 1, 2, and 3 where a given column is not null:

      df.loc[ df[ "column_name" ].notnull(), [ "column1", "column2", "column3" ] ]
    
  • Use ".isnull()" to find rows where a column's value is null:

      df.loc[ df[ "column_name" ].isnull() ]
    
  • example column "code_value", using results of .loc[] to update value of that column:

      df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
    
    • converts column code_value to int

        df[ "code_value" ].astype( int )
      
    • filters rows to only those where the value is <= 0:

        df[ "code_value" ].astype( int ) <= 0
      
    • after a comma, specify the columns you want to interact with in rows that fit your filter criteria (just "code_value", same column we are filtering on):

        df[ "code_value" ].astype( int ) <= 0, [ "code_value" ]
      
    • pass this all to .loc[] so it will do all the filtering of rows and targeting of columns:

        df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ]
      
    • then, assign "new_value" to the result of the .loc[] (the column "code_value" in each matching row):

        df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
      
  • another example of assigning a value to the filtering that loc lets you do:

      cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"
    
  • add a column:

      df.loc[ :, "column_name" ] = column_values
    
  • pandas doc: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label

Replace

# replace all values of the string "nan" with "" in a column.
#     ( assigning the result back is more reliable than "inplace = True"
#     on a column selection, which can act on a copy. )
df[ column_name ] = df[ column_name ].replace( to_replace = "nan", value = "" )

Replace with regex

# For all values that start with "b'" and end with "'", strip
#     that stuff off ( "regex = True" enables regular expressions ).
df[ column_name ] = df[ column_name ].replace( to_replace = r"^b'(.*)'$", value = r"\1", regex = True )

Replace using loc

  • assigning a value to the filtering that loc lets you do:

      cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"
    
  • See "Using loc" above for more details. 

Import/export

DataFrame.to_csv()

df.to_csv( file_path, index = False, encoding = "utf-8" )
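To read the CSV back into a DataFrame (a minimal sketch; file_path is the path used above):

df = pandas.read_csv( file_path, encoding = "utf-8" )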

Options for outputting CSV to table

Use psycopg2 copy_from:

import io

# assumes sqlalchemy_engine and table_name are already defined.
sqlalchemy_conn = sqlalchemy_engine.raw_connection()
sqlalchemy_cursor = sqlalchemy_conn.cursor()

# make string buffer
df_buf = io.StringIO()

# store data frame there as CSV ( no index or header row, since
#     copy_from() expects raw data rows only ).
resdf.to_csv( df_buf, index = False, header = False )

# reset buffer to the beginning
df_buf.seek( 0 )

# write to table.
sqlalchemy_cursor.copy_from( df_buf, table_name, sep = "," )
sqlalchemy_conn.commit()  

Create INSERTS:

# resdf is the dataframe; assumes "import psycopg2.extras" has been run
#     and sqlalchemy_engine and table_name are defined.
if len( resdf ) > 0:

    df_columns = list( resdf )

    # create "col1,col2,..."
    columns = ",".join( df_columns )

    # create VALUES( '%s', '%s', ... ) one '%s' per column
    values = "VALUES( {} )".format( ",".join( [ "%s" for _ in df_columns ] ) )

    # create INSERT INTO table ( columns ) VALUES( '%s', ... )
    insert_stmt = "INSERT INTO {} ({}) {}".format( table_name, columns, values )

    print( "INSERT statement: " + insert_stmt )

    sqlalchemy_conn = sqlalchemy_engine.raw_connection()
    sqlalchemy_cursor = sqlalchemy_conn.cursor()
    psycopg2.extras.execute_batch( sqlalchemy_cursor, insert_stmt, resdf.values )
    sqlalchemy_conn.commit()
    sqlalchemy_cursor.close()

#-- END check if anything in data frame --#

Parsing strings

  • How to do left(), right(), and mid() in pandas:

  • to split a date string (MMDDYYYY) out into separate date part columns:

    • If column is an integer, convert it to string:

        temp_series = df[ "<column_name>" ].apply( str )
        # OR slower...
        temp_series = df[ "<column_name>" ].astype( str )
      
    • zero-pad to 8 digits:

        temp_series = temp_series.str.zfill( 8 )
      
    • parse out date parts:

      • month - left()-most 2 characters: temp_series.str[ : 2 ]
      • day - middle 2 characters ( 0-indexed positions 2 and 3 ): temp_series.str[ 2 : 4 ]
      • year - right()-most 4 characters: temp_series.str[ -4 : ]
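Putting the steps together (a minimal sketch; "date_int" is a hypothetical MMDDYYYY integer column):

# convert to string and zero-pad to 8 digits.
temp_series = df[ "date_int" ].apply( str ).str.zfill( 8 )

# parse out date parts into new columns.
df[ "month" ] = temp_series.str[ : 2 ]
df[ "day" ] = temp_series.str[ 2 : 4 ]
df[ "year" ] = temp_series.str[ -4 : ]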

Performance

dask

pandas and large data sets
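One common pattern for data sets too large for memory is reading a CSV in chunks (a minimal sketch; the file name and chunk size are hypothetical):

# read 100,000 rows at a time; each chunk is its own DataFrame.
for chunk_df in pandas.read_csv( "big_file.csv", chunksize = 100000 ):

    # process each chunk here.
    print( len( chunk_df ) )

#-- END loop over chunks --#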

function cleanData()

def cleanData( data_frame_IN, column_name_IN, filter_value_IN ):

    """
    Parameters
    ----------
    - data_frame_IN : A pandas DataFrame
    - column_name_IN : Name of the column on the dataframe
    - filter_value_IN : The value that causes rows to be filtered out of data_frame_IN
           if it is present in the column named column_name_IN.

    Returns
    -------
    - cleaned_data_OUT : A Pandas DataFrame containing only rows that did not have the
                specified value in the specified column
    """

    # return reference
    cleaned_data_OUT = None

    # check to make sure that column name is in data frame's list of column names.
    if ( column_name_IN not in list( data_frame_IN.columns.values ) ):

        print( "ERROR : Column you specified not present in the dataframe" )
        cleaned_data_OUT = None

    elif ( str( filter_value_IN ).upper() == "NULL" ):

        # keep rows where the column passed in is not NULL.
        cleaned_data_OUT = data_frame_IN[ pandas.notnull( data_frame_IN[ column_name_IN ] ) == True ]

    else:

        # keep rows where the column passed in does not equal filter_value_IN.
        cleaned_data_OUT = data_frame_IN[ data_frame_IN[ column_name_IN ] != filter_value_IN ]

    #-- END check for missing column and "NULL" --#

    return cleaned_data_OUT

#-- END function cleanData() --#

print( "function cleanData() defined at \" + str( datetime.datetime.now() ) )

Dummy-coding data
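Within pandas itself, get_dummies() does dummy/one-hot coding (a minimal sketch; the column name is hypothetical):

# one 0/1 indicator column per distinct value in "org_dept".
dummies_df = pandas.get_dummies( df[ "org_dept" ], prefix = "org_dept" )

# drop_first = True omits one level, for regression-style dummy coding.
dummies_df = pandas.get_dummies( df[ "org_dept" ], prefix = "org_dept", drop_first = True )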

With scikit-learn

Cheat Sheet

Examples

# CONSTANTS-ish
INFO_COUNT = "count"
INFO_MIN = "min"
INFO_MAX = "max"
INFO_MEAN = "mean"
INFO_MEDIAN = "median"
INFO_STD = "std"
INFO_DESCRIBE = "describe"

def get_variable_info( df_IN, var_name_IN, debug_flag_IN = False ):
    
    # return reference
    info_OUT = None

    # declare variables
    me = "get_variable_info"
    debug_flag = None
    df_info_dict = None
    my_series = None
    my_count = None
    my_min = None
    my_max = None
    my_mean = None
    my_median = None
    my_std = None
    my_describe = None
    
    # init
    debug_flag = debug_flag_IN

    # get column as Series
    my_series = df_IN[ var_name_IN ]

    # retrieve information
    my_count = my_series.count()
    my_min = my_series.min()
    my_max = my_series.max()
    my_mean = my_series.mean()
    my_median = my_series.median()
    my_std = my_series.std()
    my_describe = my_series.describe()

    # create dictionary
    df_info_dict = {}
    df_info_dict[ INFO_COUNT ] = my_count
    df_info_dict[ INFO_MIN ] = my_min
    df_info_dict[ INFO_MAX ] = my_max
    df_info_dict[ INFO_MEAN ] = my_mean
    df_info_dict[ INFO_MEDIAN ] = my_median
    df_info_dict[ INFO_STD ] = my_std
    df_info_dict[ INFO_DESCRIBE ] = my_describe

    if ( debug_flag == True ):
    
        # render output
        status_string = "- count: {my_count}\n- min: {my_min}\n- max: {my_max}\n- mean: {my_mean}\n- median: {my_median}\n- std: {my_std}".format(
            my_count = df_info_dict.get( INFO_COUNT ),
            my_min = df_info_dict.get( INFO_MIN ),
            my_max = df_info_dict.get( INFO_MAX ),
            my_mean = df_info_dict.get( INFO_MEAN ),
            my_median = df_info_dict.get( INFO_MEDIAN ),
            my_std = df_info_dict.get( INFO_STD )
        )

        print( status_string )
        
    #-- END debug --#
    
    info_OUT = df_info_dict
    return info_OUT
    
#-- END function get_variable_info() --#

print( "function get_variable_info() defined at {}".format( datetime.datetime.now() ) )

# CONSTANTS-ish
ROUND_UP = "up"
ROUND_DOWN = "down"

# loop, grabbing subsequent data frame subsets of size bin_size
#     each time through, until last bin, then grab from offset
#     to the end.

# NOTE: assumes "import math" and "import pandas" have already been run.
def process_bins( df_IN, bin_count_IN, sample_size_IN, do_round_IN = ROUND_UP, debug_flag_IN = False ):

    # return reference
    sample_df_OUT = None
    
    # declare variables
    me = "process_bins"
    debug_flag = None
    status_message = None
    
    # declare variables - bin size
    df_row_count = None
    bin_count = None
    bin_size_exact = None
    bin_size = None
    do_round = None
    
    # declare variables - sample per bin
    sample_size = None
    sample_per_bin_exact = None
    sample_per_bin = None
    
    # declare variables - process bins
    bin_index = None
    bin_number = None
    df_offset = None
    df_start_index = None
    df_end_index = None
    current_bin_df = None
    current_bin_info = None
    sample_count = None
    sample_this_many = None
    current_bin_sample_df = None

    # init
    debug_flag = debug_flag_IN
    df_row_count = len( df_IN )
    bin_count = bin_count_IN
    sample_size = sample_size_IN
    do_round = do_round_IN
    sample_count = 0

    #==========================================================================#
    # bin size
    #==========================================================================#

    # divide the row count by bin_count to get bin size
    bin_size_exact = df_row_count / bin_count

    # round up or round down?
    if ( do_round == ROUND_UP ):

        # use math.ceil() to round up.
        bin_size = math.ceil( bin_size_exact )

    else:

        # either ROUND_DOWN or none set.
        bin_size = math.floor( bin_size_exact )

    #-- END check to see how we round. --#

    # finally, convert to integer.
    bin_size = int( bin_size )

    if ( debug_flag == True ):
        status_message = "break {row_count} rows into {bin_count} bins: bin size = {bin_size} ( exact: {exact_size} )".format(
            row_count = df_row_count,
            bin_count = bin_count,
            bin_size = bin_size,
            exact_size = bin_size_exact
        )
        print( status_message )
    #-- END DEBUG --#
    
    #==========================================================================#
    # sample per bin
    #==========================================================================#
    
    # divide sample size by bin count to get number we select per bin
    sample_per_bin_exact = sample_size / bin_count
    
    # for decimals, round down, so we oversample from last bin (outliers).
    sample_per_bin = math.floor( sample_per_bin_exact )

    # finally, convert to integer.
    sample_per_bin = int( sample_per_bin )

    if ( debug_flag == True ):
        status_message = "sample {per_bin} rows from each of {bin_count} bins ( exact: {exact_size} )".format(
            per_bin = sample_per_bin,
            bin_count = bin_count,
            exact_size = sample_per_bin_exact
        )
        print( status_message )
    #-- END DEBUG --#

    #==========================================================================#
    # process bins
    #==========================================================================#

    # loop over bins
    sample_count = 0
    for bin_index in range( 0, bin_count ):

        # init - calculate offset
        df_offset = bin_index * bin_size
        df_start_index = df_offset
        df_end_index = df_offset + bin_size

        # are we at the end?
        bin_number = bin_index + 1

        if ( debug_flag == True ):
            status_message = "\n\n----> BIN #{bin_number} - offset: {my_offset}; start: {start_index}; end: {end_index}; bin index: {my_index}".format(
                bin_number = bin_number,
                my_offset = df_offset,
                start_index = df_start_index,
                end_index = df_end_index,
                my_index = bin_index
            )
            print( status_message )
        #-- END DEBUG --#

        if ( bin_number == bin_count ):

            # last bin - get from offset to the end.
            current_bin_df = df_IN[ df_start_index : ]
            
            # sample as many as needed to get to total.
            sample_this_many = sample_size - sample_count

        else:

            # not last bin - get next bin_size past offset.
            current_bin_df = df_IN[ df_start_index : df_end_index ]
            
            # sample as many as needed from current bin...
            sample_this_many = sample_per_bin
            
        #-- END check if last bin or not --#

        # sample
        current_bin_sample_df = current_bin_df.sample( n = sample_this_many )
        print( "\n\nSample from bin #{}".format( bin_number ) )
        print( current_bin_sample_df )
        
        # first bin?
        if ( sample_df_OUT is None ):
            
            # first bin - just store the sample DataFrame
            sample_df_OUT = current_bin_sample_df
            
        else:
            
            # append sample DataFrame to output DataFrame
            #     ( DataFrame.append() was removed in pandas 2.0, so use concat() ).
            sample_df_OUT = pandas.concat( [ sample_df_OUT, current_bin_sample_df ], ignore_index = True )
            
        #-- END check if first bin or not. --#
        
        # update total.
        sample_count += sample_this_many
        
    #-- END loop over bins --#
    
    return sample_df_OUT
    
#-- END function process_bins() --#

print( "function process_bins() defined at {}".format( datetime.datetime.now() ) )