pandas notes
- indexing: http://pandas.pydata.org/pandas-docs/stable/indexing.html
- good examples: https://github.com/CSSIP-AIR/Big-Data-Workbooks/blob/master/07.%20Machine%20Learning/Machine%20Learning.ipynb
- find unique values in a column:
# List unique values in the df[ 'name' ] column
df.name.unique()
- Get list of column names:
my_list = df.columns.values.tolist()
- use data_frame.count()
count() on a DataFrame returns a Series with a count for each column, where the index for each value is the column name. The count is of non-NA rows within the column, so you can get different counts per column; don't use this as a proxy for the number of rows in a DataFrame unless you know a particular column has no missing values. Example:
# get count() Series.
total_count = test_df.count()
# get count for column "hooha"
total_count = total_count[ "hooha" ]
# and convert it to an int.
total_count = int( total_count )
- use len( data_frame ) - gives the number of rows in the data frame.
How to pull individual items out of "describe()" output - options:
- if you describe() a DataFrame, you'll get back a DataFrame with column names as the column index, and the names of the descriptive statistics as the row indexes. For example:
In [12]: individual_period_info_df.describe()
Out[12]:
        image_count     ocr_count
count     76.000000     76.000000
mean    2034.710526  10124.578947
std      326.577354    376.020292
min     1404.000000   7401.000000
25%     1912.750000  10078.750000
50%     2022.500000  10206.500000
75%     2095.000000  10281.250000
max     4431.000000  10416.000000
# can describe one column and have it come back as a DataFrame, also, if you use two square brackets when selecting the column:
desc_output_df = individual_period_info_df[[ 'image_count' ]].describe()
More details: see the square-brackets notes below.
To get "
image_count
" mean:desc_df = individual_period_info_df.describe() image_count_column = desc_df[ "image_count" ] image_count_column = desc_df.loc[ :, "image_count" ] image_count_mean = image_count_column [ "mean" ]
- if you describe() a Series, you'll get back a Series with no column index, and the names of descriptive statistics as row indexes. For example:
In [13]: individual_period_info_df[ 'image_count' ].describe()
Out[13]:
count      76.000000
mean     2034.710526
std       326.577354
min      1404.000000
25%      1912.750000
50%      2022.500000
75%      2095.000000
max      4431.000000
Name: image_count, dtype: float64
To get "
image_count
" mean:desc_series = individual_period_info_df[ 'image_count' ].describe() image_count_mean = desc_series[ "mean" ]
documentation:
- describe() function doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
- statistics: https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html
- column mean!: https://stackoverflow.com/questions/31037298/pandas-get-column-average-mean
Whenever you see square brackets used to reference one or more column names, or after a "loc" or "iloc" call:
- If you see one set of square brackets ( "[]" ), whatever you are doing will result in a Series (a vector), not a one-column data frame. Example:
individual_period_info_df[ 'image_count' ]
- If you see two sets of square brackets ( "[[]]" ), whatever you are doing will result in a DataFrame.
desc_df[[ "image_count", "ocr_count" ]]
desc_df[[ "image_count" ]] # one-column DataFrame, not a Series.
# to then get a Series for that one column
single_column_df = desc_df[[ "image_count" ]]
series = single_column_df[ "image_count" ]
# OR
series = desc_df[ "image_count" ]
# NOTE: you can't do this:
series = desc_df[ "image_count", "ocr_count" ]
To create a DataFrame from a list of dictionaries, pass the list to the DataFrame constructor:
individual_period_info_df = pandas.DataFrame( individual_period_info_list )
A column will be created for every label across all the dictionaries. If a particular label is only in some dictionaries, the column will still be added to the DataFrame; rows from dictionaries that don't contain it will have that column set to NaN.
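A minimal sketch of that fill behavior (the dictionaries and keys here are hypothetical):
import pandas
record_list = [
    { "name": "a", "image_count": 10 },
    { "name": "b", "ocr_count": 20 }
]
example_df = pandas.DataFrame( record_list )
# example_df has columns name, image_count, ocr_count;
# each row gets NaN in the columns its dictionary lacked.
print( example_df )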
To select a single column from a DataFrame (example: "image_count" column):
image_count_column = desc_df.image_count # as Series
image_count_column = desc_df.loc[ :, "image_count" ] # as Series
image_count_column = desc_df[ "image_count" ] # as Series
image_count_df = desc_df[[ "image_count" ]] # as DataFrame (2 square-brackets)
# then you can call functions on them:
image_count_mean = image_count_column.mean() # returns numeric value.
mean_series = image_count_df.mean() # returns Series of mean values, with row index telling which column each mean value belongs to.
image_count_mean = mean_series[ "image_count" ] # get mean for column "image_count" from the mean Series.
To select multiple columns from a DataFrame, stored as a DataFrame:
- desc_df[[ 'image_count', 'ocr_count' ]] # must use 2 square brackets, to get a data frame.
working with column values: https://stackoverflow.com/questions/47006617/finding-max-min-value-of-individual-columns
- retrieve rows X through Y of a data frame - use .iloc[] and slice notation (0-indexed, start is included, end is excluded):
- for example, retrieve rows 1 through 5000 of temp_df into test_df:
test_df = temp_df.iloc[ 0 : 5000 ]
- To find rows in a DataFrame where a certain column = a certain value:
temp_df = csv_df[ csv_df[ "column_name" ] == "column_value" ]
- To filter using a regular expression (the "== True" also drops rows where str.contains() returns NaN for missing values; passing na = False does the same thing):
temp_df = csv_df[ csv_df[ "column_name" ].str.contains( r"<regex>" ) == True ]
- example - find all rows with a date string patterned on "MMM-YY" ("May-01"):
temp_df = csv_df[ csv_df[ "ici_start_date" ].str.contains( r"^[a-zA-Z]+-\d+$" ) == True ]
- use "ampersand" ( & ) to join conditions, not "and". Python's "and" tries to collapse each Series to a single boolean, which raises "ValueError: The truth value of a Series is ambiguous"; "&" is the element-wise operator. Wrap each condition in parentheses, since "&" binds more tightly than comparisons.
- This works:
person_df_lookup = person_df_IN[ ( person_df_IN[ column_name_person_id_1 ] > 0 ) & ( person_df_IN[ column_name_person_id_2 ] > 0 ) ]
- this does not:
person_df_lookup = person_df_IN[ ( person_df_IN[ column_name_person_id_1 ] > 0 ) and ( person_df_IN[ column_name_person_id_2 ] > 0 ) ]
- just get specific columns:
- Pass the "[]" operator a list of column names. So, to retrieve just columns "A", "B", and "C" from DataFrame example_df:
temp_df = example_df[ [ "A", "B", "C" ] ]
- Get a single column:
- include the column name in the "[]" operator. So, to retrieve the values for column "A" from DataFrame example_df:
column_values = example_df[ "A" ]
More information:
- use append() to tack rows onto the end of an existing DataFrame (note: DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0 - use concat() below on current versions):
df.append( df2, ignore_index = True )
- Keeps all columns present in either data frame.
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
- OR concat():
df_merged = pandas.concat( [ df1, df2 ], ignore_index = True )
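A minimal sketch of the column-union behavior (df1 and df2 contents are hypothetical):
import pandas
df1 = pandas.DataFrame( { "A": [ 1, 2 ] } )
df2 = pandas.DataFrame( { "A": [ 3 ], "B": [ "x" ] } )
df_merged = pandas.concat( [ df1, df2 ], ignore_index = True )
# result has columns A and B; rows from df1 get NaN in B,
# and the index runs 0..2 with no repeats because of ignore_index.
print( df_merged )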
- once you are done combining, you can reset indices to match the order of rows in the DataFrame using reset_index( inplace = True ). By default, this moves the existing index into a column (named "index" for a plain index, or "level_0", "level_1", etc. for a MultiIndex), then creates a new index from 0 to row count - 1, with no holes or inconsistencies. Pass drop = True to discard the old index instead of keeping it as a column. Example:
volume_df.reset_index( inplace = True )
- more ways to merge and combine: http://pandas.pydata.org/pandas-docs/stable/merging.html
To add an empty column, inside square brackets, reference a column name that doesn't already exist and set it to a value:
df[ "<column_name>" ] = ""
df[ "<column_name>" ] = np.nan # np = numpy
# ... etc.
To delete a column, the best way in pandas is to use drop:
df = df.drop( 'column_name', axis = 1 )
where axis is the axis number (0 for rows and 1 for columns). Note: newer pandas versions require axis as a keyword argument; the old positional form df.drop( 'column_name', 1 ) no longer works.
- To delete the column without having to reassign df, you can do:
df.drop( 'column_name', axis = 1, inplace = True )
- Finally, to drop by column number instead of by column label, try this to delete, e.g., the 1st, 2nd and 4th columns:
df.drop( df.columns[ [ 0, 1, 3 ] ], axis = 1 ) # df.columns is a zero-based pd.Index
- from: http://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe#13485766
- Use the rename function and refer to the columns to be renamed. Not all the columns have to be renamed:
df = df.rename( columns = { 'oldName1': 'newName1', 'oldName2': 'newName2' } )
# OR
df.rename( columns = { 'oldName1': 'newName1', 'oldName2': 'newName2' }, inplace = True )
- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
- from http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas#11354850
# convert any None/NaN to "-1" - note that fillna() returns a new Series,
# so assign it back (or pass inplace = True):
compare_values_1 = compare_values_1.fillna( "-1" )
# OR assign through a boolean mask:
compare_values_1[ compare_values_1.isnull() ] = "-1"
Basic random sample from a DataFrame can use ".sample( n = <sample_size> )":
current_bin_sample_df = current_bin_df.sample( n = sample_this_many )
Can also sample a percentage of the total size of the data frame using "frac" parameter (50% sample):
current_bin_sample_df = current_bin_df.sample( frac = 0.5 )
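sample() also accepts a random_state parameter for reproducible draws - a minimal sketch (the n and seed values are arbitrary):
# the same rows come back every run when random_state is fixed.
repeatable_sample_df = current_bin_df.sample( n = 10, random_state = 42 )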
More info:
- https://www.geeksforgeeks.org/how-to-randomly-select-rows-from-pandas-dataframe/
- https://stackoverflow.com/questions/15923826/random-row-selection-in-pandas-dataframe
- iloc[] lets you select rows from a DataFrame or Series by integer position (not by index label). You can ask for a particular row by passing its position.
- use square brackets, not parentheses, to hold arguments to iloc (example of passing a position):
- Yes:
user_id = reliability_names_df_IN[ column_name ].iloc[ 0 ]
- No:
user_id = reliability_names_df_IN[ column_name ].iloc( 0 )
- loc[] lets you filter rows in a DataFrame or Series on values within the structure. It accepts two arguments:
- specification of which row(s) you want to target. This could be a list of index labels, a Python slice, or a boolean condition like you'd pass to an if statement.
- specification of which column(s) within matching rows you want to target. This could be a list of column names, or a boolean condition (though I'm honestly not sure how a filter here would work).
- Use ".notnull()" to find just rows where a given column is not null:
df.loc[ df[ "column_name" ].notnull() ]
- Just columns 1, 2, and 3 where a given column is not null:
df.loc[ df[ "column_name" ].notnull(), [ "column1", "column2", "column3" ] ]
- Use ".isnull()" to find rows where a column's value is null:
df.loc[ df[ "column_name" ].isnull() ]
- example - column "code_value", using results of .loc[] to update the value of that column:
df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
- converts column code_value to int:
df[ "code_value" ].astype( int )
- filters rows to only those where the value is <= 0:
df[ "code_value" ].astype( int ) <= 0
- after a comma, specify the columns you want to interact with in rows that fit your filter criteria (just "code_value", the same column we are filtering on):
df[ "code_value" ].astype( int ) <= 0, [ "code_value" ]
- pass this all to .loc[] so it will do all the filtering of rows and targeting of columns:
df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ]
- then, assign "new_value" to the result of the .loc[] (the column "code_value" in each matching row):
df.loc[ df[ "code_value" ].astype( int ) <= 0, [ "code_value" ] ] = "new_value"
- another example of assigning a value to the filtering that loc lets you do:
cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"
- add a column:
df.loc[ :, "column_name" ] = column_values
- pandas doc: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label
# replace all rows with "nan" with ""
df[ column_name ].replace( to_replace = "nan", value = "", inplace = True )
# For all values that start with "b'" and end with "'", strip
# that stuff off.
df[ column_name ].replace( to_replace = r"^b'(.*)'$", value = r"\1", regex = True, inplace = True )
- assigning a value to the filtering that loc lets you do:
cleaned_data_frame.loc[ cleaned_data_frame[ "org_dept" ] == "ANESTHESIOLOGY", [ 'org_dept' ] ] = "MEDICINE"
- See "Using loc" above for more details.
Write a DataFrame out to CSV:
df.to_csv( file_path, index = False, encoding = "utf-8" )
Use psycopg2 copy_from:
import io
sqlalchemy_conn = sqlalchemy_engine.raw_connection()
sqlalchemy_cursor = sqlalchemy_conn.cursor()
# make string buffer
df_buf = io.StringIO()
# store data frame there as CSV - omit index and header row,
# since copy_from would load them as data.
resdf.to_csv( df_buf, index = False, header = False )
# reset buffer to the beginning
df_buf.seek( 0 )
# write to table.
sqlalchemy_cursor.copy_from( df_buf, table_name, sep = "," )
sqlalchemy_conn.commit()
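copy_from splits on the separator only, so values containing commas or newlines will break it. A hedged alternative (reusing the cursor and buffer from above) is copy_expert, which has Postgres parse the buffer with real CSV quoting rules:
# rewind, then let Postgres parse the buffer as CSV.
df_buf.seek( 0 )
copy_sql = "COPY {} FROM STDIN WITH CSV".format( table_name )
sqlalchemy_cursor.copy_expert( copy_sql, df_buf )
sqlalchemy_conn.commit()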
Create INSERTs:
import psycopg2.extras
# df is the dataframe
if len( resdf ) > 0:

    df_columns = list( resdf )

    # create (col1,col2,...)
    columns = ",".join( df_columns )

    # create VALUES( %s, %s, ... ) - one placeholder per column
    values = "VALUES( {} )".format( ",".join( [ "%s" for _ in df_columns ] ) )

    # create INSERT INTO table ( columns ) VALUES( %s, ... )
    insert_stmt = "INSERT INTO {} ({}) {}".format( table_name, columns, values )
    print( "INSERT statement: " + insert_stmt )

    sqlalchemy_conn = sqlalchemy_engine.raw_connection()
    sqlalchemy_cursor = sqlalchemy_conn.cursor()

    # .tolist() converts numpy scalars to native Python types psycopg2 can adapt.
    psycopg2.extras.execute_batch( sqlalchemy_cursor, insert_stmt, resdf.values.tolist() )
    sqlalchemy_conn.commit()
    sqlalchemy_cursor.close()

#-- END check if anything in data frame --#
How to do left(), right(), and mid() in pandas:
- left()-most 2 characters:
df[ 'column_name' ].str[ : 2 ]
- right()-most 4 characters:
df[ 'column_name' ].str[ -4 : ]
- mid() 2 and 3:
df[ 'column_name' ].str[ 2 : 4 ]
- http://stackoverflow.com/questions/20970279/how-to-do-a-left-right-and-mid-of-a-string-in-a-pandas-dataframe#20970328
- to split a date string (MMDDYYYY) out into separate date part columns:
- If the column is an integer, convert it to string:
temp_series = df[ "<column_name>" ].apply( str )
# OR, slower...
temp_series = df[ "<column_name>" ].astype( str )
- zero-pad to 8 digits:
temp_series = temp_series.str.zfill( 8 )
- parse out date parts from the padded string (combined sketch after this list):
- month - left()-most 2 characters: temp_series.str[ : 2 ]
- day - mid() 2 and 3: temp_series.str[ 2 : 4 ]
- year - right()-most 4 characters: temp_series.str[ -4 : ]
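Putting the steps together - a minimal sketch (the column name "date_int" and its values are hypothetical):
import pandas
df = pandas.DataFrame( { "date_int": [ 5012001, 12252000 ] } )
temp_series = df[ "date_int" ].apply( str ).str.zfill( 8 )
df[ "month" ] = temp_series.str[ : 2 ]  # "05", "12"
df[ "day" ] = temp_series.str[ 2 : 4 ]  # "01", "25"
df[ "year" ] = temp_series.str[ -4 : ]  # "2001", "2000"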
- Process large datasets without running out of memory: https://pythonspeed.com/memory/
- alternative to pandas that is not so dependent on memory.
Links:
- Wes McKinney - pandas creator:
- Pandas and Apache Arrow: http://wesmckinney.com/blog/pandas-and-apache-arrow/
- overview of issues with pandas from its creator: http://wesmckinney.com/blog/apache-arrow-pandas-internals/
- using pandas with big data from its creator: https://www.dataquest.io/blog/pandas-big-data/
- apache arrow:
- Dataquest:
- a few basic pandas memory hints: https://stackoverflow.com/questions/31886939/reduce-memory-usage-pandas#31888262
- Not sure how potentially good this is: http://www.hilpisch.com/YH_In_Memory_Analytics_with_pandas.html
- basic tips, including dictionaries over functions for recoding and categorical variables: https://michaelinkles.wordpress.com/2016/04/17/reducing-memory-usage-in-pandas/
- big data with pandas and sqlite: https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/
def cleanData( data_frame_IN, column_name_IN, filter_value_IN ):

    """
    Parameters
    ----------
    - data_frame_IN : A pandas DataFrame
    - column_name_IN : Name of the column on the dataframe
    - filter_value_IN : The value that causes rows to be filtered out of data_frame_IN
        if it is present in the column named column_name_IN (assumed to be a string;
        pass "NULL" to filter out rows where the column is null).

    Returns
    -------
    - cleaned_data_OUT : A pandas DataFrame containing only rows that did not have the
        specified value in the specified column
    """

    # return reference
    cleaned_data_OUT = None

    # check to make sure that column name is in data frame's list of column names.
    if ( column_name_IN not in list( data_frame_IN.columns.values ) ):

        # not there - error out rather than fall through to a KeyError below.
        print( "ERROR: Column you specified not present in the dataframe" )
        cleaned_data_OUT = None

    elif ( filter_value_IN.upper() == "NULL" ):

        # keep rows where column passed in is not NULL.
        cleaned_data_OUT = data_frame_IN[ pandas.notnull( data_frame_IN[ column_name_IN ] ) == True ]

    else:

        # keep rows where column passed in does not contain filter_value_IN.
        cleaned_data_OUT = data_frame_IN[ data_frame_IN[ column_name_IN ] != filter_value_IN ]

    #-- END check for column presence and "NULL" --#

    return cleaned_data_OUT

#-- END function cleanData() --#

print( "function cleanData() defined at " + str( datetime.datetime.now() ) )
- pandas.get_dummies(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
- train-test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# CONSTANTS-ish
INFO_COUNT = "count"
INFO_MIN = "min"
INFO_MAX = "max"
INFO_MEAN = "mean"
INFO_MEDIAN = "median"
INFO_STD = "std"
INFO_DESCRIBE = "describe"
def get_variable_info( df_IN, var_name_IN, debug_flag_IN = False ):

    # return reference
    info_OUT = None

    # declare variables
    me = "get_variable_info"
    debug_flag = None
    df_info_dict = None
    my_series = None
    my_count = None
    my_min = None
    my_max = None
    my_mean = None
    my_median = None
    my_std = None
    my_describe = None

    # init
    debug_flag = debug_flag_IN

    # get column as Series
    my_series = df_IN[ var_name_IN ]

    # retrieve information
    my_count = my_series.count()
    my_min = my_series.min()
    my_max = my_series.max()
    my_mean = my_series.mean()
    my_median = my_series.median()
    my_std = my_series.std()
    my_describe = my_series.describe()

    # create dictionary
    df_info_dict = {}
    df_info_dict[ INFO_COUNT ] = my_count
    df_info_dict[ INFO_MIN ] = my_min
    df_info_dict[ INFO_MAX ] = my_max
    df_info_dict[ INFO_MEAN ] = my_mean
    df_info_dict[ INFO_MEDIAN ] = my_median
    df_info_dict[ INFO_STD ] = my_std
    df_info_dict[ INFO_DESCRIBE ] = my_describe

    if ( debug_flag == True ):

        # render output
        status_string = "- count: {my_count}\n- min: {my_min}\n- max: {my_max}\n- mean: {my_mean}\n- median: {my_median}\n- std: {my_std}".format(
            my_count = df_info_dict.get( INFO_COUNT ),
            my_min = df_info_dict.get( INFO_MIN ),
            my_max = df_info_dict.get( INFO_MAX ),
            my_mean = df_info_dict.get( INFO_MEAN ),
            my_median = df_info_dict.get( INFO_MEDIAN ),
            my_std = df_info_dict.get( INFO_STD )
        )
        print( status_string )

    #-- END debug --#

    info_OUT = df_info_dict

    return info_OUT

#-- END function get_variable_info() --#

print( "function get_variable_info() defined at {}".format( datetime.datetime.now() ) )
# CONSTANTS-ish
ROUND_UP = "up"
ROUND_DOWN = "down"

import math
import pandas

# loop, grabbing subsequent data frame subsets of size bin_size
# each time through, until last bin, then grab from offset
# to the end.
def process_bins( df_IN, bin_count_IN, sample_size_IN, do_round_IN = ROUND_UP, debug_flag_IN = False ):

    # return reference
    sample_df_OUT = None

    # declare variables
    me = "process_bins"
    debug_flag = None
    status_message = None

    # declare variables - bin size
    df_row_count = None
    bin_count = None
    bin_size_exact = None
    bin_size = None
    do_round = None

    # declare variables - sample per bin
    sample_size = None
    sample_per_bin_exact = None
    sample_per_bin = None

    # declare variables - process bins
    bin_index = None
    bin_number = None
    df_offset = None
    df_start_index = None
    df_end_index = None
    current_bin_df = None
    current_bin_info = None
    sample_count = None
    sample_this_many = None
    current_bin_sample_df = None

    # init
    debug_flag = debug_flag_IN
    df_row_count = len( df_IN )
    bin_count = bin_count_IN
    sample_size = sample_size_IN
    do_round = do_round_IN
    sample_count = 0

    #==========================================================================#
    # bin size
    #==========================================================================#

    # divide the row count by bin_count to get bin size
    bin_size_exact = df_row_count / bin_count

    # round up or round down?
    if ( do_round == ROUND_UP ):

        # use math.ceil() to round up.
        bin_size = math.ceil( bin_size_exact )

    else:

        # either ROUND_DOWN or none set.
        bin_size = math.floor( bin_size_exact )

    #-- END check to see how we round. --#

    # finally, convert to integer.
    bin_size = int( bin_size )

    if ( debug_flag == True ):

        status_message = "break {row_count} rows into {bin_count} bins: bin size = {bin_size} ( exact: {exact_size} )".format(
            row_count = df_row_count,
            bin_count = bin_count,
            bin_size = bin_size,
            exact_size = bin_size_exact
        )
        print( status_message )

    #-- END DEBUG --#

    #==========================================================================#
    # sample per bin
    #==========================================================================#

    # divide sample size by bin count to get number we select per bin
    sample_per_bin_exact = sample_size / bin_count

    # for decimals, round down, so we oversample from last bin (outliers).
    sample_per_bin = math.floor( sample_per_bin_exact )

    # finally, convert to integer.
    sample_per_bin = int( sample_per_bin )

    if ( debug_flag == True ):

        status_message = "sample {per_bin} rows from each of {bin_count} bins ( exact: {exact_size} )".format(
            per_bin = sample_per_bin,
            bin_count = bin_count,
            exact_size = sample_per_bin_exact
        )
        print( status_message )

    #-- END DEBUG --#

    #==========================================================================#
    # process bins
    #==========================================================================#

    # loop over bins
    sample_count = 0
    for bin_index in range( 0, bin_count ):

        # init - calculate offset
        df_offset = bin_index * bin_size
        df_start_index = df_offset
        df_end_index = df_offset + bin_size

        # are we at the end?
        bin_number = bin_index + 1

        if ( debug_flag == True ):

            status_message = "\n\n----> BIN #{bin_number} - offset: {my_offset}; start: {start_index}; end: {end_index}; bin index: {my_index}".format(
                bin_number = bin_number,
                my_offset = df_offset,
                start_index = df_start_index,
                end_index = df_end_index,
                my_index = bin_index
            )
            print( status_message )

        #-- END DEBUG --#

        if ( bin_number == bin_count ):

            # last bin - get from offset to the end.
            current_bin_df = df_IN[ df_start_index : ]

            # sample as many as needed to get to total.
            sample_this_many = sample_size - sample_count

        else:

            # not last bin - get next bin_size past offset.
            current_bin_df = df_IN[ df_start_index : df_end_index ]

            # sample as many as needed from current bin...
            sample_this_many = sample_per_bin

        #-- END check if last bin or not --#

        # sample
        current_bin_sample_df = current_bin_df.sample( n = sample_this_many )
        print( "\n\nSample from bin #{}".format( bin_number ) )
        print( current_bin_sample_df )

        # first bin?
        if ( sample_df_OUT is None ):

            # first bin - just store the sample DataFrame
            sample_df_OUT = current_bin_sample_df

        else:

            # append sample DataFrame to output DataFrame
            # ( DataFrame.append() was removed in pandas 2.0, so use concat() )
            sample_df_OUT = pandas.concat( [ sample_df_OUT, current_bin_sample_df ], ignore_index = True )

        #-- END check if first bin or not. --#

        # update total.
        sample_count += sample_this_many

    #-- END loop over bins --#

    return sample_df_OUT

#-- END function process_bins() --#

print( "function process_bins() defined at {}".format( datetime.datetime.now() ) )